Developed with MetaController and based on https://github.com/everpeace/kube-openmpi and https://github.com/kubeflow/mpi-operator.
This MPI Kubernetes Operator provides a Kubernetes-native interface for building MPI clusters and running jobs.
First, you must have MetaController installed:
make metacontroller
Next, deploy the Operator:
make deploy
An MPI cluster relies on a base image that encapsulates the MPI application dependencies and facilitates the MPI communication. An example of this is the included mpibase image, which can be built and pushed using:
make build_mpibase && make push_mpibase
You can use the default images on Docker Hub. Otherwise, configure your own Docker registry details by setting appropriate values for:
PULL_SECRET = "gitlab-registry"
GITLAB_USER = you
REGISTRY_PASSWORD = your-registry-password
GITLAB_USER_EMAIL = "[email protected]"
CI_REGISTRY = gitlab.somewhere.com
CI_REPOSITORY = repository/uri
MPIBASE_IMAGE = $(CI_REGISTRY)/$(CI_REPOSITORY)/mpibase:latest
Set these values in PrivateRules.mak.
Launch the helloworld job:
make test
Once everything starts, the logs are available in the launcher pod.
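One possible way to get at those logs with kubectl is sketched here. The helper picks the launcher pod out of a pod listing, and the usage line follows its logs. It assumes the launcher pod's name contains the string "launcher", which is a guess rather than something this README confirms; find_launcher is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read pod names (one per line, as printed by
# `kubectl get pods -o name`) on stdin and print the first one whose
# name contains "launcher". The naming pattern is an assumption.
find_launcher() {
    awk '/launcher/ { print; exit }'
}

# Usage (requires kubectl and a running job):
#   kubectl logs -f "$(kubectl get pods -o name | find_launcher)"
```

If the launcher pod is named differently in your deployment, adjust the pattern or simply pass the pod name to kubectl logs directly.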
The CRD for MPIJobs has two parameters: replicas (int) and daemons (boolean). Specifying only replicas leaves it to the scheduler to decide where to place the worker pods on the cluster. If daemons is also set to true (see mpi-test-demons.yaml), Pod AntiAffinity rules are applied and the Kubernetes scheduler forces the workers onto individual nodes, if available.
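To see whether the anti-affinity placement had the intended effect, something like the following sketch may help. It reads pod/node pairs and reports any node that hosts more than one pod. shared_nodes is a hypothetical helper name, and the usage line lists every pod in the namespace, so you would likely want to add a selector for the worker pods (the labels the operator applies are not documented here).

```shell
#!/bin/sh
# Hypothetical helper: read "POD NODE" pairs on stdin and print each node
# that hosts more than one pod, followed by its pod count. Unscheduled pods
# (node shown as "<none>") are ignored. No output means every scheduled pod
# sits on its own node.
shared_nodes() {
    awk '$2 != "" && $2 != "<none>" { count[$2]++ }
         END { for (n in count) if (count[n] > 1) print n, count[n] }'
}

# Usage (requires kubectl):
#   kubectl get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | shared_nodes
```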
initContainers check the availability of the workers before the launcher executes, so any Pods stuck in Pending are dropped from the worker list.
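The effect of that filtering can be illustrated with a short sketch. This is not the operator's actual initContainer logic, only an approximation of the outcome: given pod/phase pairs, keep the pods that are Running and drop the rest. running_workers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "POD PHASE" pairs on stdin and print only the
# names of pods whose phase is Running, one per line. Pods in Pending (or
# any other phase) are left out.
running_workers() {
    awk '$2 == "Running" { print $1 }'
}

# Usage (requires kubectl):
#   kubectl get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | running_workers
```

A job started with fewer schedulable nodes than requested replicas would therefore run on a shorter worker list rather than wait indefinitely.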