GithubHelp home page GithubHelp logo

kreczko / simple-htcondor-cluster-on-kubernetes Goto Github PK

View Code? Open in Web Editor NEW

This project forked from alahiff/simple-htcondor-cluster-on-kubernetes

0.0 2.0 0.0 26 KB

Demonstration of a simple HTCondor cluster running on Kubernetes

Shell 100.00%

simple-htcondor-cluster-on-kubernetes's Introduction

A simple HTCondor cluster on Kubernetes

Setting up the cluster

Preparing the pool password

We will use a pool password (i.e. shared secret) to secure the HTCondor cluster. Create a random pool password with filename password in the /tmp directory:

docker run -it -v /tmp:/vol alahiff/htcondor-generate-password:latest

This will create a file /tmp/password. Create a secret from this new pool password:

kubectl create secret generic htcondor-pool-password --from-file=/tmp/password

Deploy the central manager, schedd and a worker node

Firstly deploy the central manager:

kubectl create -f htcondor-central-manager-service.yaml
kubectl create -f htcondor-central-manager-deployment.yaml

Once the central manager is running, i.e.

# kubectl get deployment htcondor-central-manager
NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
htcondor-central-manager   1         1         1            1           17h

deploy the schedd and worker node:

kubectl create -f htcondor-schedd-deployment.yaml
kubectl create -f htcondor-worker-deployment.yaml

After a little while you should see 3 running pods:

# kubectl get pods
NAME                                        READY     STATUS    RESTARTS   AGE
htcondor-central-manager-2538661822-1q8pl   1/1       Running   0          17h
htcondor-schedd-308862252-r6xkj             1/1       Running   0          16h
htcondor-worker-123133369-tdb70             1/1       Running   0          16h

Next steps

Check that all HTCondor components (collector, negotiator, schedd, and startd) are working correctly:

# kubectl exec htcondor-central-manager-2538661822-1q8pl -i -t -- condor_status -any
MyType             TargetType         Name

Collector          None               My Pool - htcondor-central-manager-253866
DaemonMaster       None               htcondor-central-manager-2538661822-1q8pl
Negotiator         None               htcondor-central-manager-2538661822-1q8pl
Scheduler          None               htcondor-schedd-308862252-r6xkj
DaemonMaster       None               htcondor-schedd-308862252-r6xkj
DaemonMaster       None               htcondor-worker-123133369-tdb70
Machine            Job                slot1@htcondor-worker-123133369-tdb70
Machine            Job                slot2@htcondor-worker-123133369-tdb70

Try scaling the number of worker nodes, e.g.

# kubectl scale --replicas 2 deployment/htcondor-worker
deployment "htcondor-worker" scaled

You should then see that there are 2 worker nodes:

# kubectl exec htcondor-central-manager-2538661822-1q8pl -i -t -- condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@htcondor-wor LINUX      X86_64 Unclaimed Idle      0.710 1895  0+00:00:04
slot2@htcondor-wor LINUX      X86_64 Unclaimed Idle      0.000 1895  0+00:00:29
slot1@htcondor-wor LINUX      X86_64 Unclaimed Idle      0.000 1895  0+16:25:51
slot2@htcondor-wor LINUX      X86_64 Unclaimed Idle      0.000 1895  0+16:25:07
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        4     0       0         4       0          0

               Total        4     0       0         4       0          0

Try killing the central manager:

# kubectl delete pod htcondor-central-manager-2538661822-1q8pl
pod "htcondor-central-manager-2538661822-1q8pl" deleted

A new central manager pod will automatically be created:

# kubectl get deployments,pods
NAME                              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/htcondor-central-manager   1         1         1            0           18h
deploy/htcondor-schedd            1         1         1            1           16h
deploy/htcondor-worker            2         2         2            2           16h

NAME                                           READY     STATUS              RESTARTS   AGE
po/htcondor-central-manager-2538661822-1q8pl   1/1       Terminating         0          18h
po/htcondor-central-manager-2538661822-g3ngm   0/1       ContainerCreating   0          5s
po/htcondor-schedd-308862252-r6xkj             1/1       Running             0          16h
po/htcondor-worker-123133369-lc41z             1/1       Running             0          30m
po/htcondor-worker-123133369-tdb70             1/1       Running             0          16h

Initially the new central manager won't know about the schedd or worker nodes:

# kubectl exec htcondor-central-manager-2538661822-g3ngm -i -t -- condor_status -any
MyType             TargetType         Name

Collector          None               My Pool - htcondor-central-manager-253866
DaemonMaster       None               htcondor-central-manager-2538661822-g3ngm
Negotiator         None               htcondor-central-manager-2538661822-g3ngm

but after a while the schedd and worker node(s) will update the new central manager:

# kubectl exec htcondor-central-manager-2538661822-g3ngm -i -t -- condor_status -any
MyType             TargetType         Name

Collector          None               My Pool - htcondor-central-manager-253866
DaemonMaster       None               htcondor-central-manager-2538661822-g3ngm
Negotiator         None               htcondor-central-manager-2538661822-g3ngm
Scheduler          None               htcondor-schedd-308862252-r6xkj
DaemonMaster       None               htcondor-schedd-308862252-r6xkj
DaemonMaster       None               htcondor-worker-123133369-lc41z
Machine            Job                slot1@htcondor-worker-123133369-lc41z
Machine            Job                slot2@htcondor-worker-123133369-lc41z
DaemonMaster       None               htcondor-worker-123133369-tdb70
Machine            Job                slot1@htcondor-worker-123133369-tdb70
Machine            Job                slot2@htcondor-worker-123133369-tdb70

Separated collector and negotiator

Instead of having a single central manager pod, consisting of both a collector and negotiator, we can separate the central manager into two pods. After creating a pool password as descibed above, we firstly create a service for the collector, then create deployments for the collector and negotiator pods:

kubectl create -f v2/htcondor-collector-service.yaml
kubectl create -f v2/htcondor-collector-deployment.yaml
kubectl create -f v2/htcondor-negotiator-deployment.yaml

Now create a schedd and worker node:

kubectl create -f  v2/htcondor-schedd-deployment.yaml
kubectl create -f  v2/htcondor-worker-deployment.yaml

You should see

kubectl get deployments,pods
NAME                         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/htcondor-collector    1         1         1            1           1h
deploy/htcondor-negotiator   1         1         1            1           1h
deploy/htcondor-schedd       1         1         1            1           1h
deploy/htcondor-worker       1         1         1            1           1h

NAME                                      READY     STATUS    RESTARTS   AGE
po/htcondor-collector-4216735924-d9nq5    1/1       Running   0          1h
po/htcondor-negotiator-1710181096-1stkb   1/1       Running   0          1h
po/htcondor-schedd-1411924804-9sdqr       1/1       Running   0          1h
po/htcondor-worker-706141080-z5qld        1/1       Running   0          1h

Note that there is no point scaling the number of collector pods with the configuration in use (this won't help with scaling or redunancy).

We can also check that all the expected HTCondor daemons are running:

kubectl exec htcondor-collector-4216735924-d9nq5 -i -t -- condor_status -any
MyType             TargetType         Name

Collector          None               My Pool - htcondor-collector-4216735924-d
DaemonMaster       None               htcondor-collector-4216735924-d9nq5
Negotiator         None               NEGOTIATOR
DaemonMaster       None               htcondor-negotiator-1710181096-1stkb
Scheduler          None               htcondor-schedd-1411924804-9sdqr
DaemonMaster       None               htcondor-schedd-1411924804-9sdqr
DaemonMaster       None               htcondor-worker-706141080-z5qld
Machine            Job                slot1@htcondor-worker-706141080-z5qld
Accounting         none               <none>

simple-htcondor-cluster-on-kubernetes's People

Contributors

alahiff avatar

Watchers

James Cloos avatar Luke Kreczko avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.