The cumulus from bouthilx

cumulus's Issues

Implement --summarize option for MONITOR command

Test with

Hades (qstat)
Helios (qstat/showq)
Graham (squeue)

Detect GPU failure on import and requeue job

(Copy of bouthilx/research#49)

Hades seems to have a faulty GPU several times a week. This is very frustrating because it empties the list of queued jobs by crashing them all on the faulty GPU. If this kind of error could be detected, we could requeue the job rather than marking it as failed. This will still cause a lot of useless operations, because the jobs will keep being sent to the faulty GPU until a working one is free.
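
A minimal sketch of the detection step, assuming the job's stderr and exit code are available once it ends. The function name, the pattern list and the COMPLETED/FAILED status names are hypothetical (only QUEUED appears elsewhere in these issues), and the error strings are illustrative examples that would need to be checked against what Hades actually prints.

```python
import re

# Illustrative patterns only; the real messages depend on the driver
# and on the framework used by the experiment.
GPU_FAILURE_PATTERNS = [
    re.compile(r"CUDA_ERROR_\w+"),
    re.compile(r"no CUDA-capable device", re.IGNORECASE),
]


def next_status(stderr_text, exit_code):
    """Decide which status a finished job should get (hypothetical helper)."""
    if exit_code == 0:
        return "COMPLETED"
    if any(p.search(stderr_text) for p in GPU_FAILURE_PATTERNS):
        # The crash looks like a GPU problem, so requeue instead of failing.
        return "QUEUED"
    return "FAILED"
```

A job that dies with a message matching one of the patterns goes back to QUEUED; any other non-zero exit is still marked as failed.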

Integrate code quality and continuous integration apps

Later on:

Drop either CircleCI or Travis based on experience
Maybe drop codecov or codacy if one is enough

Add CANCEL command

Test with

Hades (qdel)
Helios (qdel but with qstat-showq)
Graham (scancel)

Implement Project to group different experiments

Just as Cumulus is a composite of Clusters, there should be a Project which is a composite of experiment Databases.
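
One possible shape for such a composite, as a sketch only. The Project class below and its jobs method are hypothetical, and each experiment database is stood in for by a plain list of job dicts.

```python
class Project:
    """Hypothetical composite over several experiment databases."""

    def __init__(self, databases):
        # Maps an experiment name to its database (here, any iterable of jobs).
        self.databases = dict(databases)

    def jobs(self, status=None):
        """Yield (experiment_name, job) pairs, optionally filtered by status."""
        for name, database in self.databases.items():
            for job in database:
                if status is None or job["status"] == status:
                    yield name, job
```

Querying the Project then looks the same as querying a single experiment, which is the point of the composite.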

Turn ssh (paramiko) remote commands into sockets

Provide tools for job interruption

(Related to bouthilx/research#19)

Provide a way for jobs to mark themselves as interrupted. Maybe provide the interruption handler itself, with support for callbacks. The internal machinery would then take care of marking the job as interrupted in the database. The deploy command should look for both queued and interrupted jobs, with an option to force a selection of either of them (cumulus deploy [some options] --restrict-to QUEUED). The database backend will take care of converting the status from QUEUED to whatever the backend uses. The standard statuses to convert from are defined in AbstractDatabase.
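
A rough sketch of what the handler with callbacks could look like, using the standard signal module. The InterruptionHandler class, its method names and the mark_interrupted hook are all hypothetical; the hook stands in for whatever the database backend would do to set the interrupted status.

```python
import signal


class InterruptionHandler:
    """Hypothetical handler: runs user callbacks, then marks the job."""

    def __init__(self, mark_interrupted):
        # mark_interrupted: callable that updates the job status in the database.
        self.mark_interrupted = mark_interrupted
        self.callbacks = []
        self.interrupted = False

    def add_callback(self, callback):
        """Register a callable taking the signal number, e.g. to save a checkpoint."""
        self.callbacks.append(callback)

    def install(self, signals=(signal.SIGTERM,)):
        """Attach the handler to the given signals (main thread only)."""
        for signum in signals:
            signal.signal(signum, self.handle)

    def handle(self, signum, frame):
        self.interrupted = True
        for callback in self.callbacks:
            callback(signum)
        self.mark_interrupted()
```

The job would call install() once at start-up. Which signal the scheduler sends before killing a job depends on the cluster, so SIGTERM is only an assumed default here.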

Add SETUP command to automatically install/update cumulus on clusters along with experiments

Monitoring grouped by cluster or experiment

Use db[job_id] to filter the monitor output (qstat/squeue) and print the jobs in the queue (R, Q, C, H) grouped by experiment.

cumulus monitor --group-by [experiment|cluster]
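
A sketch of the grouping, assuming the scheduler output has already been parsed into (job_id, state) pairs and that db maps a job_id to a record with experiment and cluster fields. The group_jobs name and the record layout are hypothetical.

```python
from collections import defaultdict


def group_jobs(queue_rows, db, key):
    """Group scheduler rows by db[job_id][key], then by queue state.

    queue_rows: iterable of (job_id, state) parsed from qstat/squeue.
    db: mapping job_id -> record; jobs missing from it are skipped.
    key: "experiment" or "cluster".
    """
    groups = defaultdict(lambda: defaultdict(list))
    for job_id, state in queue_rows:
        record = db.get(job_id)
        if record is None:
            continue  # not a job tracked in the database
        groups[record[key]][state].append(job_id)
    return {group: dict(states) for group, states in groups.items()}
```

Jobs in the queue that are not in the database (other users' jobs, for example) are dropped by the lookup, which is the filtering described above.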

Add support to install experiments with SETUP command

Test with benchmarks

Write torque-cluster setups

Use TLS/SSL wrapper for socket objects

Add DEPLOY command

Needs to be tested thoroughly.

Test cases:

Make lazy Cluster connections

Minimize the number of connections by avoiding useless connections on deploy if there are no experiments to run.
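
A minimal sketch of the lazy behaviour. LazyCluster, the connect factory and the submit method are hypothetical stand-ins, not the existing Cluster API; the factory would wrap whatever opens the real ssh connection.

```python
class LazyCluster:
    """Hypothetical cluster wrapper that connects only on first use."""

    def __init__(self, hostname, connect):
        self.hostname = hostname
        self._connect = connect  # factory: hostname -> connection object
        self._connection = None

    @property
    def connection(self):
        # Open the connection the first time it is needed, then reuse it.
        if self._connection is None:
            self._connection = self._connect(self.hostname)
        return self._connection

    def deploy(self, experiments):
        """Submit experiments; never touches the connection if there are none."""
        if not experiments:
            return 0
        for experiment in experiments:
            self.connection.submit(experiment)
        return len(experiments)
```

With this shape, a deploy over many clusters only connects to the ones that actually have something to run.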

Automatic local install for each experiment

(Copy of bouthilx/research#37)

Goals:

Minimize manipulation of code on clusters
Allow experiments to run on different code versions simultaneously

Timeline:

Plan

Risks:

Duplicating sacred's tracking of code versioning
Adding too much overhead to each experiment
Hard to debug installs

Test Slurm Scheduler class properly

Integrate Database interface

Needs

cumulus setup
  None
cumulus deploy
  status - To select jobs to submit
cumulus watch
  job_id - To combine with squeue maybe?
  status - To print
  cluster - To print
  node - To print
cumulus cancel
  status - To update status to CANCELLED
  job_id - To update status to CANCELLED
cumulus-particle local
  repos - Copy repositories
  job_id - Set job_id
  cluster - Set cluster
  node - Set node
  status - Set status (or leave it to executed command - like sacred)

Backends

Default
Sacred

Find a way to conserve the connection information

Right now, the connections are made in pool.async_call, so self.connected_hostname and self.sshtunnel only exist inside that child process. This means every operation on a cluster in a subsequent child process will initiate a new connection to the server too. Commands like deploy and cancel are combinations of remote operations, so they initiate many useless server connections.

Add database backend

The database is used both for cumulus' logs and for experiments.

There are 2 types of backends:

Which DB to use: MongoDB, DynamoDB, SQL, etc.
Bridges to experiment DBs from experiment managers like Sacred or Sumatra. There will be no support for experiments not stored in a DB.

Could support different backends for different bridges. Ex: cumulus' logs are stored in MongoDB but experiments are in SQL.

NOTE: Only support MongoDB for now

Write slurm-cluster setups

Graham
Cedar
Mila
Khala

Script to move code + dataset to fast drive

(Copy of bouthilx/research#16)
(Related to #19)

The file system is horribly slow in the lab. Better to keep stuff on a local (H|S)DD. Should write a script to keep stuff on /data/lisatmp4 and /Tmp in sync. Could apply to clusters too.
