bouthilx / cumulus
Manage a cloud of clusters
Test with
(Copy of bouthilx/research#49)
Hades seems to have a faulty GPU several times a week. It is very frustrating because it empties the list of queued jobs by crashing them all on the faulty GPU. If this kind of error could be detected, we could requeue the job rather than marking it as failed. This will still cause a lot of useless operations, because the jobs will keep being thrown to the faulty GPU until a working one is free.
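A minimal sketch of the detection idea, assuming the fault can be recognised from the job's stderr. The patterns, the status names and the `status_after_crash` helper are hypothetical, not part of cumulus. The retry cap is one possible way to limit the useless resubmissions mentioned above.

```python
import re

# Hypothetical status names; the text only mentions queued and failed jobs.
QUEUED, FAILED = "QUEUED", "FAILED"

# Assumed error signatures of a faulty GPU; adjust to what the logs really contain.
GPU_FAULT_PATTERNS = [
    re.compile(r"CUDA error", re.IGNORECASE),
    re.compile(r"unspecified launch failure", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
]


def is_gpu_fault(stderr_text):
    """Return True if the job's stderr matches a known GPU fault signature."""
    return any(p.search(stderr_text) for p in GPU_FAULT_PATTERNS)


def status_after_crash(stderr_text, retries, max_retries=3):
    """Requeue crashes caused by a GPU fault, up to max_retries times."""
    if is_gpu_fault(stderr_text) and retries < max_retries:
        return QUEUED
    return FAILED
```

A crash with an unrelated traceback still ends up as failed, so genuine bugs are not requeued forever.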
Later on:
Test with
Just as Cumulus is a composite of Clusters, there should be a Project which is a composite of experiment Databases.
(Related to bouthilx/research#19)
Provide a way for jobs to mark themselves as interrupted. Maybe provide the interruption handler itself, with support for callbacks. The internal machinery would then take care of marking the job as interrupted in the database. The deploy command should look for both queued and interrupted jobs, with an option to restrict the selection to either of them (cumulus deploy [some options] --restrict-to QUEUED). The database backend will take care of converting the status from QUEUED to whatever the backend uses. The standard statuses to convert from are defined in AbstractDatabase.
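The interruption handler with callbacks could look roughly like this. It is only a sketch: `InterruptionHandler`, the `INTERRUPTED` status string and the dict-like `db` are assumptions made for illustration, and the real backend would convert the status as described above.

```python
import signal

INTERRUPTED = "INTERRUPTED"  # hypothetical standard status name


class InterruptionHandler:
    """Run user callbacks, then mark the job as interrupted in the database."""

    def __init__(self, db, job_id, signals=(signal.SIGTERM, signal.SIGINT)):
        self.db = db  # assumed: any mapping job_id -> job document
        self.job_id = job_id
        self.signals = signals
        self.callbacks = []

    def register(self, callback):
        """Add a callback (e.g. save a checkpoint) to run on interruption."""
        self.callbacks.append(callback)

    def install(self):
        """Attach the handler to the configured signals."""
        for sig in self.signals:
            signal.signal(sig, self.handle)

    def handle(self, signum, frame):
        for callback in self.callbacks:
            callback()
        self.db[self.job_id]["status"] = INTERRUPTED
        raise SystemExit(1)
```

A job would call `register` for each cleanup step and then `install` once at start-up; deploy can later pick up the job again because its status is no longer running.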
Use db[job_id] to filter the monitor output (qstat/squeue) and print the jobs in the queue (R, Q, C, H) grouped by experiment.
cumulus monitor --group-by [experiment|cluster]
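A possible shape for the grouping step, assuming the scheduler output has already been parsed into a mapping from job id to state letter. The `group_queue` function and the document keys `experiment` and `cluster` are hypothetical.

```python
from collections import Counter, defaultdict


def group_queue(scheduler_states, db, group_by="experiment"):
    """Count jobs per state, grouped by experiment or cluster.

    scheduler_states: job_id -> state letter (R, Q, C, H) parsed from qstat/squeue.
    db: job_id -> document with 'experiment' and 'cluster' keys (assumed layout).
    """
    if group_by not in ("experiment", "cluster"):
        raise ValueError("group_by must be 'experiment' or 'cluster'")
    groups = defaultdict(Counter)
    for job_id, state in scheduler_states.items():
        if job_id not in db:
            # Jobs unknown to the database were not submitted by cumulus.
            continue
        groups[db[job_id][group_by]][state] += 1
    return {name: dict(counts) for name, counts in groups.items()}
```

Looking up each job id in the database is what filters out jobs in the queue that belong to someone else.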
Needs to be tested thoroughly.
Test cases:
Minimize the number of connections by avoiding useless connections on deploy when there are no experiments to run.
(Copy of bouthilx/research#37)
Goals:
Minimize manipulation of code on clusters
Allow experiments to run on different code versions simultaneously
Timeline:
Plan
Risks:
Duplicating sacred's tracking of code versioning
Adding too much overhead to each experiment
Hard to debug installs
cumulus setup
None
cumulus deploy
status - To select jobs to submit
cumulus watch
job_id - To combine with squeue maybe?
status - To print
cluster - To print
node - To print
cumulus cancel
status - To update status to CANCELLED
job_id - To update status to CANCELLED
cumulus-particle local
repos - Copy repositories
job_id - Set job_id
cluster - Set cluster
node - Set node
status - Set status (or leave it to executed command - like sacred)
Right now, the connections are made in pool.async_call, so self.connected_hostname and self.sshtunnel only exist inside that child process. This means every operation on a cluster in a subsequent child process will initiate a new connection to the server too. Commands like deploy and cancel are combinations of remote operations, so they initiate many useless server connections.
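One way to avoid the repeated connections would be to hand the whole sequence of remote operations to a single child-process call, so the connection is opened once and shared. The sketch below only illustrates that idea: `run_batch`, `connect` and the operation callables are hypothetical, and the connection object is assumed to expose a `close()` method.

```python
def run_batch(connect, hostname, operations):
    """Open one connection and run every operation through it.

    connect: callable returning a connection object for the given hostname.
    operations: callables that each take the open connection and return a result.
    """
    connection = connect(hostname)
    try:
        return [operation(connection) for operation in operations]
    finally:
        # Close even if an operation raises, so no tunnel is left dangling.
        connection.close()
```

With this shape, a command like deploy would submit one `run_batch` call per cluster instead of one call per remote operation.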
Database is used both for cumulus' logs and for experiments.
There are 2 types of backends
Could support different backends for different bridges. Ex: cumulus' logs are stored in MongoDB but experiments are stored in SQL.
NOTE: Only support MongoDB for now
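Supporting a different backend per bridge could be sketched as a small router. Everything here is hypothetical: `DatabaseRouter`, the bridge names and the in-memory stand-in backend exist only to show the dispatch, and a real setup would plug in MongoDB or SQL backends instead.

```python
class InMemoryBackend:
    """Stand-in for a real backend (MongoDB, SQL, ...), for illustration only."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def insert(self, document):
        self.rows.append(dict(document))


class DatabaseRouter:
    """Send each bridge's documents to the backend configured for it."""

    def __init__(self, backends, default):
        self.backends = backends  # bridge name -> backend
        self.default = default    # used when a bridge has no dedicated backend

    def backend_for(self, bridge):
        return self.backends.get(bridge, self.default)

    def insert(self, bridge, document):
        self.backend_for(bridge).insert(document)
```

With a single MongoDB backend as the default and no per-bridge entries, this reduces to the current behaviour.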
(Copy of bouthilx/research#16)
(Related to #19)
The file system is horribly slow in the lab. Better to keep stuff on a local (H|S)DD. Should write a script to keep stuff synced between /data/lisatmp4 and /Tmp. Could apply to clusters too.