Comments (10)
Any news here in the last two years?
from ipyparallel.
The plan is to put a nanny process next to each Engine, which would enable remote signalling, restarting, etc. This is a general plan for Jupyter kernels that will be extended to IPython parallel.
The tricky bit for IPython parallel is to not ruin cases like MPI, where the engine itself must be the main process and so cannot be started by the nanny. This means that either the engine starts the nanny the first time, or we special-case MPI somehow.
Right. MPI.
Well, at the moment, I can't see reliably handling restarts for MPI with any fewer than 3 processes. Kinda dumb, but here's the picture I have in mind...
We have to allow that there may be at least three computers involved in an MPI setup:
- LauncherComputer -- may shut down at any time; once command is run, this should be irrelevant
- ClusterNodeA -- a compute node, ipengine could live here
- ClusterNodeB -- a compute node, ipengine could live here
We want to be able to send keyboard interrupt signals to the engine, ergo, the nanny needs to be on the same node as the engine (correct me if I'm wrong). So at the very least, we would need a setup like this.
- engine (on ClusterNodeA, launched by mpiexec run on LauncherComputer)
- nanny (on ClusterNodeA, launched by engine)
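The reason the nanny has to share a node with the engine is that signals are delivered locally, by process ID. A minimal sketch of that relationship, assuming a POSIX system and using a throwaway subprocess as a stand-in for the engine (`ENGINE_CODE` and `interrupt_local_engine` are hypothetical names, not part of ipyparallel):

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for an engine: announces itself, then loops until interrupted.
ENGINE_CODE = """
import signal, time
signal.signal(signal.SIGINT, signal.default_int_handler)
try:
    print("ready", flush=True)
    while True:
        time.sleep(0.1)
except KeyboardInterrupt:
    print("interrupted", flush=True)
"""


def interrupt_local_engine():
    """Play the nanny: start a local engine, send it SIGINT, return what it printed."""
    engine = subprocess.Popen(
        [sys.executable, "-c", ENGINE_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the engine is inside its try block before signalling it.
    ready = engine.stdout.readline().strip()
    # This only works because both processes live on the same machine.
    engine.send_signal(signal.SIGINT)
    reply = engine.stdout.readline().strip()
    engine.wait(timeout=10)
    engine.stdout.close()
    return ready, reply
```

In the MPI case the parent/child direction is reversed (the engine launches the nanny), but the same-node constraint is the same.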
Now let's say we get a restart signal. We need to kill the engine and launch a new engine that is also part of the same MPI universe. We can do this with, e.g., MPI_Comm_spawn. Trouble is this: MPI may be subject to an arbitrary and cruel resource manager. It may decide to put the new engine on ClusterNodeB. In which case the nanny needs to live on ClusterNodeB. But it doesn't. Failbot.
To deal with this situation, we actually need a third process. The setup now looks like this:
- megananny (on ClusterNodeA, launched by mpiexec)
- engine (on ClusterNodeA, launched by MPI_Comm_spawn)
- micronanny (on ClusterNodeA, launched by engine)
Now if the megananny is told to keyboard-interrupt, it talks to the micronanny, which actually sends the SIGINT. If the megananny is told to restart, it creates an entirely new (micronanny, engine) pair, which might be on ClusterNodeA or might be on ClusterNodeB.
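To make the division of labour concrete, here is a toy model of the dispatch logic only. Nothing in it touches MPI or real signals, and every name (`Pair`, `MegaNanny`, `place`, the request strings) is hypothetical; `place` stands in for whatever the resource manager decides when a new pair is spawned.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Pair:
    """One (micronanny, engine) pair and the node it ended up on."""
    node: str
    generation: int
    signals: list = field(default_factory=list)  # signals the micronanny delivered


class MegaNanny:
    """Long-lived supervisor: forwards interrupts, replaces the pair on restart."""

    def __init__(self, place):
        self._place = place  # callable returning the node chosen for a spawn
        self._generation = itertools.count(1)
        self.pair = self._spawn_pair()

    def _spawn_pair(self):
        return Pair(node=self._place(), generation=next(self._generation))

    def handle(self, request):
        if request == "interrupt":
            # The micronanny is co-located with the engine, so it can signal it.
            self.pair.signals.append("SIGINT")
        elif request == "restart":
            # The old pair is discarded; the new one may land on another node.
            self.pair = self._spawn_pair()
        else:
            raise ValueError(f"unknown request: {request!r}")
        return self.pair
```

For example, with a placement function that hands out ClusterNodeA and then ClusterNodeB, an "interrupt" is delivered on ClusterNodeA, and a "restart" yields a fresh pair on ClusterNodeB with no help needed from the old node.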
Remarks
- One downside of this approach is that the engines will have to make a new intracommunicator (i.e. users can't depend on COMM_WORLD). However, I cannot see any way of avoiding this; if you want to be able to start new processes, you need to use some kind of spawn or join. That will create intercommunicators, which need to get merged into intracommunicators. So we'll want to insert some variable COMM_IPWORLD into the namespace, so you can replace MPI.COMM_WORLD.Allreduce with COMM_IPWORLD.Allreduce.
- There are certainly situations in which one can guarantee an MPI process will start on the same host. In this case you wouldn't need the megananny. This may even be the common case; I'm not terribly well acquainted with "standard practice." I could do a little survey of the supercomputers I have access to and check. But there are definitely situations in which I don't know how you could make such a guarantee...
- I've never actually tried to kill a single node of an intracommunicator forged by repeatedly spawning and merging. It's possible something will explode.
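The spawn-and-merge step from the first remark might look roughly like this with mpi4py. This is a sketch under assumptions: mpi4py is not mentioned in this thread, COMM_IPWORLD is only the name proposed above, and `grow_ipworld` / `join_ipworld` are hypothetical helpers. It does not address the kill-one-member case from the last remark.

```python
import sys


def grow_ipworld(comm_ipworld, engine_script):
    """Spawn one new engine and return the enlarged intracommunicator.

    Collective: every rank of comm_ipworld must call this together.
    """
    inter = comm_ipworld.Spawn(
        sys.executable, args=[engine_script], maxprocs=1, root=0
    )
    # Spawn yields an intercommunicator; Merge turns it into an
    # intracommunicator. high=False orders existing ranks before the new one.
    return inter.Merge(high=False)


def join_ipworld():
    """Run inside the spawned engine: merge with the parents that spawned it."""
    # Imported lazily so this module loads on machines without an MPI runtime.
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()
    if parent == MPI.COMM_NULL:
        # Started directly by mpiexec, not by Spawn: nothing to merge.
        return MPI.COMM_WORLD
    return parent.Merge(high=True)
```

An engine would bind the returned communicator to COMM_IPWORLD in its namespace, so user code calls COMM_IPWORLD.Allreduce where it would otherwise call MPI.COMM_WORLD.Allreduce.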
...in conclusion, I hope MPI doesn't block progress on this. When they introduced "dynamic process management" in MPI-2.0, I don't believe they were thinking of a scenario where a single worker could restart.
At the bare minimum, every time an MPI process restarts we will need to destroy an old intracommunicator and inject a new one into the ipengine namespace. If users have data structures referencing the old intracommunicator, those will become invalid. Which could be a bit confusing for users :). But perhaps somebody with more MPI-foo can come along and prove me wrong!
In other news, let me know if there's a useful way I could contribute to the kernel-nanny architecture for Jupyter.
I'm 100% okay with MPI engines not being allowed to be restarted, that's not a problem. It's just the parent/child relationship that's an issue, because the mpiexec'd process must be the actual kernel, not the nanny.
Cool. That makes sense.
The engine restart feature would be really useful. E.g. I use Theano on a cluster. Once I import theano, a GPU is assigned to the importing process. The only way (that I know of) to "free" the GPU again is to terminate or restart the process.
Any news in the last 4 years?
import ipyparallel as ipp
client = ipp.Client()
client.shutdown(targets=range(10, 24), restart=True)
NotImplementedError: Engine restart is not yet implemented
#463 lays the groundwork for this to be possible