Comments (4)
Hello,
MPI is not required to run QP on multiple nodes, because the communication goes through the ZeroMQ library over TCP sockets.
You first need to run a standard single-node calculation:
qp_run fci <EZFIO>
This will be the "master" run. It opens a ZeroMQ socket at an address and port number stored in the <EZFIO>/work/qp_run_address file. It should look like: tcp://192.168.1.91:47279
On another machine, you can run a "slave" calculation that will connect to the master to accelerate it:
qp_run --slave fci <EZFIO>
If the file system is shared, the slave calculation will read the qp_run_address file to get the address and port number of the master qp_run and attach to it.
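As a rough illustration of the lookup described above, the following sketch reads the address file and splits it into host and port. The function names `read_qp_address` and `parse_qp_address` are hypothetical helpers invented for this example and are not part of QP; only the file location and the tcp://host:port format come from the description above.

```shell
#!/bin/sh
# Hypothetical helper (not part of QP): print the address that the master
# wrote to <EZFIO>/work/qp_run_address.
read_qp_address() {
    address_file="$1/work/qp_run_address"
    if [ ! -r "$address_file" ]; then
        echo "cannot read $address_file (is the master running?)" >&2
        return 1
    fi
    # Keep the first line only and strip surrounding whitespace.
    head -n 1 "$address_file" | tr -d '[:space:]'
}

# Hypothetical helper (not part of QP): split tcp://host:port into "host port".
parse_qp_address() {
    address=$1
    case "$address" in
        tcp://*:*) ;;
        *) echo "unexpected address format: $address" >&2; return 1 ;;
    esac
    hostport=${address#tcp://}   # drop the tcp:// prefix
    host=${hostport%:*}          # everything before the last colon
    port=${hostport##*:}         # everything after the last colon
    case "$port" in
        ''|*[!0-9]*) echo "invalid port in: $address" >&2; return 1 ;;
    esac
    [ -n "$host" ] || return 1
    printf '%s %s\n' "$host" "$port"
}
```

Running `parse_qp_address "$(read_qp_address <EZFIO>)"` on a slave node is a quick way to confirm that the shared file system exposes the master's address before starting the slave.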
You can run as many slaves as you want, and you can start them at any time.
If you want to use multiple slaves, then it is worth using MPI for the slave process:
mpirun qp_run --slave fci <EZFIO>
In this mode, only rank zero of the slave will make the ZeroMQ connection to the master, and the common data will be propagated to all the other slaves using an MPI broadcast, which is much faster than doing multiple ZeroMQ communications.
If you look at the qp_srun script, you will see that it does exactly that:
srun -N 1 -n 1 qp_run $PROG $INPUT & # Runs the master
srun -n $((${SLURM_NTASKS}-1)) \
qp_run --slave $PROG $INPUT > $INPUT.slaves.out # Runs N-1 slaves as a single MPI run
Warning: Only the Davidson diagonalization and PT2 selection/perturbation take advantage of multi-node parallelism.
So to answer your question, the only thing you need is to make it possible for the slaves to connect to the master. The simplest way is to put them on the same network. If you can't do that, you can run qp_tunnel instances on each machine on the path between the slave and the master, and the TCP packets will be forwarded from one network to the next.
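Before starting a slave, it can help to check that the master's port is reachable from the slave's machine. The sketch below is only one way to do this: it assumes the `nc` (netcat) utility is installed, and `can_reach` is a hypothetical helper name, not a QP command.

```shell
#!/bin/sh
# Hypothetical helper (not part of QP): return 0 if a TCP connection to
# host:port can be opened within 3 seconds, non-zero otherwise.
# Assumes the nc (netcat) utility is available on the machine.
can_reach() {
    host=$1
    port=$2
    if ! command -v nc >/dev/null 2>&1; then
        echo "nc not found; cannot test $host:$port" >&2
        return 2
    fi
    # -z: only probe the port, send no data; -w 3: give up after 3 seconds.
    nc -z -w 3 "$host" "$port" >/dev/null 2>&1
}
```

For example, `can_reach 192.168.1.91 47279 && echo reachable` uses the host and port from the master's address file. If the check fails from the slave's node, the two machines are probably on different networks or a firewall is in the way.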
You can have a look at this presentation to better understand how all this works:
https://zenodo.org/records/4321326/preview/JCAD2019AScemama.pdf
Important: There was something wrong on the dev branch. We have created the dev-stable branch, where we have taken all the good things of the dev branch and thrown away all the changes that broke backwards compatibility, so the dev branch has been discontinued and will never be merged into the master. I suggest that you use dev-stable instead.
from qp2.
Hello @scemama,
Thank you for this information.
Running 'qp_run fci ' first as the master, followed by 'qp_run --slave fci ' as a separate job within the same directory, seems to properly connect the slave to the master. However, shortly after the slave reads the TCP address from the master, I run into a floating point error and the calculation exits with error code 136.
See output for both master and slave.
FCI_Master.txt
FCI_Slave.txt
Once the slave job runs into this floating point error, the calculation terminates. The master job continues to run, but it does not carry the calculation any further, as seen from it being cancelled at the time limit.
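For context, a shell reports a process killed by a signal as 128 plus the signal number, so exit code 136 corresponds to signal 8, which is SIGFPE on Linux. A minimal sketch for decoding such codes follows; `describe_exit_code` is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Hypothetical helper: explain a shell exit status. Codes above 128
# conventionally mean "terminated by signal (code - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        signal=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "killed by signal $signal ($(kill -l "$signal"))"
    else
        echo "exited with status $code"
    fi
}
```

Calling `describe_exit_code 136` confirms that the slave was stopped by a floating point exception rather than by a ZeroMQ or network failure.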
It is worth noting that the slurm file used to submit both of these calculations uses the QP2 Singularity container but also has the singularity, openmpi, gcc and libzmq modules loaded. Without the libzmq module loaded within the slurm file, the slave does not connect to the master as it does in the above FCI files.
See slurm file:
slurm.txt
My guess is that either 1) I have not properly created the Singularity container, though it works fine on one node, or 2) I do not have a necessary module loaded within the slurm file.
Thank you and I appreciate any additional feedback,
Ben
Hi @bavramidis,
from your outputs, it seems that you are very close! Your master and slave are communicating well: at line 454 of the output of the slave, it says "selection.00000007" which means that qp_run is doing its 7th parallel kernel, which is a selection. So the ZeroMQ part is OK.
- Can you post the files run1.sh and /qp2/src/fci/IRPF90_temp/cipsi/selection.irp.F90 present in the container?
- Which configuration file have you used when you did ./configure -c before compiling QP?
Hi @scemama,
The issue is resolved using the dev-stable branch, as you had suggested earlier.
Thank you for your help!
Ben