
Comments (4)

scemama commented on July 25, 2024

Hello,

MPI is not required to run QP in multi-node, as the communications occur with the ZeroMQ library with TCP sockets.

You first need to run a standard single node calculation:

qp_run fci <EZFIO>

This will be the "master" run. It opens a ZeroMQ socket at an address and port number stored in the <EZFIO>/work/qp_run_address file. It should look like: tcp://192.168.1.91:47279
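To check that the master is up, you can simply print that file (this assumes the shared file system layout described here):

cat <EZFIO>/work/qp_run_address    # prints e.g. tcp://192.168.1.91:47279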

On another machine, you can run a "slave" calculation that will connect to the master to accelerate it:

qp_run --slave fci <EZFIO>

If the file system is shared, the slave calculation reads the qp_run_address file to get the address and port of the master qp_run process and attaches to it.

You can run as many slaves as you want, and you can start them at any time.
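For example, here is a minimal sketch for starting one slave per node over ssh, assuming a shared file system, passwordless ssh, and hypothetical host names:

for host in node02 node03 node04 ; do                      # hypothetical host names
    ssh $host "cd $PWD && qp_run --slave fci <EZFIO>" &    # each slave reads qp_run_address itself
done
wait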

If you want to use many slaves, then it is worth running them as a single MPI job:

mpirun qp_run --slave fci <EZFIO>

In this mode, only MPI rank zero of the slave run makes the ZeroMQ connection to the master, and the common data is propagated to all the other ranks with an MPI broadcast, which is much faster than doing multiple ZeroMQ communications.
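For example, with OpenMPI, a single slave run spanning several nodes could be launched like this (the process count and host file are illustrative):

mpirun -np 4 -hostfile slave_hosts.txt qp_run --slave fci <EZFIO>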

If you look at the qp_srun script, you will see that it does exactly that:

srun -N 1 -n 1 qp_run $PROG $INPUT &    # Runs the master
srun -n $((${SLURM_NTASKS}-1))  \
    qp_run --slave $PROG $INPUT > $INPUT.slaves.out  # Runs N-1 slaves as a single MPI run
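So a complete Slurm job can be as short as the sketch below (the resource numbers are illustrative, and the path to quantum_package.rc must be adjusted to your installation):

#!/bin/bash
#SBATCH --nodes=4              # 1 master node + 3 slave nodes
#SBATCH --ntasks=4             # one qp_run process per node
#SBATCH --cpus-per-task=32     # OpenMP threads within each process
source /path/to/qp2/quantum_package.rc   # sets up the QP environment
qp_srun fci <EZFIO>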

Warning: Only the Davidson diagonalization and PT2 selection/perturbation take advantage of multi-node parallelism.

So to answer your question, the only thing you need is to make it possible for the slaves to connect to the master. The simplest way is to put them on the same network. If you can't, you can run qp_tunnel instances on machines along the path between the slave and the master, and the TCP packets will be forwarded from one network to the next.
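As a rough sketch (the exact arguments are an assumption here; check qp_tunnel --help for the actual interface), you would run the tunnel on a gateway machine that can reach both networks, pointing it at the master's address:

qp_tunnel tcp://192.168.1.91:47279    # assumed syntax; forwards traffic toward the master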

You can have a look at this presentation to better understand how all this works:
https://zenodo.org/records/4321326/preview/JCAD2019AScemama.pdf

Important: There was something wrong on the dev branch. We have created the dev-stable branch, where we kept all the good things from dev and discarded the changes that broke backward compatibility. The dev branch has been discontinued and will never be merged into master, so I suggest you use dev-stable instead.
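Switching an existing clone over is straightforward (then reconfigure and recompile as usual):

cd qp2
git fetch origin
git checkout dev-stable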


bavramidis commented on July 25, 2024

Hello @scemama ,

Thank you for this information.

Running 'qp_run fci <EZFIO>' first as the master and then 'qp_run --slave fci <EZFIO>' as a separate job in the same directory does seem to connect the slave to the master properly. However, shortly after the slave reads the TCP address from the master, I run into a floating-point error and the calculation exits with error code 136 (i.e. 128 + SIGFPE).

See output for both master and slave.
FCI_Master.txt
FCI_Slave.txt

Once the slave job hits this floating-point error, it terminates. The master job continues to run but does not carry the calculation any further, and is eventually cancelled when it reaches its time limit.

It is worth noting that the Slurm file used to submit both of these calculations runs the QP2 Singularity container, but also loads the singularity, openmpi, gcc, and libzmq modules. Without the libzmq module loaded in the Slurm file, the slave does not connect to the master as it does in the FCI files above.

See slurm file:
slurm.txt
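For context, the kind of setup described above would look roughly like this (module and image names are site-specific, so this is a hypothetical sketch rather than the actual slurm.txt):

module load singularity openmpi gcc libzmq            # hypothetical module names
singularity exec qp2.sif qp_run --slave fci <EZFIO>   # hypothetical image name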

My guess is that either (1) I have not built the Singularity container properly, though it works fine on a single node, or (2) I am missing a necessary module in the Slurm file.

Thank you and I appreciate any additional feedback,

Ben


scemama commented on July 25, 2024

Hi @bavramidis,
From your outputs, it seems that you are very close! Your master and slave are communicating well: line 454 of the slave's output says "selection.00000007", which means that qp_run is running its 7th parallel kernel, a selection. So the ZeroMQ part is OK.

  1. Can you post the files run1.sh and /qp2/src/fci/IRPF90_temp/cipsi/selection.irp.F90 from the container?
  2. Which configuration file did you use when you ran ./configure -c before compiling QP?


bavramidis commented on July 25, 2024

Hi @scemama ,

The issue is resolved using the dev-stable branch, as you mentioned earlier.

Thank you for your help!

Ben

