Comments (19)
Hi @PedrooHR, can you please give us more details about your issue? Which heatdis test are you running, what did you change and what command lines are you using?
from veloc.
Hi @bnicolae, thanks for the reply.
This is the heatdis test. I changed a few lines from one of the FTI tests. And this is the config file.
This problem is not specific to this example; it occurs with any application where I follow the same steps. The command lines are the following:
mpicc heatdis.c -o heatdis -lveloc-client -lm
mpirun -np 3 ./heatdis 4
Any time I run this without an active veloc-backend (letting the VeloC library launch it, as available in the VeloC 1.4 release), the application gets stuck at the end.
Logs:
- Application log (starting from beginning): heatdis.log
- Backend from beginning: heatdis-backend.log
- Application log (starting from the latest cp): heatdis-load.log
- Backend from latest cp: heatdis-load-backend.log
In the logs, after the line "Execution finished in ... seconds." the program does not terminate.
If I launch the backend beforehand, if one is already active, or if the sync mode is used in the config file, everything works fine.
I'm a researcher, and in our project we are leveraging VeloC as the checkpoint library inside our MPI fault tolerance library. This FT library will be transparent to the final user, so it would be nice to use this feature of VeloC 1.4.
from veloc.
@PedrooHR, your test program has multiple problems: (1) you are specifying a relative path in the config file for persistent and scratch (which means the client and the backend may use different directories, depending on where they are launched from); (2) you do not check the result of any VELOC operation (which means the initialization or checkpointing may fail, but the program simply continues); (3) you are using a hardcoded config file name, again with a relative path (is it in the same directory where you run your program from?)
from veloc.
Hi @bnicolae, I've changed all paths in the config file (using /tmp/scratch and /tmp/persistent) and in the cpp file (in the VELOC_Init_single function) to absolute paths. I'm also now checking every VeloC function, following this example pattern:
if (VELOC_Checkpoint("heatdis", ++v) != VELOC_SUCCESS) {
    printf("CP Failed\n");
    return 1;
}
I'm also not sure I understood what you mean by "hardcoded config file name" in (3).
The backend and application are running on the same machine.
The problem is still the same as before.
from veloc.
Did you also check VELOC_Init_single to make sure it returns VELOC_SUCCESS? By (3) I mean you specify "heatdis.cfg" in VELOC_Init_single, which is relative to the current directory. Can you please attach the log of the active backend?
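To illustrate point (3), one way to rule out a relative config path is to resolve the file name to an absolute path before it reaches VeloC. The following is only a minimal sketch, assuming a POSIX system with realpath(); the helper name resolve_config_path is hypothetical and not part of the VeloC API.

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of VeloC): resolve cfg_name, which may be
 * relative to the current directory, into an absolute path stored in out.
 * Returns 0 on success, -1 if the file cannot be resolved or out is too small. */
int resolve_config_path(const char *cfg_name, char *out, size_t out_len) {
    char resolved[PATH_MAX];

    /* realpath() fails (returns NULL) if the file does not exist. */
    if (realpath(cfg_name, resolved) == NULL) {
        perror("realpath");
        return -1;
    }

    /* Copy the result and detect truncation. */
    int n = snprintf(out, out_len, "%s", resolved);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The resolved path could then be passed as the config file argument of VELOC_Init_single, and a return value of -1 already tells you that the program is not being run from the directory you expected.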
from veloc.
I've checked VELOC_Init_single too; all functions are working.
I've already changed the cfg file path to be an absolute path in VELOC_Init_single in the previous comment.
Here is the log of the active backend with the latest modifications. (As you can see, it is similar to what was reported in this comment.)
Thanks in advance.
from veloc.
Ok, in that case you may want to check where you installed VELOC. Did you install a previous version too? Maybe the client is running an old veloc-backend. Make sure you run "export VELOC_BIN=<path_to_veloc_backend>". If that does not help either, try running "ctest --verbose" in your "build" folder. If this fails, please include a log of the auto-install.py script (used to compile and install VELOC).
from veloc.
Yes, I've installed previous versions of VeloC, but I'm sure that the backend is from this latest version. I have not set the VELOC_BIN environment variable, but I have <veloc_install_dir>/lib in my LD_LIBRARY_PATH and LIBRARY_PATH, and <veloc_install_dir>/bin in my PATH, so the library can find veloc-backend from PATH. I've double-checked, and only one bin and lib path can be reached (those of a new clean installation I made); the previous installations of VeloC are not reachable.
As I said, I've made a new installation of VeloC (which veloc-backend returns the bin path of the new installation).
The ctest --verbose output is fine, as you can check here.
This is the new installation log ($ ./auto-install.py install).
I'm also having the same problem running inside the docker container we use in our CI (the container has the latest version of Veloc and MPICH). I understand those problems could be a thing on my personal computer, but they should not happen inside the container with only one version.
from veloc.
Ok, can you please share your full code and build/test script as a zip? I can try to see if I can reproduce this problem
from veloc.
Hi @bnicolae
Here is the zip with the test case. Just run make and make test (with no veloc-backend active) to test the program.
Thanks.
from veloc.
@PedrooHR I cannot reproduce this error. For me, VeloC is working just fine. Can you tell me more about your setup? What Linux distribution are you using? Maybe you are using a customized older version (1.65.1) of Boost? Normally auto-install should download and use the latest version automatically (which as of now is 1.74.0).
Alternatively you can try to build VELOC without Boost, like so:
./auto-install.py --protocol socket_queue <install_dir>
from veloc.
Hi @bnicolae, I've tested
./auto-install.py --protocol socket_queue <install_dir>
and I'm having the same problem.
I've checked the installation log: on my PC, VeloC was using version 1.65.1 of Boost, but in the container (where I experienced the same problem) the version of Boost was the latest one, 1.74.0.
My notebook runs Ubuntu 18.04 and the container is also based on Ubuntu 18.04.
from veloc.
@PedrooHR, can you try to run a different distribution in your container? If this doesn't work, can you provide a Dockerfile so I can recreate your container?
from veloc.
Hi @bnicolae
Here is the container we are using in our research. VeloC is installed under /opt/veloc/. You can check line 46 in the Image layers to see how VeloC was installed.
This image doesn't contain the test I sent you a few messages ago, so you will probably need to copy the test into the container.
Thanks.
from veloc.
Just an update: if you could not get the container from the link above in time, you can look here and search for ubuntu18.04-cuda10.2-mpich to find the right container.
Thanks.
from veloc.
Thanks @PedrooHR, I'll take a look next week after SC20 is over, lots of stuff going on right now
from veloc.
@PedrooHR, I can confirm your issue. However, this is not due to VELOC; it is caused by Hydra, the process launcher of mpich. Apparently it keeps track of all processes launched through MPI, including the processes launched by the MPI ranks, and waits for them to finish. Normally, it should not do this but rather keep track of process sessions (so that you can "detach" processes just like daemons do). In any case, we will discuss this with the mpich team. Until then, you can try newer versions of mpich (maybe this is fixed), switch to OpenMPI, or simply launch the backend in a script you supply to mpirun, like this:
$ mpirun -np N <script.sh> <app> <parameters>
script.sh:
#!/bin/bash
# Start the active backend in the background, then run the application with its arguments.
veloc-backend &
"$@"
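For readers unfamiliar with the "process sessions" mentioned above, the following is a minimal sketch of what detaching into a new session looks like on POSIX, using fork() and setsid(). It is not VeloC's or Hydra's actual code, and the function name child_detaches_into_new_session is hypothetical; it only shows that a child can leave its parent's session, which is the property a session-aware launcher could rely on.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that moves itself into a new session.
 * Returns 1 if the child's session ID differs from the parent's,
 * 0 if it does not, and -1 on error. */
int child_detaches_into_new_session(void) {
    pid_t parent_sid = getsid(0);
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: setsid() succeeds here because a freshly forked child
         * is never a process group leader. */
        if (setsid() == (pid_t)-1)
            _exit(2);
        /* Report through the exit code whether the session changed. */
        _exit(getsid(0) != parent_sid ? 0 : 1);
    }

    /* Parent: collect the child's verdict. */
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    if (WEXITSTATUS(status) == 1)
        return 0;
    return -1;
}
```

A real daemon would keep running after setsid() (and usually also close or redirect its standard streams); here the child exits immediately so that the parent can observe the result.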
from veloc.
Thanks @bnicolae
I'll try that.
from veloc.
@PedrooHR: we have a new mode for VELOC where the active backends run as threads in existing MPI ranks of the application (one rank per node is elected as leader to run the active backend). You can try that out by setting the "threaded = true" configuration option in the VELOC config file. Let me know if this works.
from veloc.