Comments (19)

bnicolae commented on July 20, 2024

Hi @PedrooHR, can you please give us more details about your issue? Which heatdis test are you running, what did you change and what command lines are you using?

from veloc.

PedrooHR commented on July 20, 2024

Hi @bnicolae, thanks for the reply.

This is the heatdis test. I changed a few lines from one of the FTI tests. And this is the config file.

This problem is not specific to this example; it occurs with any application where I follow the same steps. The command lines are the following:

  1. mpicc heatdis.c -o heatdis -lveloc-client -lm
  2. mpirun -np 3 ./heatdis 4

Any time I run this without an active veloc-backend (letting the VeloC library launch it, as available in the VeloC 1.4 release), the application gets stuck at the end.

Logs:

In the log, after the line "Execution finished in ... seconds.", the program never exits.

If I launch the backend beforehand, if one is already active, or if sync mode is set in the config file, everything works fine.

I'm a researcher, and in our project we use VeloC as the checkpoint library inside our MPI fault tolerance library. This FT library will be transparent to the end user, so it would be nice to be able to use this feature of VeloC 1.4.

bnicolae commented on July 20, 2024

@PedrooHR, your test program has multiple problems: (1) you are specifying a relative path in the config file for persistent and scratch (which means the client and the backend may use different directories, depending on where they are launched from); (2) you do not check the result of any VELOC operation (which means the initialization or checkpointing may not be successful, but you don't notice and simply continue); (3) you are using a hardcoded config file name, again with a relative path (is it in the same directory where you run your program from?).

PedrooHR commented on July 20, 2024

Hi @bnicolae, I've changed all paths to absolute paths, both in the config file (now using /tmp/scratch and /tmp/persistent) and in the cpp file (in the VELOC_Init_single call). I'm now also checking every VeloC function, following this pattern:

if (VELOC_Checkpoint("heatdis", ++v) != VELOC_SUCCESS) { 
   printf("CP Failed\n"); 
   return 1; 
} 

I'm also not sure I understood what you mean by "hardcoded config file name" in (3).

The backend and application are running on the same machine.

The problem is still the same as before.

bnicolae commented on July 20, 2024

Did you also check VELOC_Init_single to make sure it returns VELOC_SUCCESS? By (3) I mean you specify "heatdis.cfg" in VELOC_Init_single, which is relative to the current directory. Can you please attach the log of the active backend?

PedrooHR commented on July 20, 2024

I've checked VELOC_Init_single too; all calls succeed.

As mentioned in my previous comment, I had already changed the cfg file path in VELOC_Init_single to an absolute path.

Here is the log of the active backend with the latest modifications. (As you can see, it is similar to what was reported in this comment.)

Thanks in advance.

bnicolae commented on July 20, 2024

Ok, in that case you may want to check where you installed VELOC. Did you install a previous version too? Maybe the client is running an old veloc-backend. Make sure you run "export VELOC_BIN=<path_to_veloc_backend>". If that does not help either, try running "ctest --verbose" in your "build" folder. If this fails, please include a log of the auto-install.py script (used to compile and install VELOC).
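
One way to see which veloc-backend could be picked up is to list every match in PATH, not just the first one. The sketch below does that with a hypothetical helper, list_in_path. It assumes the backend is located through VELOC_BIN or PATH as described in this thread, and it does not inspect what the VeloC client actually does internally.

```shell
#!/bin/bash
# Sketch: show VELOC_BIN and every veloc-backend reachable through PATH.
# list_in_path is a hypothetical helper name, not part of VeloC.

# Print each executable named $1 found in PATH, in lookup order.
list_in_path() {
    local name="$1" dir
    local IFS=':'
    for dir in $PATH; do
        # An empty PATH entry means the current directory
        [ -n "$dir" ] || dir="."
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}

echo "VELOC_BIN=${VELOC_BIN:-<unset>}"
matches="$(list_in_path veloc-backend)"
if [ -z "$matches" ]; then
    echo "no veloc-backend found in PATH"
else
    echo "veloc-backend candidates (first one wins):"
    printf '%s\n' "$matches"
fi
```

More than one line in the candidate list would point to leftovers from an older installation.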

PedrooHR commented on July 20, 2024

Yes, I've installed previous versions of VeloC, but I'm sure the backend is from the latest version. I have not set the VELOC_BIN environment variable, but I have <veloc_install_dir>/lib in my LD_LIBRARY_PATH and LIBRARY_PATH, and <veloc_install_dir>/bin in my PATH, so the library can find veloc-backend through PATH. I've double-checked that only one bin and lib path can be reached (those of a new clean installation I made); the previous installations of VeloC are not reachable.

As I said, I've made a new installation of VeloC (running which veloc-backend returns the bin path of the new installation).

The ctest --verbose is fine, as you can check here

This is the new installation log ($ ./auto-install.py install).

I'm also having the same problem running inside the docker container we use in our CI (the container has the latest version of Veloc and MPICH). I understand those problems could be a thing on my personal computer, but they should not happen inside the container with only one version.
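
Since the hang also shows up in CI, a time limit around the run can at least turn it into a visible failure instead of a stuck job. This is only a sketch using timeout from GNU coreutils (which exits with status 124 when the limit expires); run_with_limit and the 120 second limit are illustrative choices, not something from the VeloC project.

```shell
#!/bin/bash
# Sketch: run a command with a time limit and report a hang separately
# from an ordinary failure. run_with_limit is a hypothetical helper.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local status=$?
    # GNU timeout returns 124 when the command had to be stopped
    if [ "$status" -eq 124 ]; then
        echo "command did not exit within ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example, commented out because it needs MPI and the heatdis binary:
# run_with_limit 120 mpirun -np 3 ./heatdis 4
```

Processes left behind after the limit expires may still need to be cleaned up separately.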

bnicolae commented on July 20, 2024

Ok, can you please share your full code and build/test script as a zip? I can then check whether I can reproduce this problem.

PedrooHR commented on July 20, 2024

Hi @bnicolae

Here is the zip with the test case. Just run make and make test (with no veloc-backend active) to test the program.

Thanks.

bnicolae commented on July 20, 2024

@PedrooHR I cannot reproduce this error. For me, VeloC is working just fine. Can you tell me more about your setup? What Linux distribution are you using? Maybe you are using a customized older version (1.65.1) of Boost? Normally auto-install should download and use the latest version automatically (which as of now is 1.74.0).

Alternatively you can try to build VELOC without Boost, like so:
./auto-install.py --protocol socket_queue <install_dir>

PedrooHR commented on July 20, 2024

Hi @bnicolae, I've tested
./auto-install.py --protocol socket_queue <install_dir> and I'm having the same problem.

I've checked the installation log: on my PC, VeloC was using Boost 1.65.1, but in the container (where I experienced the same problem) the Boost version was the latest one, 1.74.0.

My notebook runs Ubuntu 18.04, and the container is also based on Ubuntu 18.04.

bnicolae commented on July 20, 2024

@PedrooHR, can you try to run a different distribution in your container? If this doesn't work can you provide a Docker file so I can recreate your container?

PedrooHR commented on July 20, 2024

Hi @bnicolae

Here is the container we are using in our research. Veloc is installed under /opt/veloc/. You can check line 46 in the Image layers to see how veloc was installed.

This image doesn't contain the test I sent you a few messages ago, so you will probably need to copy the test into the container.

Thanks.

PedrooHR commented on July 20, 2024

Just an update: if you could not get the container from the link above in time, you can look here and search for ubuntu18.04-cuda10.2-mpich to find the right container.

Thanks.

bnicolae commented on July 20, 2024

Thanks @PedrooHR, I'll take a look next week after SC20 is over, lots of stuff going on right now

bnicolae commented on July 20, 2024

@PedrooHR, I can confirm your issue. However, this is not due to VELOC; it is caused by Hydra, the process launcher of mpich. Apparently it keeps track of all processes launched through MPI, including the processes launched by the MPI ranks, and waits for them to finish. Normally it should not do this, but rather keep track of process sessions (so that you can "detach" processes just like daemons do). In any case, we will discuss this with the mpich team. Until then, you can try newer versions of mpich (maybe this is fixed), switch to OpenMPI, or simply launch the backend in a script you supply to mpirun, like this:

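$ mpirun -np N <script.sh> <app> <parameters>

script.sh:
#!/bin/bash
# Start the active backend in the background, then run the application
veloc-backend &
# "$@" passes the application and its parameters through with quoting preserved
"$@"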
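
To illustrate the session idea mentioned above: a process started through setsid (from util-linux) ends up in a new session, separate from that of its launcher. The Linux-only sketch below reads the session ID from field 6 of /proc/self/stat. It only demonstrates the concept and is not claimed to change how Hydra tracks processes.

```shell
#!/bin/bash
# Linux-only sketch: compare the session ID of an ordinary child process
# with that of a child started through setsid.

# Field 6 of /proc/<pid>/stat is the session ID. "self" resolves to the
# reading process (cut), which inherits the session of whoever started it.
current_sid="$(cut -d' ' -f6 /proc/self/stat)"
detached_sid="$(setsid cut -d' ' -f6 /proc/self/stat)"

echo "session of an ordinary child: $current_sid"
echo "session of a setsid child:    $detached_sid"
```

The two numbers differ because setsid creates a new session before executing the command.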

PedrooHR commented on July 20, 2024

Thanks @bnicolae
I'll try that.

bnicolae commented on July 20, 2024

@PedrooHR: we have a new mode for VELOC where the active backends run as threads in existing MPI ranks of the application (one rank per node is elected as leader to run the active backend). You can try that out by setting the "threaded = true" configuration option in the VELOC config file. Let me know if this works.
