
Comments (17)

rmodrak commented on July 26, 2024

On Wed, Feb 24, 2016 at 1:37 PM, Gian Matharu [email protected] wrote:
Hey Ryan,

I was wondering if you'd noticed any lag when seisflows executes commands on clusters. I seem to notice lags between ending one call and proceeding to the next, even though the prior task appears to be complete.

No, I don't believe we've had that problem. Perhaps it is some sort of 'epilogue' (http://docs.adaptivecomputing.com/torque/3-0-5/a.gprologueepilogue.php)? Or perhaps just unreliable behavior of PBS.


rmodrak commented on July 26, 2024

Update: By replacing PBS environment variables with MPI directives, Gian's system.mpi class removes the need to distinguish between PBS Pro and TORQUE in some cases; pbs_sm should work for both variants, I believe. To my knowledge, pbs_lg has been tested on PBS Pro clusters but still needs testing on TORQUE clusters (while it may work on the latter with small modifications, it is not likely to work right out of the box).


gianmatharu commented on July 26, 2024

I ran into another topic that may warrant some consideration. On the PBS cluster I'm currently using, certain software needs to be loaded prior to running codes, e.g. "module load intel" loads the Intel compilers. Admittedly, I'm not sure if this is standard for PBS clusters in general, but would it be beneficial to add an option to specify such directives from within seisflows' system.submit?

There are easy alternatives (using bash profiles or in some cases one could use a submission script to submit sfrun), so it's not a pressing issue and may not be necessary.


rmodrak commented on July 26, 2024

For controlling the user environment, I think the module utility is standard for all types of clusters, not just PBS; I can't think of a cluster I've used that hasn't had it. However, the list of available modules (module avail) is highly system dependent, so trying to script module load ... may not be a good approach.

A better approach, I think, would be to include environment assertions in check methods. For example, if mpi4py is required but not on the PYTHONPATH, an environment exception should be raised.
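For what it's worth, a minimal sketch of such an assertion might look like the following; the function name is purely illustrative (in seisflows it would live in a system class's check method):

    # illustrative only: assert that mpi4py is importable before the workflow starts
    def check_environment():
        try:
            import mpi4py  # noqa: F401  (we only care that the import succeeds)
        except ImportError:
            raise EnvironmentError(
                "mpi4py is required by this system class but is not on the "
                "PYTHONPATH; load the appropriate module or install it first")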


dkzhangchao commented on July 26, 2024

Hi Ryan,
Sorry to bother you again. I am also running into bugs when using system='pbs_sm'; I am not sure if they are caused by mpi4py.

When I get to 'bin/xspecfem2D', it fails with:
[gpc-f106n012-ib0:30348] [[7631,1],2] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30353] [[7631,1],7] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30349] [[7631,1],3] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30351] [[7631,1],4] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30350] [[7631,1],1] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30352] [[7631,1],6] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],5] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],0] routed:binomial: Connection to lifeline [[7631,0],0] lost
Do you know what this means?
Thanks

EDIT: revised for clarity [rmodrak]


rmodrak commented on July 26, 2024

It seems like pbs_sm is not working on your cluster. Similar to the problem described here, mpi4py seems to fail when encountering Python subprocess calls. Strangely, on other PBS clusters, this is not always an issue.

As a workaround, try using pbs_torque_sm instead, which uses pbsdsh under the hood rather than mpi4py.
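For reference, switching system classes should only be a one-line change in the parameter file. A sketch, assuming a legacy-style parameters.py with an uppercase SYSTEM key (adjust to whatever your parameter file actually uses):

    # in parameters.py (assumed layout); everything else stays the same
    SYSTEM = 'pbs_torque_sm'    # was 'pbs_sm'; uses pbsdsh rather than mpi4py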

I'll go ahead and edit your post for clarity if that's alright.


gianmatharu commented on July 26, 2024

The MPI (mpi4py) system class may not be an ideal solution for task parallelism on clusters. It is an approach that aims to provide flexibility across cluster resource managers. While I haven't encountered the particular issue above, I have encountered issues with the class in other forms. As Ryan has suggested, it is probably best to use the utilities provided by the resource manager (e.g. pbsdsh).


rmodrak commented on July 26, 2024

Thanks everyone for the useful comments. I've opened a new issue #40 for discussing shot parallelism problems of the kind Chao is experiencing. In this new issue, I've posted some previous emails between Gian and myself that seem relevant to the problem at hand.


dkzhangchao commented on July 26, 2024

Hi Ryan and Gian,

I have tried system='pbsdsh', and I run into the same problem when it executes

    system.run('solver', 'setup', 
               hosts='all')

[screenshot: 11] https://cloud.githubusercontent.com/assets/8068058/18825181/0f90753a-8395-11e6-95c5-0ee7f81da1ab.png

Actually, the observed data is generated after this, which means pbsdsh can be invoked here.
However, it seems to get stuck in system.run (the subprocess.call to pbsdsh). The error is shown below:
[screenshot: 22] https://cloud.githubusercontent.com/assets/8068058/18825372/0fc398c4-8396-11e6-8328-36b0511bf6a0.png

Can you give me any advice? By the way, I do get the forward result (observed data) after system.run.


gianmatharu commented on July 26, 2024

It does appear to be hanging, but it is not the issue I saw in issue #40 (where mpi4py was used). It seems to be due to pbsdsh hanging, which I have encountered. The issue is inconsistent for me: it can occur on one attempt and not on the subsequent attempt. I don't observe it that frequently any more.

What workflow are you attempting to run? I'd suggest checking any instances of system.run (within the workflow) to verify that the calls to pbsdsh aren't problematic.


dkzhangchao commented on July 26, 2024

I am running the inversion workflow. This is the first time system.run is invoked (that is, the forward runs that generate the observed data). I am confused: if system.run has a problem, why does it still generate the data after executing

    system.run('solver', 'setup',
               hosts='all')

By the way, do you think my call to system.run is correct?
[screenshot: 11]


rmodrak commented on July 26, 2024

Perhaps using pbsdsh to invoke a simple "hello world" script might be useful as a debugging step. The integration test located in seisflows/tests/test_system could be employed for this purpose.
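As a rough illustration of that debugging step, one could submit a tiny PBS job whose script calls pbsdsh on something like the file below and check that every task reports in (the file name and invocation are just an example, not part of seisflows; the exact pbsdsh call depends on your cluster):

    # hello.py -- invoked on every task, e.g.:  pbsdsh $(which python) /full/path/to/hello.py
    # each task should print its own PBS_VNODENUM and hostname;
    # missing or hanging tasks point to a problem with pbsdsh itself
    import os
    import socket

    print("task %s says hello from %s"
          % (os.getenv("PBS_VNODENUM", "?"), socket.gethostname()))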

That said, my understanding from speaking to our local cluster administrator is that the pbsdsh utility itself can be unreliable. This may well be the explanation for what you are seeing, but it may be worth troubleshooting a bit to make sure there's not some alternate explanation.


rmodrak commented on July 26, 2024

If pbsdsh is in fact the problem, a workaround might be to create your own dsh script by calling ssh within a for loop. You would need to (1) manually specify the compute nodes to run on using the list of allocated nodes made available by PBS, (2) use the SendEnv option or something similar to assign each ssh process a unique identifier, and (3) wait until all the ssh processes complete before moving on. A disadvantage of this approach is that the child ssh processes might continue running in the event the parent process fails.
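A rough Python sketch of that idea, following steps (1)-(3); it is not part of seisflows, and SendEnv only works if the remote sshd is configured to accept PBS_VNODENUM, so the export inside the command string is the safer fallback:

    # poor man's dsh: launch one ssh per allocated node, then wait for all of them
    import os
    import subprocess

    def my_dsh(command):
        # (1) nodes allocated by PBS, one hostname per line (may contain repeats)
        with open(os.environ['PBS_NODEFILE']) as f:
            nodes = f.read().split()

        processes = []
        for taskid, node in enumerate(nodes):
            # (2) give each remote process a unique identifier
            env = dict(os.environ, PBS_VNODENUM=str(taskid))
            remote = 'export PBS_VNODENUM=%d; %s' % (taskid, command)
            processes.append(subprocess.Popen(
                ['ssh', '-o', 'SendEnv=PBS_VNODENUM', node, remote], env=env))

        # (3) block until every remote task has finished
        for p in processes:
            p.wait()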


rmodrak commented on July 26, 2024

Yet another workaround would be to write an entirely new PBS system class based on the following package:

https://radicalpilot.readthedocs.io/en/latest/

Through an NSF grant, we are currently collaborating with the developers of the radical pilot package, so such an interface might eventually be added to seisflows, but not for a while yet. If you get something to work in the meantime, please feel free to submit a pull request.


rmodrak commented on July 26, 2024

To reflect the expanded scope of the discussion, I'll go ahead and change the issue title to something more general.

Also, it's worth noting that none of the problems mentioned above have been encountered on SLURM clusters. So if the possibility ever arises to switch to SLURM I'd highly recommend it...


dkzhangchao commented on July 26, 2024

Hi Ryan,
Yes, I agree with you. To be honest, there are more issues with the PBS system than with SLURM. The staff of my cluster tell me there seems to be a problem with pbsdsh, so I wrote a script that emulates the behavior of pbsdsh. Does it match your suggestion?

#!/bin/bash
# temporary workaround for pbsdsh: run the given command once per allocated node
# (note: each ssh call blocks before the next one starts, so tasks run serially)
k=0
for i in $(cat "$PBS_NODEFILE"); do
    ssh "$i" "export PBS_VNODENUM=$k; $@"
    k=$(($k + 1))
done

With this it runs successfully, but compared with pbsdsh it runs much slower.


rmodrak commented on July 26, 2024

This is definitely in the right direction. If you were implementing it in bash, it would be something like what you have written. However, since you would be overloading the 'run' method of the pbs system class, it actually needs to be implemented in Python. Come to think of it, rather than a subprocess call to ssh, in Python it would probably be better to use paramiko (http://stackoverflow.com/questions/3586106/perform-commands-over-ssh-with-python).
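Very roughly, a paramiko-based version might look like the sketch below; the function name and host bookkeeping are placeholders rather than actual seisflows code, and it assumes key-based, passwordless ssh between compute nodes:

    # sketch of a paramiko-based replacement for the pbsdsh call
    import os
    import paramiko

    def run_on_all_nodes(command):
        with open(os.environ['PBS_NODEFILE']) as f:
            nodes = f.read().split()

        for taskid, node in enumerate(nodes):
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            client.connect(node)  # assumes passwordless ssh between nodes
            _, stdout, _ = client.exec_command(
                'export PBS_VNODENUM=%d; %s' % (taskid, command))
            stdout.channel.recv_exit_status()  # wait for this task to finish
            client.close()
        # note: as written the tasks run one after another; threads or a pool
        # would be needed to recover pbsdsh-style parallelism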

