Comments (5)

Qcellaris avatar Qcellaris commented on September 1, 2024

I have also tested the code on the HPC cluster, where we are using Tesla V100s, and I run into the same errors. I have attached the corresponding error log here as well, but I don't think it gives us any new information.

slurm-76649.txt

from pysph.

inducer avatar inducer commented on September 1, 2024

See also inducer/pyopencl#562 (comment).


prabhuramachandran avatar prabhuramachandran commented on September 1, 2024

Hi, sorry about the slow response. Is it possible that there is a blow up of the particles and a large increase of the domain size? As regards the restart it is possible that there is an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue better.


Qcellaris avatar Qcellaris commented on September 1, 2024

I don't see any blow-up or large increase in the domain size when I look at the results of the last data frame. The crash happens after hours of simulation on the GPU (approximately 70k iterations). If I restart from the last output file on a CPU, it continues fine without any blow-up or large increase in domain size. I have attached the files of the simulations that gave us these errors (I changed the .py extension to .txt, otherwise I couldn't include them in this message). I will check whether I also encounter this issue with a smaller example.

We have been looking into the issue ourselves for a while as well. Inducer mentioned in the comment above: "An unsigned integer underflow comes to mind as a possible reason." The only unsigned integers I found are particle indices, which are used in e.g. neighbor lists. Since the code runs fine on a CPU and this error only occurs on a GPU, we thought it might have to do with the specific implementation of neighbor lists on the GPU. Might it be that neighbor list memory on the GPU is not dynamic and that the length of the neighbor lists has a fixed maximum? Say the maximum number of neighbors in the neighbor list is 30; if at some moment during the simulation the number of neighbors exceeds this, we might end up with this kind of unsigned integer underflow problem. If there is such a hard cap on the number of neighbors, I could try changing it to a larger number and see if that solves the issue, but I couldn't find anything on that matter.
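To make the two failure modes discussed above concrete, here is a minimal NumPy sketch. It is not PySPH or PyOpenCL code: the function names and the cap of 30 are hypothetical, chosen only to mirror the example in this comment, and whether PySPH's GPU neighbor lists have any such cap is exactly the open question.

```python
import numpy as np

# Hypothetical cap, taken from the "say 30" example above; not a PySPH setting.
MAX_NEIGHBORS = 30


def wrapped_difference(a, b):
    """Compute a - b in 32-bit unsigned arithmetic, which wraps around below zero."""
    a = np.asarray([a], dtype=np.uint32)
    b = np.asarray([b], dtype=np.uint32)
    return int((a - b)[0])


def fill_neighbors(candidates, capacity=MAX_NEIGHBORS):
    """Fill a fixed-size neighbor buffer and report whether candidates were dropped.

    Returns (buffer, count, overflowed). Unused slots hold the largest uint32 value
    as a sentinel.
    """
    buf = np.full(capacity, np.iinfo(np.uint32).max, dtype=np.uint32)
    count = 0
    overflowed = False
    for idx in candidates:
        if count >= capacity:
            # A kernel without this check would write past the end of the buffer.
            overflowed = True
            break
        buf[count] = idx
        count += 1
    return buf, count, overflowed
```

Calling `wrapped_difference(0, 1)` shows how an unsigned index that drops below zero turns into a very large value instead of -1, and `fill_neighbors(range(40))` shows how a fixed-capacity list can be detected as full rather than silently overrun.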

collision.txt
surface_tension_adami.txt


Qcellaris avatar Qcellaris commented on September 1, 2024

> Hi, sorry about the slow response. Is it possible that there is a blow up of the particles and a large increase of the domain size? As regards the restart it is possible that there is an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue better.

Dear Prabhu,

The simulations continue perfectly fine on a CPU, but when I continue them on a GPU they crash. If it were a blow-up of particles, it would crash on the CPU as well as on the GPU, right?

I also don't think it is related to some necessary state not being saved for the restart. When we run the simulation from the start, it also crashes every time at the same point, and restarting from there gives the same error log as a run straight from the start.

Maybe we can look into this issue in more detail together?

Best,

Stephan

