
Comments (3)

angainor commented on July 30, 2024

@hjelmn Not sure if this is important, but I've noticed that the deadlock occurs less often when I don't call xpmem_remove explicitly in my code. This makes me wonder: is there a cleanup problem when the publisher calls xpmem_remove on a region that peers still have attached? In other words, the publisher calls xpmem_remove, and only then does the peer call xpmem_detach and xpmem_release.

from xpmem.

cvmeq commented on July 30, 2024

@angainor Have you found a solution or root cause for this? We are experiencing very similar crashes, resulting in zombie/defunct processes, on our AMD cluster running RHEL 8.6 and MLNX_OFED_LINUX-5.8-1.0.1.1:

[Wed Jun 14 17:28:43 2023] Call Trace:
[Wed Jun 14 17:28:43 2023]  __schedule+0x2d1/0x840
[Wed Jun 14 17:28:43 2023]  schedule+0x35/0xa0
[Wed Jun 14 17:28:43 2023]  schedule_timeout+0x278/0x300
[Wed Jun 14 17:28:43 2023]  ? number+0x324/0x360
[Wed Jun 14 17:28:43 2023]  ? get_futex_key+0x98/0x3e0
[Wed Jun 14 17:28:43 2023]  wait_for_completion+0x96/0x100
[Wed Jun 14 17:28:43 2023]  __synchronize_srcu.part.17+0x83/0xb0
[Wed Jun 14 17:28:43 2023]  ? __bpf_trace_rcu_utilization+0x10/0x10
[Wed Jun 14 17:28:43 2023]  ? synchronize_srcu+0xad/0xf0
[Wed Jun 14 17:28:43 2023]  mmu_notifier_unregister+0xa6/0xe0
[Wed Jun 14 17:28:43 2023]  xpmem_flush+0x14a/0x170 [xpmem]
[Wed Jun 14 17:28:43 2023]  filp_close+0x31/0x70
[Wed Jun 14 17:28:43 2023]  put_files_struct+0x70/0xc0
[Wed Jun 14 17:28:43 2023]  do_exit+0x32f/0xb10
[Wed Jun 14 17:28:43 2023]  do_group_exit+0x3a/0xa0
[Wed Jun 14 17:28:43 2023]  get_signal+0x158/0x870
[Wed Jun 14 17:28:43 2023]  do_signal+0x36/0x690
[Wed Jun 14 17:28:43 2023]  ? do_send_sig_info+0x63/0x90
[Wed Jun 14 17:28:43 2023]  ? recalc_sigpending+0x17/0x60
[Wed Jun 14 17:28:43 2023]  exit_to_usermode_loop+0x89/0x100
[Wed Jun 14 17:28:43 2023]  do_syscall_64+0x19c/0x1b0
[Wed Jun 14 17:28:43 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xc6


angainor commented on July 30, 2024

@cvmeq Unfortunately no, I still see those issues sometimes, mostly when a large job is killed or interrupted, or at job cleanup. The only solution for me was not to use the xpmem transport in OpenMPI / UCX.
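For anyone who wants to try the same workaround from a launcher script, here is a minimal sketch of how the xpmem transport might be switched off through environment variables. It is not taken from this thread: the helper name `env_without_xpmem` is hypothetical, `UCX_TLS=^xpmem` relies on UCX treating a leading `^` as a deny list, and `OMPI_MCA_btl_vader_single_copy_mechanism=none` is assumed to apply to the vader shared-memory BTL in Open MPI 4.x (other versions may use different parameter names, so check `ompi_info` and `ucx_info -c` on your system).

```python
import os


def env_without_xpmem(base=None):
    """Return a copy of an environment with the xpmem transport disabled.

    Hypothetical helper for illustration; the variable names below are
    assumptions that should be verified against the installed versions.
    """
    env = dict(os.environ if base is None else base)

    # UCX: a leading '^' makes UCX_TLS a deny list, so all transports
    # except xpmem remain selectable. This overwrites any existing
    # UCX_TLS value in the copied environment.
    env["UCX_TLS"] = "^xpmem"

    # Open MPI 4.x vader BTL (assumption): disable the single-copy
    # mechanism so xpmem is not used for shared-memory transfers.
    env["OMPI_MCA_btl_vader_single_copy_mechanism"] = "none"

    return env


if __name__ == "__main__":
    env = env_without_xpmem()
    print("UCX_TLS =", env["UCX_TLS"])
    print(
        "OMPI_MCA_btl_vader_single_copy_mechanism =",
        env["OMPI_MCA_btl_vader_single_copy_mechanism"],
    )
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting `mpirun`. Disabling the single-copy path will likely cost some intra-node bandwidth for large messages, so this trades performance for avoiding the hang.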

