Comments (3)
@hjelmn not sure if this is important, but I've noticed that I get the deadlock less often when I don't call xpmem_remove explicitly in my code. This makes me wonder: is there a possible cleanup problem when the publisher calls xpmem_remove on a region that peers are still attached to? In other words, the publisher calls xpmem_remove and only then the peer calls xpmem_detach and xpmem_release.
from xpmem.
@angainor have you found a solution or root cause for this? We are experiencing very similar crashes resulting in zombie/defunct processes in our AMD cluster running RHEL 8.6 and MLNX_OFED_LINUX-5.8-1.0.1.1:
[Wed Jun 14 17:28:43 2023] Call Trace:
[Wed Jun 14 17:28:43 2023] __schedule+0x2d1/0x840
[Wed Jun 14 17:28:43 2023] schedule+0x35/0xa0
[Wed Jun 14 17:28:43 2023] schedule_timeout+0x278/0x300
[Wed Jun 14 17:28:43 2023] ? number+0x324/0x360
[Wed Jun 14 17:28:43 2023] ? get_futex_key+0x98/0x3e0
[Wed Jun 14 17:28:43 2023] wait_for_completion+0x96/0x100
[Wed Jun 14 17:28:43 2023] __synchronize_srcu.part.17+0x83/0xb0
[Wed Jun 14 17:28:43 2023] ? __bpf_trace_rcu_utilization+0x10/0x10
[Wed Jun 14 17:28:43 2023] ? synchronize_srcu+0xad/0xf0
[Wed Jun 14 17:28:43 2023] mmu_notifier_unregister+0xa6/0xe0
[Wed Jun 14 17:28:43 2023] xpmem_flush+0x14a/0x170 [xpmem]
[Wed Jun 14 17:28:43 2023] filp_close+0x31/0x70
[Wed Jun 14 17:28:43 2023] put_files_struct+0x70/0xc0
[Wed Jun 14 17:28:43 2023] do_exit+0x32f/0xb10
[Wed Jun 14 17:28:43 2023] do_group_exit+0x3a/0xa0
[Wed Jun 14 17:28:43 2023] get_signal+0x158/0x870
[Wed Jun 14 17:28:43 2023] do_signal+0x36/0x690
[Wed Jun 14 17:28:43 2023] ? do_send_sig_info+0x63/0x90
[Wed Jun 14 17:28:43 2023] ? recalc_sigpending+0x17/0x60
[Wed Jun 14 17:28:43 2023] exit_to_usermode_loop+0x89/0x100
[Wed Jun 14 17:28:43 2023] do_syscall_64+0x19c/0x1b0
[Wed Jun 14 17:28:43 2023] entry_SYSCALL_64_after_hwframe+0x61/0xc6
@cvmeq Unfortunately no, I still see those issues sometimes, mostly when you kill / interrupt a large job, or at job cleanup. The only solution for me was not to use the xpmem transport in OpenMPI / UCX.