❯ python main-multigpu-sharedmem.py
Serializing 860001 elements to byte tensors and concatenating them all ...
Serialized dataset takes 505.05 MiB
Traceback (most recent call last):
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/multiprocessing/resource_sharer.py", line 147, in _serve
send, close = self._cache.pop(key)
KeyError: 1
Worker 2 obtains a dataset of length=860001 from its local leader.
time PID rss pss uss shared shared_file
------ ------ ----- ----- ----- -------- -------------
77856 616200 1.5G 1.5G 1.4G 105.0M 104.9M
time PID rss pss uss shared shared_file
------ ------ ------ ------ ------ -------- -------------
77856 616202 229.4M 150.6M 124.7M 104.7M 104.7M
Traceback (most recent call last):
File "main-multigpu-sharedmem.py", line 56, in <module>
launch(main, num_gpus, dist_url="auto")
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
mp.start_processes(
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/home/xiaot/Downloads/RAM-multiprocess-dataloader/main-multigpu-sharedmem.py", line 24, in main
ds = DatasetFromList(TorchShmSerializedList(
File "/home/xiaot/Downloads/RAM-multiprocess-dataloader/serialize.py", line 73, in __init__
self._addr, self._lst = mp.reduction.ForkingPickler.loads(serialized)
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
fd = df.detach()
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/home/xiaot/miniconda3/envs/ram/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
raise EOFError
EOFError
Can others reproduce this, and are there any known fixes or workarounds? Many thanks.
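
One observation that may help narrow it down: the same pair of errors (a `KeyError` in `resource_sharer._serve`, followed by an `EOFError` in `recvfds`) can be triggered with the standard library alone by receiving a single `DupFd` handle twice. The sketch below is Unix-only, and `detach_twice` is a made-up name for illustration. It is only an assumption on my part that the script hits the same path, for example if the bytes pickled by the local leader end up being unpickled by more than one process.

```python
import os
from multiprocessing.resource_sharer import DupFd  # Unix-only helper multiprocessing uses to pass fds


def detach_twice():
    """Wrap one fd in a DupFd and try to receive it twice (hypothetical helper)."""
    read_end, write_end = os.pipe()
    handle = DupFd(read_end)   # registers a one-shot entry in the resource sharer's cache
    first = handle.detach()    # the first receiver pops that entry and gets a duplicate fd
    try:
        handle.detach()        # the entry is gone: the server thread logs KeyError
        second = "ok"
    except EOFError:           # ... and the receiving side sees the connection closed
        second = "EOFError"
    finally:
        for fd in (read_end, write_end, first):
            os.close(fd)
    return second


if __name__ == "__main__":
    print(detach_twice())
```

If that is indeed the cause, each receiving process would need its own freshly pickled handle, or a sharing mechanism that does not pass file descriptors, but I have not verified either against this script.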