Comments (7)
Thanks for reporting this issue, @mstormo!
It looks like you are on Windows.
This SO issue recommends running the code snippet inside an `if __name__ == "__main__"` guard. Can you please try running it inside that guard and see if it works?
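For reference, here is a minimal, mdio-independent sketch of what the guard looks like (the `work` function and pool size are illustrative stand-ins, not mdio's actual API):

```python
import multiprocessing as mp

def work(x):
    # Stand-in for the real per-chunk ingestion work
    return x * x

if __name__ == "__main__":
    # On Windows, multiprocessing uses the "spawn" start method:
    # every child process re-imports this module. Without the guard,
    # the Pool creation below would run again in each child and
    # spawn processes recursively.
    with mp.Pool(processes=2) as pool:
        print(pool.map(work, range(5)))  # [0, 1, 4, 9, 16]
```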
from mdio-python.
This does work around the issue.
However, it's a highly unexpected requirement: my own code does not involve any multiprocessing, and the fact that an imported module forces this requirement can have significant implications in non-trivial systems, for example when the import happens inside another module where you cannot guarantee that execution is isolated like this.
The Python process was forked over 30 times for this simple example.
Please note that Ctrl+C does not abort the process, since the other forked processes continue running in the background. So the suggested work-around only hides a larger issue.
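As a side note, the stdlib pattern that makes worker cleanup deterministic on interrupt is to terminate the pool explicitly (a sketch with a stand-in `work` function, not mdio's actual code):

```python
import multiprocessing as mp

def work(i):
    return i + 1

if __name__ == "__main__":
    pool = mp.Pool(processes=2)
    try:
        print(pool.map(work, range(4)))  # [1, 2, 3, 4]
    except KeyboardInterrupt:
        # Kill the workers immediately instead of leaving them
        # running (or hanging) in the background after Ctrl+C
        pool.terminate()
        raise
    else:
        pool.close()
    finally:
        pool.join()
```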
Console output (reordered for readability; output from several worker processes was interleaved, and the leading segyio frame appears to belong to a second worker's traceback):

```
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\segyio\trace.py", line 50, in wrapindex
    if not 0 <= i < len(self):
Traceback (most recent call last):
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\_workers.py", line 271, in trace_worker_map
    return trace_worker(*args)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\mdio\segy\_workers.py", line 182, in trace_worker
    data_array.set_basic_selection(
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 1448, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 1748, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 1820, in _set_selection
    self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 2018, in _chunk_setitems
    to_store = {k: self._encode_chunk(v) for k, v in cdatas.items()}
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 2018, in <dictcomp>
    to_store = {k: self._encode_chunk(v) for k, v in cdatas.items()}
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\zarr\core.py", line 2194, in _encode_chunk
    cdata = self._compressor.encode(chunk)
  File "C:\Users\mstormo\.pyenv\pyenv-win\versions\3.8.9\lib\site-packages\numcodecs\zfpy.py", line 70, in encode
    return _zfpy.compress_numpy(
KeyboardInterrupt
Ingesting SEG-Y in 24 chunks: 38%|█████████████ | 9/24 [00:23<00:11, 1.28block/s]
```
Correction: Ctrl+C kills the current batch of Python processes, but a new set is forked, which then just hangs at 0% CPU usage.
Thank you for your comments, @mstormo.
I don't think the `if __name__ == "__main__"` guard is unexpected. The documentation of `multiprocessing` recommends it as part of its programming guidelines.
Importing `segy_to_mdio` from `mdio` by default uses workers from https://github.com/TGSAI/mdio-python/blob/main/src/mdio/converters/segy.py#L19, which in turn imports `multiprocessing`.
The alternatives are as follows:
- Run as a single process on Windows, which is less desirable.
- Allow `multiprocessing` as an option on Windows.
- Specify the use of the guard in the documentation.
What do you suggest as a fix? Happy to accept a PR.
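A sketch of how the first two options could combine (the helper name and parameters here are hypothetical, not the current mdio API):

```python
import platform

def resolve_num_workers(requested: int, force_parallel: bool = False) -> int:
    # Hypothetical helper: default to a single process on Windows,
    # where spawn-based multiprocessing requires the __main__ guard,
    # unless the caller explicitly opts in to parallelism.
    if platform.system() == "Windows" and not force_parallel:
        return 1
    return max(1, requested)
```

With this shape, Windows users get safe single-process ingestion out of the box, and `force_parallel=True` restores the current behavior for those who structure their scripts with the guard.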
The "unexpected" remark was about MDIO acting as a black box for the end-user: the required guard statement is not mentioned anywhere in the documentation or examples, while Dask is explicitly mentioned for parallel distributed processing.
Note that my example here was a trivial, reproducible example just to illustrate the problem. In my own case, the implementation was a plugin in a larger system, where the main system ended up starting several executions as a result. It was rather convoluted to work out what was going on, given that I was not using subprocesses or the multiprocessing library anywhere in my own code.
As such, it's unexpected for the end-user, while not a surprise to you as the implementor, of course.
Given that the process forking makes the end-user lose control of the terminal (it hangs on six idle Python sub-processes), I think the only option is to run as a single process by default on Windows, with an option to enable multiprocessing if desired (options 1 and 2 combined).
I expect the main guard to be required on Linux too, so the examples would need to be updated to show it if you keep multiprocessing enabled by default.
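Whether the guard is strictly required depends on the start method in use, which can be checked with the stdlib:

```python
import multiprocessing as mp

# "fork" (the Linux default) clones the parent process, so module-level
# code is not re-executed in children; "spawn" (the Windows default,
# and macOS since Python 3.8) re-imports the main module, which is why
# the guard is required there. The multiprocessing docs recommend the
# guard everywhere for portability.
print(mp.get_start_method())
```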
Hi @mstormo, thanks for letting us know!
We will make updates to the documentation based on your feedback.
As @srib mentioned, when we use multiprocessing in Python, it almost always needs a main guard; by default, the ingestion uses multiprocessing (not Dask). The reading, writing, and export can use Dask if needed, but it's off by default. We have plans to Daskify the ingestion as well. We should clarify this for sure.
https://mdio-python.readthedocs.io/en/latest/reference.html#seismic-data