Comments (5)
Approach and Plan
For DataChunkIterators, the easiest place to start, I believe, is to look at how to parallelize the ZarrIODataChunkIteratorQueue.exhaust_queue function.
This is a good place to start because: 1) it keeps things simple, since at that point all the setup for the file and datasets is already done and all we deal with is writing the datasets themselves; 2) it is a fairly simple function; and 3) it allows us to parallelize both across chunks within a dataset and across all DataChunkIterator objects being used. I believe this should actually cover all cases that involve writing a DataChunkIterator, because the ZarrIO.write_dataset method also just calls the exhaust_queue function to write from a DataChunkIterator, here:
hdmf-zarr/src/hdmf_zarr/backend.py
Lines 839 to 840 in 531296c
only in this case the queue will always contain just a single DataChunkIterator object. That is, when calling ZarrIO.write with exhaust_dci=False we would get all DataChunkIterator objects at once as part of the queue, and with exhaust_dci=True we would see them one at a time. Either way, parallelizing exhaust_queue should take care of all cases related to parallel write of DataChunkIterator objects.
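To make the idea concrete, here is a minimal, self-contained sketch of what a parallelized queue exhaustion could look like. All names (`exhaust_queue_parallel`, `make_chunk_iterator`, `write_chunk`) are illustrative stand-ins, not hdmf-zarr APIs, and plain Python lists stand in for Zarr datasets:

```python
# Hypothetical sketch: drain a queue of chunk iterators, dispatching each
# chunk write to a thread pool. In hdmf-zarr the targets would be Zarr
# datasets; here plain lists stand in so the example is self-contained.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def make_chunk_iterator(dataset, values, chunk_size):
    """Yield (dataset, offset, chunk) triples, like a DataChunkIterator."""
    for start in range(0, len(values), chunk_size):
        yield dataset, start, values[start:start + chunk_size]

def write_chunk(dataset, offset, chunk):
    # Each chunk touches a disjoint slice, so no locking is needed as long
    # as the iterator partitions the data evenly across chunks.
    dataset[offset:offset + len(chunk)] = chunk

def exhaust_queue_parallel(queue, max_workers=4):
    """Exhaust every iterator in the queue, writing chunks concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        while queue:
            iterator = queue.popleft()
            for dataset, offset, chunk in iterator:
                futures.append(pool.submit(write_chunk, dataset, offset, chunk))
        for future in futures:
            future.result()  # re-raise any exception from a worker

out_a = [0] * 10
out_b = [0] * 6
queue = deque([
    make_chunk_iterator(out_a, list(range(10)), 3),
    make_chunk_iterator(out_b, list(range(100, 106)), 2),
])
exhaust_queue_parallel(queue)
print(out_a)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(out_b)  # [100, 101, 102, 103, 104, 105]
```

Note how this parallelizes across chunks and across iterators at once: chunks from both datasets land in the same pool regardless of which iterator produced them.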
If we also want parallel write for other datasets that are specified via numpy arrays or lists, that would require us to also look at ZarrIO.write, but as a start, focusing on DataChunkIterator should be fine. For the more general case we would also likely need to worry about how to intelligently determine how to write such datasets in parallel, whereas with DataChunkIterator and ZarrDataIO there is a more explicit way for the user to control things.
Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr
If this needs to be parametrized on a per-dataset level, then the parameters would probably go into ZarrDataIO. If it needs to be parametrized across datasets (i.e., the same settings for all), then they would probably go as parameters of ZarrIO.write or ZarrIO itself. This all depends on how flexible we need it to be. ZarrIO may end up having to pass these to ZarrDataChunkIteratorQueue, but that is an internal detail a user would normally not see, since the queue is an internal data structure that is not exposed to the user (i.e., it is only stored in a private member variable and should not be used directly).
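A small sketch of the per-dataset vs. file-level option, with stand-in classes (none of these names or parameters exist in hdmf-zarr; `FakeZarrDataIO`/`FakeZarrIO` only illustrate how a per-dataset setting could override a file-level default):

```python
# Illustrative only: a file-level default on the IO object that a
# per-dataset setting (carried by the DataIO wrapper) can override.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParallelWriteSettings:
    enabled: bool = False
    max_workers: int = 1
    method: str = "threads"  # e.g., "threads" | "processes" | "dask"

@dataclass
class FakeZarrDataIO:
    """Stand-in for ZarrDataIO, extended with a hypothetical override."""
    data: object
    parallel: Optional[ParallelWriteSettings] = None

class FakeZarrIO:
    """Stand-in for ZarrIO with a hypothetical file-level default."""
    def __init__(self, default_parallel=None):
        self.default_parallel = default_parallel or ParallelWriteSettings()

    def settings_for(self, data_io):
        # Per-dataset settings win; otherwise fall back to the default.
        if data_io.parallel is not None:
            return data_io.parallel
        return self.default_parallel

io = FakeZarrIO(default_parallel=ParallelWriteSettings(enabled=True, max_workers=8))
plain = FakeZarrDataIO(data=[1, 2, 3])
custom = FakeZarrDataIO(data=[4, 5], parallel=ParallelWriteSettings(enabled=False))
print(io.settings_for(plain).max_workers)  # 8 (file-level default)
print(io.settings_for(custom).enabled)     # False (per-dataset override)
```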
from hdmf-zarr.
Another part that we likely need to consider:
- ensure that we use the correct synchronizer for Zarr in ZarrIO.__init__
- we may also need to ensure that numcodecs.blosc.use_threads = False is set, per https://zarr.readthedocs.io/en/stable/tutorial.html#parallel-computing-and-synchronization
@CodyCBakerPhD what approach for parallel write were you thinking of: Dask, MPI, joblib, multiprocessing, etc.?
(1) ensure that we use the correct synchronizer for Zarr in ZarrIO.__init__
As I understand it, with multiprocessing we should not need a synchronizer if the DataChunkIterator can be trusted to properly partition the data evenly across chunks (which the GenericDataChunkIterator necessarily does, so it may be easiest to support only that to begin with).
Unsure about threading; haven't tried that yet.
(2) @CodyCBakerPhD what approach for parallel write were you thinking of, Dask, MPI, joblib, multiprocessing etc.?
concurrent.futures is our go-to choice as a nicer simplification of multiprocessing and multithreading that avoids much of the queue management and other overhead.
We would probably start there, but we could think of a design that is extensible to a user's choice of backend, especially given how well Dask is regarded to work with Zarr.
I do not recommend MPI; its method of encoding the parallelization has always felt rather awkward, and I don't even know how we would use it at these deeper levels of abstraction (the MPI demos I've seen consist of simple scripts that get 'deployed' via CLI, which is not exactly our use case either).
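One reason concurrent.futures is attractive here: the same dispatch code runs unchanged under threads or processes, so the backend can become a single parameter. A minimal sketch (`checksum` and `process_chunks` are illustrative names standing in for per-chunk write work):

```python
# The executor class is interchangeable: ThreadPoolExecutor and
# ProcessPoolExecutor share the same interface, so the choice of
# threads vs. processes reduces to one argument.
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    # Stand-in for "process/write one chunk" of a dataset.
    return sum(chunk)

def process_chunks(chunks, executor_cls=ThreadPoolExecutor, max_workers=2):
    with executor_cls(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with chunks.
        return list(pool.map(checksum, chunks))

chunks = [[1, 2], [3, 4], [5]]
print(process_chunks(chunks))  # [3, 7, 5]
# Swapping in ProcessPoolExecutor is the same call, but the worker function
# must be importable at module level (it is pickled), and on spawn-based
# platforms the call should sit under an `if __name__ == "__main__":` guard.
```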
concurrent.futures is our go-to choice
Sounds reasonable to me as a start.
we could think of a design that is extendable to a user's choice of backend
Let's see how this goes. If ZarrIODataChunkIteratorQueue.exhaust_queue is the main place we need to modify, then to make it configurable we could either: 1) implement different derived versions of ZarrIODataChunkIteratorQueue (e.g., a DaskZarrIODataChunkIteratorQueue) and add a parameter on ZarrIO.__init__ to select which queue to use, or 2) make the behavior configurable in a single ZarrIODataChunkIteratorQueue class. At first glance, I'd prefer option 1.
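A toy sketch of option 1: a base queue with a serial exhaust_queue and a derived class that overrides it with a parallel strategy. The class names mirror the ones discussed above but the bodies are purely illustrative (plain iterators stand in for DataChunkIterators, and the hypothetical selection parameter on ZarrIO.__init__ is only mentioned in comments):

```python
# Option 1 sketch: behavior is selected by subclassing the queue, and
# ZarrIO.__init__ could accept a (hypothetical) queue-class parameter.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

class SerialIteratorQueue(deque):
    """Base queue: exhaust each iterator one after the other."""
    def exhaust_queue(self):
        written = []
        while self:
            written.extend(self.popleft())
        return written

class ParallelIteratorQueue(SerialIteratorQueue):
    """Derived queue: drain all iterators concurrently in a thread pool."""
    def __init__(self, iterable=(), max_workers=2):
        super().__init__(iterable)
        self.max_workers = max_workers

    def exhaust_queue(self):
        iterators = [self.popleft() for _ in range(len(self))]
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            # One worker exhausts one iterator; map preserves order.
            results = pool.map(list, iterators)
        written = []
        for result in results:
            written.extend(result)
        return written

q = ParallelIteratorQueue([iter([1, 2]), iter([3, 4])])
print(q.exhaust_queue())  # [1, 2, 3, 4]
```

Because both classes expose the same exhaust_queue signature, the calling code in ZarrIO would not need to change when a different queue class is selected, which is what makes the subclassing approach appealing.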