Comments (23)
Note that this issue may be relevant for handling scalar string datasets: fsspec/kerchunk#387 . The proposed fix in numcodecs has not yet been released.
from hdmf-zarr.
Some notes in advance of our chat.
My forked kerchunk now has a number of customizations. I am not sure what the best strategy is for merging back, but for now I am making adjustments as needed.
Neurosift now supports .zarr.json files generated by (forked) kerchunk. I have a gh action that is iterating through all the nwb assets on dandi and preparing kerchunk files. The expectation of course is that these would all need to be replaced as we work out issues, but I wanted to see how time-consuming the processing would be. I have parallelization in place, etc.
Here's an example that loads really quickly in neurosift:
https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/
It points to the dandi hdf5 file, but neurosift internally checks the https://kerchunk.neurosift.org bucket for the corresponding .zarr.json file. If it finds it, it uses that. (you can also manually point to any .zarr.json file)
Many of these NWB files have thousands, tens of thousands, or even hundreds of thousands of chunks. It becomes impractical to kerchunk such files when streaming the remote file -- it's typically one network request per chunk! A very large .zarr.json file is also a disadvantage, because it takes longer to load in neurosift (or other tools). So, for now I have introduced a parameter in kerchunk-fork called `num_chunks_per_dataset_threshold`, which I have set to 1000. For datasets with more than 1000 chunks, the chunks are not included directly in the .json file; instead it stores a link to the original NWB file together with the path of the dataset. I also do not process NWB files with more than 1000 items (some have an excessive number, which would make the .json file very large and, again, time-consuming to generate).
In my opinion, we should not be creating Zarr datasets with tens of thousands of individual files. With kerchunk it is possible to consolidate this a lot. What I would propose is to have up to 1000 files of around 1 GB each, plus a single .zarr.json file that references locations within those files. You would then get all the advantages of efficient Zarr storage without the excessive number of files. You also wouldn't need to put everything in a single .zarr.json; you could spread the references among a hierarchy of such files.
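For context, the reference format kerchunk emits is simple enough to sketch by hand: small metadata objects are inlined as JSON strings, and each chunk key maps to a `[url, offset, length]` triple pointing into the original file. The general shape below follows the kerchunk "version 1" spec, but the dataset path, URL, and byte ranges are made up for illustration:

```python
import json

# Hypothetical kerchunk "version 1" reference set. The format is real; the
# URL and byte ranges here are invented for illustration.
refs = {
    "version": 1,
    "refs": {
        # Small metadata objects are inlined as JSON strings...
        ".zgroup": json.dumps({"zarr_format": 2}),
        "acquisition/v_in/data/.zarray": json.dumps({
            "shape": [53971], "chunks": [53971], "dtype": "<f8",
            "compressor": None, "filters": None, "fill_value": None,
            "order": "C", "zarr_format": 2,
        }),
        # ...while each chunk maps to [url, offset, length] -- one network
        # range-request per chunk when reading remotely, which is why capping
        # the number of inlined chunks (or consolidating the underlying
        # files) matters so much.
        "acquisition/v_in/data/0": [
            "https://example.org/original.nwb",  # hypothetical URL
            4096,        # hypothetical byte offset
            53971 * 8,   # length: 53971 float64 values
        ],
    },
}
```

Consolidating many small files would just mean having many chunk keys point (with different offsets) into the same handful of ~1 GB payload files.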
My goal with all this is to be able to augment existing NWB files on DANDI by adding objects without downloading all the original data. Right now, you can create new/derivative NWB files that either duplicate a lot of data from the upstream file or are lacking in that data. Both cases are suboptimal. What the .zarr.json allows us to do is to include the data from the upstream file without downloading or duplicating it. So the flow would be... point to a .zarr.json file as the base nwb file, do some processing (streaming the data), produce new NWB items, add those into a new .zarr.json file that contains both old and new objects as references... and then share the new derivative as a json file. Ideally there would be some mechanism of submitting that to DANDI as well.
To start exploring this possibility, I tried to use fsspec/ReferenceFileSystem, zarr.open, and NWBZarrIO to load a kerchunk'd NWB file. After some adjustments to hdmf_zarr (to allow path to be a zarr.Group), it worked! At least for one example... for other examples I am getting various NWB errors. To give an idea on how this works:
```python
# Note: this only works after some tweaks to hdmf_zarr to allow path to be a zarr.Group
from fsspec.implementations.reference import ReferenceFileSystem
import zarr
from hdmf_zarr.nwb import NWBZarrIO
import pynwb
import remfile
import h5py

# This one seems to load properly
# https://neurosift.app/?p=/nwb&dandisetId=000717&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/
h5_url = 'https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/'
json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000717/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/zarr.json'

# This fails with error: Could not construct ImageSeries object due to: ImageSeries.__init__: incorrect type for 'starting_frame' (got 'int', expected 'Iterable')
# https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/
# h5_url = 'https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/'
# json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000409/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/zarr.json'

def load_with_kerchunk():
    fs = ReferenceFileSystem(json_url)
    store = fs.get_mapper(root='/', check=False)
    root = zarr.open(store, mode='r')
    # Load the NWB file from the kerchunk reference store
    with NWBZarrIO(path=root, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)

def load_with_h5_streaming():
    remf = remfile.File(h5_url)
    h5f = h5py.File(remf, mode='r')
    with pynwb.NWBHDF5IO(file=h5f, mode='r', load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)

if __name__ == "__main__":
    load_with_kerchunk()
    # load_with_h5_streaming()
```
Even with the data hosted remotely, this loads in a fraction of a second, because all the metadata comes down in one shot.
I thought a bit about how one would write new objects to the loaded file, and I think that is going to require a custom zarr storage backend.
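One possible shape for such a backend -- purely a sketch of the idea, nothing that exists today -- is a copy-on-write mapping layered over the read-only reference store: reads fall through to the kerchunk mapper, while writes land in a local overlay that could later be serialized into a new .zarr.json:

```python
from collections.abc import MutableMapping

class OverlayStore(MutableMapping):
    """Hypothetical copy-on-write store. Reads fall back to a read-only base
    mapping (e.g. a kerchunk ReferenceFileSystem mapper); writes and deletes
    are kept locally so the base file is never modified."""

    def __init__(self, base):
        self.base = base       # read-only mapping of key -> bytes
        self.overlay = {}      # new/modified keys
        self.deleted = set()   # keys masked out of the base

    def __getitem__(self, key):
        if key in self.deleted:
            raise KeyError(key)
        if key in self.overlay:
            return self.overlay[key]
        return self.base[key]

    def __setitem__(self, key, value):
        self.deleted.discard(key)
        self.overlay[key] = value

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.overlay.pop(key, None)
        self.deleted.add(key)

    def __iter__(self):
        for k in self.base:
            if k not in self.deleted and k not in self.overlay:
                yield k
        yield from self.overlay

    def __len__(self):
        return sum(1 for _ in self)
```

Since zarr (v2) stores are just mutable mappings of key to bytes, something like `zarr.open(OverlayStore(fs.get_mapper()), mode='r+')` could in principle give write access while the original data stays untouched; the overlay would then hold exactly the "new objects" to be shared.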
Well that's a lot of info all at once... just thought I'd put it out there in advance of the meeting.
What are the key differences between the way hdmf-zarr is doing the conversion vs. the way kerchunk is doing it?
Here's the first thing I looked at: `acquisition/v_in/data/.zattrs`
kerchunk version:
```json
{"_ARRAY_DIMENSIONS": ["phony_dim_0"], "conversion": 1.0, "offset": 0.0, "resolution": -1.0, "unit": "V"}
```
hdmf-zarr version:
```json
{
  "conversion": 1.0,
  "offset": 0.0,
  "resolution": -1.0,
  "unit": "V",
  "zarr_dtype": "float64"
}
```
So in this case, the difference is that hdmf-zarr adds `zarr_dtype`, whereas kerchunk adds `_ARRAY_DIMENSIONS`.
Regarding "phony_dim_0", I found this in the kerchunk source code
I'll keep looking for other differences.
Another difference. In the acquisition/v_in/data/.zarray
hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk -- the original hdf5 file uses 1 chunk, so I'm not sure why hdmf-zarr is splitting into two (the shape is [53971]). I guess this difference doesn't matter too much... but I am curious about how hdmf-zarr decides about chunking.
Here's the next thing I looked at: `file_create_date/.zattrs`
kerchunk version:
```json
{"_ARRAY_DIMENSIONS": ["phony_dim_0"]}
```
hdmf-zarr version:
```json
{
  "zarr_dtype": "object_"
}
```
Another small difference: for the session start time, kerchunk doesn't have any attributes, but hdmf-zarr has `zarr_dtype: scalar`.
Maybe these differences aren't too big. Here are the questions I have:
- Is hdmf-zarr going to be able to read Zarr output from kerchunk? Or does it require `zarr_dtype` to be provided on all datasets?
- Do we fork kerchunk so that it can handle references, scalar datasets, etc.?
- Or do we update hdmf-zarr to have the functionality of kerchunk, i.e., generating a .zarr.json file?
but hdmf-zarr has `zarr_dtype: scalar`
We have a couple of reserved attributes that hdmf-zarr adds; see https://hdmf-zarr.readthedocs.io/en/latest/storage.html#reserved-attributes. The main reason these exist is that Zarr does not support object references and links, so we had to implement support for them in hdmf-zarr. See the storage model for links and object references for details on how this is implemented. I believe we set the `zarr_dtype` attribute in all cases (although I think it is strictly only necessary to identify links and references).
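To make the reserved-attribute mechanism concrete, here is roughly what an attribute holding an object reference might look like, with the caveat that the exact key names below are my reading of the storage docs linked above and should be treated as assumptions:

```python
# Plain Zarr attributes are just JSON, so a reference can be stored as a
# tagged dict. Key names here are assumptions based on the hdmf-zarr
# reserved-attributes docs, not verified output.
ref_attr = {
    "zarr_dtype": "object",          # reserved marker: "this is a reference"
    "value": {
        "source": ".",               # file containing the target (here: same file)
        "path": "/acquisition/v_in", # path of the target object within it
    },
}

def is_object_reference(attr) -> bool:
    """On read, only dicts tagged with the reserved zarr_dtype key are
    treated as references; everything else is plain attribute data."""
    return isinstance(attr, dict) and attr.get("zarr_dtype") == "object"

print(is_object_reference(ref_attr))        # True
print(is_object_reference({"unit": "V"}))   # False
```

This is why a kerchunk-generated store without `zarr_dtype` tags would leave such attributes unresolved: the reader has no marker to distinguish a reference dict from ordinary data.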
Another difference. In the `acquisition/v_in/data/.zarray`, hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk
Kerchunk (as far as I know) only indexes the HDF5 file, i.e., it does not convert the data, so it simply represents how the data is laid out in the HDF5 file. In contrast, when converting data with hdmf-zarr we are creating a new Zarr file, so the layout of the data can change. PR #153 adds functionality to preserve the chunking and compression settings used in an HDF5 file as much as possible. When chunking is not specified, I believe the Zarr library makes its own best guess at how to break the data into chunks. My guess is that the change from 1 to 2 chunks is due to the chunking not being specified during conversion, with Zarr making its own guess at how to store it. I think #153 should fix this for conversion from HDF5 to Zarr.
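The arithmetic behind the 1-vs-2 chunk difference is simple; note the ~27000 chunk length below is a made-up stand-in for whatever Zarr's heuristic actually guessed, not its real output:

```python
import math

def n_chunks(shape, chunks):
    """Number of chunk objects a Zarr array stores for a given chunk shape:
    the product of per-axis ceilings of shape/chunk."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# If conversion leaves chunking unspecified and Zarr guesses a chunk length
# of ~27000 for the (53971,) array, the data lands in 2 chunk objects:
print(n_chunks((53971,), (26986,)))  # 2

# Preserving the HDF5 layout (one chunk covering the whole dataset) gives 1:
print(n_chunks((53971,), (53971,)))  # 1
```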
Thanks @oruebel . Somehow I didn't come across those docs. Very helpful.
I think I understand the requirements now, so this issue can be closed.
I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.
@magland Very interesting. We were also planning to look at kerchunk as an option to use through hdmf-zarr to facilitate remote access to HDF5 files. Would be great to sync up.
As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.
Thanks for sharing the JSON. Could you point me to where a link and object reference is being translated? I didn't see the `zarr_dtype` anywhere to tell hdmf_zarr to handle the links.
@rly @sinha-r the discussion here is relevant to the nwb cloud benchmark efforts
I think I understand the requirements now, so this issue can be closed.
@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.
As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.
Thanks for sharing the JSON. Could you point me to where a link and object reference is being translated? I didn't see the `zarr_dtype` anywhere to tell hdmf_zarr to handle the links.
Oops I accidentally shared the version of the file before I did the patches. Here's the one which includes the references.
Search for `.specloc` and `"target"`
I think I understand the requirements now, so this issue can be closed.
@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.
Happy to chat
Oops I accidentally shared the version of the file before I did the patches. Here's the one which includes the references.
Thanks for sharing this updated version. On read, I think the attributes that have links here will probably not be resolved correctly by hdmf-zarr, because the `zarr_dtype` is missing. The relevant code for reading object references stored in attributes is here:
hdmf-zarr/src/hdmf_zarr/backend.py, lines 1486 to 1493 (at 556ed12)
Happy to chat
Great! We'll reach out to schedule a chat.
I made a fork of kerchunk to add what I am calling an `hdmf_mode` parameter:
fsspec/kerchunk@main...magland:kerchunk-fork:hdmf
(note this is the hdmf branch of my fork)
It can be tested via:

```python
import json
import h5py
import remfile
import kerchunk.hdf

# 000713 - Allen Institute - Visual Behavior - Neuropixels
# https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft
url = "https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/"

# Translate the remote hdf5 file to a local .zarr.json
remf = remfile.File(url, verbose=False)
with h5py.File(remf, 'r') as f:
    grp = f
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(grp, url=url, hdmf_mode=True)
    a = h5chunks.translate()
    with open('example1.zarr.json', 'w') as g:
        json.dump(a, g, indent=2)
```
This gives one warning for a dataset with embedded references. See note in source code.
Great! Thank you @magland . I will take a look. I hope that some or all of these changes can be integrated back into kerchunk at some point.
Amazing! I can't wait to play around with this 😁
@bendichter In our meeting we came up with a game plan for how to move forward, and we'll provide updates once we have made some progress on that.
Thanks for the great discussions! I'm closing this issue for now just for house keeping. Feel free to reopen if necessary.
@magland sounds great. Would you mind sharing a summary of the plan?
@bendichter we're working on something called Linked Data Interface (LINDI). Feel free to comment or contribute over there.