
kerchunk considerations about hdmf-zarr (23 comments, closed)

magland commented on August 16, 2024
kerchunk considerations


Comments (23)

rly commented on August 16, 2024

Note that this issue may be relevant for handling scalar string datasets: fsspec/kerchunk#387. The proposed fix in numcodecs has not yet been released.


magland commented on August 16, 2024

Some notes in advance of our chat.

My forked kerchunk now has a number of customizations. I am not sure what the best strategy is for merging back, but for now I am making adjustments as needed.

Neurosift now supports .zarr.json files generated by (forked) kerchunk. I have a GitHub Action that iterates through all the NWB assets on DANDI and prepares kerchunk files. The expectation, of course, is that these will all need to be regenerated as we work out issues, but I wanted to see how time-consuming the processing would be. I have parallelization in place, etc.

Here's an example that loads really quickly in neurosift:
https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/

It points to the DANDI HDF5 file, but neurosift internally checks the https://kerchunk.neurosift.org bucket for the corresponding .zarr.json file. If it finds one, it uses that. (You can also manually point to any .zarr.json file.)
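
For reference, the bucket lookup follows a simple URL pattern, inferred from the example URLs later in this thread (this is just a sketch, not the actual neurosift code):

# Sketch only: derive the kerchunk.neurosift.org lookup URL for a DANDI asset.
# The pattern is inferred from the example URLs quoted in this thread.
def kerchunk_json_url(dandiset_id: str, asset_id: str) -> str:
    return (
        "https://kerchunk.neurosift.org/dandi/dandisets/"
        f"{dandiset_id}/assets/{asset_id}/zarr.json"
    )

print(kerchunk_json_url("000409", "54b277ce-2da7-4730-b86b-cfc8dbf9c6fd"))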

Many of these NWB files have thousands, tens of thousands, or even hundreds of thousands of chunks. It becomes impractical to kerchunk such files while streaming the remote file -- it's typically one network request per chunk! A very large .zarr.json file is also a disadvantage, because it takes longer to load in neurosift (or other tools). So, for now I have introduced a parameter in the kerchunk fork called num_chunks_per_dataset_threshold and set it to 1000. For datasets with more than 1000 chunks, the chunk references are not included directly in the .json file; instead, the file records a link to the original NWB file and the path of the dataset within it. I also do not process NWB files with more than 1000 items (some have an excessive number of items, which would make the .json file very large and, again, time-consuming to generate).

In my opinion, we should not be creating Zarr datasets made up of tens of thousands of individual chunk files. With kerchunk it is possible to consolidate this a lot. What I would propose is to have up to 1000 files that are around 1 GB in size, and then a single .zarr.json file that references locations within those files. Then you would have all the advantages of efficient Zarr storage, but without the excessive number of files. Also, you wouldn't need to put everything in a single .zarr.json; you could spread the references among a hierarchy of such files.
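
To make that proposal concrete, here is a rough sketch of what such a consolidated reference file could look like, using kerchunk's reference format in which each chunk key maps to a (url, offset, length) triple; the pack file names, offsets, and sizes below are made up for illustration:

# Hypothetical consolidated reference layout (illustration only):
# a small number of large "pack" files hold the chunk bytes, and the
# .zarr.json maps each Zarr chunk key to a byte range within one of them.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "acquisition/v_in/data/.zarray": "...",  # Zarr metadata stored inline as a JSON string
        "acquisition/v_in/data/0": ["https://example.org/pack-0000.bin", 0, 1048576],
        "acquisition/v_in/data/1": ["https://example.org/pack-0000.bin", 1048576, 1048576],
    },
}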

My goal with all this is to be able to augment existing NWB files on DANDI by adding objects without downloading all the original data. Right now, you can create new/derivative NWB files that either duplicate a lot of data from the upstream file or omit that data; both cases are suboptimal. What the .zarr.json allows us to do is include the data from the upstream file without downloading or duplicating it. So the flow would be: point to a .zarr.json file as the base NWB file, do some processing (streaming the data), produce new NWB items, add those into a new .zarr.json file that contains both old and new objects as references... and then share the new derivative as a JSON file. Ideally there would be some mechanism of submitting that to DANDI as well.

To start exploring this possibility, I tried to use fsspec's ReferenceFileSystem, zarr.open, and NWBZarrIO to load a kerchunk'd NWB file. After some adjustments to hdmf_zarr (to allow path to be a zarr.Group), it worked! At least for one example... for other examples I am getting various NWB errors. To give an idea of how this works:

# Note: this only works after some tweaks to hdmf_zarr to allow path to be a zarr.Group

from fsspec.implementations.reference import ReferenceFileSystem
import zarr
from hdmf_zarr.nwb import NWBZarrIO
import pynwb
import remfile
import h5py

# This one seems to load properly
# https://neurosift.app/?p=/nwb&dandisetId=000717&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/
h5_url = 'https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/'
json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000717/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/zarr.json'


# This fails with error: Could not construct ImageSeries object due to: ImageSeries.__init__: incorrect type for 'starting_frame' (got 'int', expected 'Iterable')
# https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/
# h5_url = 'https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/'
# json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000409/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/zarr.json'


def load_with_kerchunk():
    fs = ReferenceFileSystem(json_url)
    store = fs.get_mapper(root='/', check=False)
    root = zarr.open(store, mode='r')

    # Load NWB file
    with NWBZarrIO(path=root, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)


def load_with_h5_streaming():
    remf = remfile.File(h5_url)
    h5f = h5py.File(remf, mode='r')
    with pynwb.NWBHDF5IO(file=h5f, mode='r', load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)


if __name__ == "__main__":
    load_with_kerchunk()
    # load_with_h5_streaming()

Even with the data hosted remotely, this loads in a fraction of a second, because all the metadata comes down in one shot.

I thought a bit about how one would write new objects to the loaded file, and I think that is going to require a custom zarr storage backend.
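
One possible shape for such a backend, sketched here only to illustrate the idea (this is not part of hdmf-zarr or the kerchunk fork): a store that answers reads from the kerchunk reference mapper and stages writes in a separate dict, which could later be serialized into a new .zarr.json.

from collections.abc import MutableMapping

class OverlayStore(MutableMapping):
    """Sketch of a Zarr (v2) store that reads from a read-only base mapping
    (e.g. the ReferenceFileSystem mapper above) and stages new or modified
    keys locally instead of touching the base."""

    def __init__(self, base):
        self._base = base    # existing keys, e.g. fs.get_mapper(root='/', check=False)
        self._staged = {}    # keys written after opening

    def __getitem__(self, key):
        if key in self._staged:
            return self._staged[key]
        return self._base[key]

    def __setitem__(self, key, value):
        self._staged[key] = value

    def __delitem__(self, key):
        # This sketch only supports deleting keys it wrote itself.
        del self._staged[key]

    def __iter__(self):
        yield from self._staged
        for k in self._base:
            if k not in self._staged:
                yield k

    def __len__(self):
        return sum(1 for _ in self)

# e.g. root = zarr.open(OverlayStore(store), mode="r+"), then add new groups/datasets;
# the contents of _staged would then need to be folded back into a new .zarr.json.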

Well that's a lot of info all at once... just thought I'd put it out there in advance of the meeting.


bendichter commented on August 16, 2024

What are the key differences between the way hdmf-zarr is doing the conversion vs. the way kerchunk is doing it?


magland commented on August 16, 2024

What are the key differences between the way hdmf-zarr is doing the conversion vs. the way kerchunk is doing it?

Here's the first thing I looked at: acquisition/v_in/data/.zattrs

kerchunk version:

{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"],\"conversion\":1.0,\"offset\":0.0,\"resolution\":-1.0,\"unit\":\"V\"}

hdmf-zarr version:

{
    "conversion": 1.0,
    "offset": 0.0,
    "resolution": -1.0,
    "unit": "V",
    "zarr_dtype": "float64"
}

So in this case, the difference is that hdmf-zarr adds zarr_dtype, whereas kerchunk adds _ARRAY_DIMENSIONS.

Regarding "phony_dim_0", I found this in the kerchunk source code

https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L544-L549

I'll keep looking for other differences.


magland commented on August 16, 2024

Another difference: in acquisition/v_in/data/.zarray, hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk -- the original HDF5 file uses 1 chunk, so I'm not sure why hdmf-zarr is splitting it into two (the shape is [53971]). I guess this difference doesn't matter too much... but I am curious about how hdmf-zarr decides on chunking.

Here's the next thing I looked at:
In file_create_date/.zattrs:

kerchunk version

{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}

hdmf-zarr version

{
    "zarr_dtype": "object_"
}


magland commented on August 16, 2024

Another small difference: for the session start time, kerchunk doesn't have any attributes, but hdmf-zarr has zarr_dtype: scalar.

Maybe these differences aren't too big. Here are the questions I have.

  • Is hdmf-zarr going to be able to read Zarr output from kerchunk? Or does it require zarr_dtype to be provided on all datasets?
  • Do we fork kerchunk so that it can handle references, scalar datasets, etc.?
  • Or do we update hdmf-zarr to have the functionality of kerchunk, i.e., generating a .zarr.json file?


oruebel commented on August 16, 2024

but hdmf-zarr has zarr_dtype: scalar

We have a couple of reserved attributes that hdmf-zarr adds; see https://hdmf-zarr.readthedocs.io/en/latest/storage.html#reserved-attributes. The main reason these exist is that Zarr does not support object references and links, so we had to implement support for them in hdmf-zarr. See the storage model for links and object references for details on how this is implemented. I believe we set the zarr_dtype attribute in all cases (although I think it is strictly only necessary to identify links and references).

Another difference. In the acquisition/v_in/data/.zarray hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk

Kerchunk (as far as I know) only indexes the HDF5 file, i.e., it does not convert the data. As such, kerchunk is (I believe) just representing how the data is laid out in the HDF5 file. In contrast, when converting data with hdmf-zarr we are creating a new Zarr file, so the layout of the data can change. PR #153 adds functionality to try to preserve the chunking and compression settings used in an HDF5 file as much as possible. When chunking is not specified, I believe the Zarr library makes its own best guess at how to break the data into chunks. My guess is that the change from 1 to 2 chunks is due to the chunking not being specified during conversion and Zarr making its own guess about how to store it. I think #153 should fix this for conversion from HDF5 to Zarr.
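
For context, preserving the source layout during conversion essentially means copying the HDF5 chunk shape (and, where possible, the compression) onto the new Zarr dataset. A minimal sketch of that idea follows; the file names are placeholders, and this is not the actual hdmf-zarr or PR #153 code:

import h5py
import zarr
from numcodecs import GZip

# Sketch: copy one HDF5 dataset into a Zarr group while keeping the original
# chunk shape, falling back to a single chunk for contiguous datasets.
with h5py.File("source.nwb", "r") as f:
    dset = f["acquisition/v_in/data"]
    out = zarr.open_group("converted.zarr", mode="a")
    grp = out.require_group("acquisition").require_group("v_in")
    grp.create_dataset(
        "data",
        data=dset[...],
        chunks=dset.chunks if dset.chunks is not None else dset.shape,
        compressor=GZip() if dset.compression == "gzip" else None,
        overwrite=True,
    )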


magland commented on August 16, 2024

Thanks @oruebel. Somehow I hadn't come across those docs. Very helpful.

I think I understand the requirements now, so this issue can be closed.


oruebel commented on August 16, 2024

I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

@magland Very interesting. We were also planning to look at kerchunk as an option to use through hdmf-zarr to facilitate remote access to HDF5 files. Would be great to sync up.


oruebel commented on August 16, 2024

As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

Thanks for sharing the JSON. Could you point me to where a link and an object reference are being translated? I didn't see the zarr_dtype anywhere to tell hdmf_zarr to handle the links.


oruebel commented on August 16, 2024

@rly @sinha-r The discussion here is relevant to the NWB cloud benchmark efforts.


oruebel commented on August 16, 2024

I think I understand the requirements now, so this issue can be closed.

@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.


magland commented on August 16, 2024

As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

Thanks for sharing the JSON. Could you point me to where a link and object reference is being translated? I didn't see the zarr_dtype anywhere to tell hdmf_zarr to handle the links.

Oops, I accidentally shared the version of the file from before I did the patches. Here's the one that includes the references.

example1.zarr.json

Search for .specloc and \"target\"


magland commented on August 16, 2024

I think I understand the requirements now, so this issue can be closed.

@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.

Happy to chat


oruebel commented on August 16, 2024

Oops I accidentally shared the version of the file before I did the patches. Here's the one which includes the references.

Thanks for sharing this updated version. On read, I think the attributes that have links here will probably not be resolved correctly by hdmf-zarr, because the zarr_dtype is missing. The relevant code for reading object references stored in attributes is here:

if isinstance(v, dict) and 'zarr_dtype' in v:
    if v['zarr_dtype'] == 'object':
        target_name, target_zarr_obj = self.resolve_ref(v['value'])
        if isinstance(target_zarr_obj, zarr.hierarchy.Group):
            ret[k] = self.__read_group(target_zarr_obj, target_name)
        else:
            ret[k] = self.__read_dataset(target_zarr_obj, target_name)
    # TODO Need to implement region references for attributes
and an example of how the JSON should be formatted for the attribute is here: https://hdmf-zarr.readthedocs.io/en/latest/storage.html#storing-object-references-in-attributes
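
Per that section of the storage docs, the attribute value is expected to be a small JSON object along these lines (a rough sketch; the target path is illustrative, and the spec linked above is the authoritative reference):

# Rough sketch of an attribute holding an object reference, following the
# hdmf-zarr storage docs linked above (illustrative target path):
attr_value = {
    "zarr_dtype": "object",
    "value": {
        "source": ".",                        # Zarr file containing the target object
        "path": "/general/devices/device0",  # path of the target within that file
    },
}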


oruebel commented on August 16, 2024

Happy to chat

Great! We'll reach out to schedule a chat.


magland commented on August 16, 2024

I made a fork of kerchunk to add what I am calling the hdmf_mode parameter:

fsspec/kerchunk@main...magland:kerchunk-fork:hdmf

(note this is the hdmf branch of my fork)

It can be tested via

import json
import h5py
import remfile
import kerchunk.hdf


# 000713 - Allen Institute - Visual Behavior - Neuropixels
# https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft
url = "https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/"


# Translate the remote HDF5 file into a local .zarr.json
remf = remfile.File(url, verbose=False)
with h5py.File(remf, 'r') as f:
    grp = f
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(grp, url=url, hdmf_mode=True)
    a = h5chunks.translate()
    with open('example1.zarr.json', 'w') as g:
        json.dump(a, g, indent=2)

This gives one warning for a dataset with embedded references. See the note in the source code.


rly commented on August 16, 2024

Great! Thank you @magland . I will take a look. I hope that some or all of these changes can be integrated back into kerchunk at some point.


bendichter commented on August 16, 2024

Amazing! I can't wait to play around with this 😁


magland commented on August 16, 2024

@bendichter In our meeting we came up with a game plan for how to move forward, and we'll provide updates once we have made some progress on that.


oruebel commented on August 16, 2024

Thanks for the great discussions! I'm closing this issue for now just for housekeeping. Feel free to reopen if necessary.


bendichter commented on August 16, 2024

@magland sounds great. Would you mind sharing a summary of the plan?


magland commented on August 16, 2024

@bendichter we're working on something called Linked Data Interface (LINDI). Feel free to comment or contribute over there.

