Comments (9)

lsterzinger commented on July 30, 2024

What do the speeds look like if you add simple_templates=True to fsspec.get_mapper('reference://'...)? Or regenerate the combined metadata with mz2z.translate(combined_json, template_count=None)

from kerchunk.

dmedv commented on July 30, 2024

No significant change in speed with

mapper = fsspec.get_mapper('reference://', fo=combined_json, simple_templates=True)

Seems to be even slower by 20ms, but that's within the margin of error, I guess. Same result after regenerating metadata with template_count=None.

martindurant commented on July 30, 2024

I would love to play with optimisation on your dataset and see where the bottlenecks are. A huge advantage of kerchunk/referenceFS is that the code is very small and simple Python, so we can find and fix slow paths - or just suggest better default values when building your references. The existing kerchunk timings tend to be for remote datasets, which is what it was really built for - and never at the ms time-scale.

dmedv commented on July 30, 2024

@martindurant Thank you. The dataset has not been made public yet, but I've come up with a synthetic test that I believe illustrates the same issue:

from kerchunk.zarr import single_zarr
from kerchunk.combine import MultiZarrToZarr
import fsspec
import zarr
import xarray
import numpy as np

combined_json = 'file:///home/idies/workspace/combined.json'

nx = 5000
ny = 5000
chunk_shape = (1, 10, 10)

refs = []
for i in range(1,3):
    path = f'data/example{i}.zarr'
    store = zarr.DirectoryStore(path)
    root = zarr.open_group(path, mode='w')

    t = root.zeros('t', shape=(1,), chunks=(1,), overwrite=True)
    x = root.zeros('x', shape=(nx,), chunks=(nx,), overwrite=True)
    y = root.zeros('y', shape=(ny,), chunks=(ny,), overwrite=True)
    t[0] = i - 1
    t.attrs['_ARRAY_DIMENSIONS'] = ['t']
    x[:] = np.arange(nx)
    x.attrs['_ARRAY_DIMENSIONS'] = ['x']
    y[:] = np.arange(ny)
    y.attrs['_ARRAY_DIMENSIONS'] = ['y']                        
    a = root.zeros('U', shape=(1, nx, ny), chunks=chunk_shape, overwrite=True)
    a[...] = np.random.rand(1, nx, ny)
    a.attrs['_ARRAY_DIMENSIONS'] = ['t','x','y']
    
    refs.append(single_zarr(path))

mz2z = MultiZarrToZarr(refs, remote_protocol='file', xarray_concat_args={'dim': 't'})
mz2z.translate(combined_json, template_count=None)

mapper = fsspec.get_mapper('reference://', fo=combined_json)

ds1 = zarr.open(mapper, mode='r')
%timeit ds1['U'][0, 0:10, 0:10]
# 494 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ds2 = zarr.open('data/example1.zarr', mode='r')
%timeit ds2['U'][0, 0:10, 0:10]
# 770 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It takes a few minutes to run and produces a fairly large reference file, about 40MB. As you can see, reading through kerchunk in this case is also much slower than reading a single zarr array directly. Reducing the number of chunks improves kerchunk performance, and in the trivial case of a single chunk per array it becomes just as fast as reading zarr directly.
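A back-of-envelope check on where the ~40MB comes from (the ~80 bytes per JSON reference entry is an assumption for illustration, not a measured figure):

```python
# Chunk-count arithmetic for the synthetic example above.
nx = ny = 5000
chunk_shape = (1, 10, 10)

# 'U' has shape (1, nx, ny), so the chunk grid per array is:
chunks_per_array = (1 // chunk_shape[0]) * (nx // chunk_shape[1]) * (ny // chunk_shape[2])
print(chunks_per_array)   # 250000 chunks per 'U' array

# Two source zarr stores are combined, so the reference file holds
# roughly half a million chunk entries (plus coordinate chunks).
total_refs = 2 * chunks_per_array
print(total_refs)         # 500000

# At an assumed ~80 bytes per JSON reference entry (key + path +
# offset + length), that is on the order of the observed ~40 MB.
print(round(total_refs * 80 / 1e6))   # 40
```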

If it helps, here is the layout of our zarr arrays in the actual dataset.

        Array                     Chunk
Bytes   698.72 GB                 74.65 MB
Shape   (8, 90, 13, 4320, 4320)   (1, 1, 1, 4320, 4320)
Count   9361 Tasks                9360 Chunks
Type    float32                   numpy.ndarray

There are 30 such arrays distributed across 10 servers.
The combined array shape is (240, 90, 13, 4320, 4320) and contains 280800 chunks.
The kerchunk reference file in this case also turns out quite large, over 70MB.
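The figures above are self-consistent; a quick sanity check of the chunk count and sizes (float32, i.e. 4 bytes per element):

```python
from math import prod

shape = (8, 90, 13, 4320, 4320)   # per-server array
chunk = (1, 1, 1, 4320, 4320)
itemsize = 4                      # float32 = 4 bytes

# Chunks per array: one chunk per (t, level, face) index triple.
print(8 * 90 * 13)                             # 9360, matching the repr
print(round(prod(shape) * itemsize / 1e9, 2))  # 698.72 (GB)
print(round(prod(chunk) * itemsize / 1e6, 2))  # 74.65 (MB)

# Combined over all 30 arrays: shape (240, 90, 13, 4320, 4320)
print(240 * 90 * 13)                           # 280800 chunks
```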

martindurant commented on July 30, 2024

Bottleneck found and fixed in #892.

Note that I recommend Zstd-compressing the JSON file for big outputs, because it always decompresses quickly, but with a great compression ratio for this kind of data. It might lead to faster instantiation of the mapper even on a local disk.
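The idea can be sketched with the stdlib gzip module as a portable stand-in for Zstd (the `zstandard` package follows the same compress/decompress pattern, with better speed and ratio); the toy reference dict here is illustrative, not the real file format:

```python
# Sketch: compressing a reference JSON, per the Zstd suggestion above.
# stdlib gzip is used as a stand-in; `zstandard` works the same way.
import gzip
import json

# A toy reference set; real ones repeat paths and offsets heavily,
# which is exactly what makes them compress so well.
refs = {f"U/0.{i}.{j}": [f"data/example1.zarr/U/0.{i}.{j}", 0, 400]
        for i in range(100) for j in range(100)}
raw = json.dumps(refs).encode()
packed = gzip.compress(raw)
print(len(packed) / len(raw))   # typically well under 0.1 for this data

# Round-trip: decompress and load before handing the dict to
# fsspec's `fo=` argument (which also accepts a dict of references).
restored = json.loads(gzip.decompress(packed))
assert restored == refs
```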

rabernat commented on July 30, 2024

Bottleneck found and fixed in #892 .

Which PR is this supposed to refer to, Martin?

dmedv commented on July 30, 2024

@rabernat I guess it's this one: fsspec/filesystem_spec#892. I'll give it a try.

martindurant commented on July 30, 2024

Yep, sorry, copy-paste failure.

dmedv commented on July 30, 2024

I've tried it with the updated fsspec and can confirm that it is much faster now, comparable with reading zarr directly. Thank you, @martindurant, I appreciate the quick response.
