Comments (9)
What do the speeds look like if you add `simple_templates=True` to `fsspec.get_mapper('reference://'...)`? Or if you regenerate the combined metadata with `mz2z.translate(combined_json, template_count=None)`
from kerchunk.
No significant change in speed with

```python
mapper = fsspec.get_mapper('reference://', fo=combined_json, simple_templates=True)
```

It seems to be even slower by 20 ms, but that's within the margin of error, I guess. Same result after regenerating the metadata with `template_count=None`.
I would love to play with optimisation on your dataset and see where the bottlenecks are. A huge advantage of kerchunk/referenceFS is that the code is very small and simple Python, so we can find and fix slow paths, or just suggest better default values when building your references. The existing kerchunk timings tend to be for remote datasets, which is what it was really built for, and never at the ms time-scale.
@martindurant Thank you. The dataset has not been made public yet, but I've come up with a synthetic test that I believe illustrates the same issue:
```python
from kerchunk.zarr import single_zarr
from kerchunk.combine import MultiZarrToZarr
import fsspec
import zarr
import numpy as np

combined_json = 'file:///home/idies/workspace/combined.json'
nx = 5000
ny = 5000
chunk_shape = (1, 10, 10)

refs = []
for i in range(1, 3):
    path = f'data/example{i}.zarr'
    root = zarr.open_group(path, mode='w')
    t = root.zeros('t', shape=(1,), chunks=(1,), overwrite=True)
    x = root.zeros('x', shape=(nx,), chunks=(nx,), overwrite=True)
    y = root.zeros('y', shape=(ny,), chunks=(ny,), overwrite=True)
    t[0] = i - 1
    t.attrs['_ARRAY_DIMENSIONS'] = ['t']
    x[:] = np.arange(nx)
    x.attrs['_ARRAY_DIMENSIONS'] = ['x']
    y[:] = np.arange(ny)
    y.attrs['_ARRAY_DIMENSIONS'] = ['y']
    a = root.zeros('U', shape=(1, nx, ny), chunks=chunk_shape, overwrite=True)
    a[...] = np.random.rand(1, nx, ny)
    a.attrs['_ARRAY_DIMENSIONS'] = ['t', 'x', 'y']
    refs.append(single_zarr(path))

mz2z = MultiZarrToZarr(refs, remote_protocol='file', xarray_concat_args={'dim': 't'})
mz2z.translate(combined_json, template_count=None)

mapper = fsspec.get_mapper('reference://', fo=combined_json)
ds1 = zarr.open(mapper, mode='r')
%timeit ds1['U'][0, 0:10, 0:10]
# 494 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ds2 = zarr.open('data/example1.zarr', mode='r')
%timeit ds2['U'][0, 0:10, 0:10]
# 770 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
It takes a few minutes to run and will produce a fairly large reference file, about 40MB. As you can see, the performance with kerchunk in this case is also much slower than with a single zarr array. Reducing the number of chunks improves kerchunk performance and in the trivial case with a single chunk per array it becomes just as fast as reading zarr directly.
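As a back-of-the-envelope check on why the reference file gets so large, the number of entries in the combined JSON scales with the total chunk count across the source arrays (a sketch using the shapes from the synthetic test above; the exact bytes-per-entry figure is an assumption, since it depends on key and path lengths):

```python
import math

# Each synthetic array has shape (1, 5000, 5000) chunked as (1, 10, 10),
# and there are two such arrays (one per time step).
shape = (1, 5000, 5000)
chunks = (1, 10, 10)
chunks_per_array = math.prod(s // c for s, c in zip(shape, chunks))
total_refs = 2 * chunks_per_array
print(chunks_per_array, total_refs)  # 250000 500000
```

At very roughly 80 bytes per JSON entry, half a million references is consistent with the ~40MB file observed.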
If it helps, here is the layout of our zarr arrays in the actual dataset.
| | Array | Chunk |
|---|---|---|
| Bytes | 698.72 GB | 74.65 MB |
| Shape | (8, 90, 13, 4320, 4320) | (1, 1, 1, 4320, 4320) |
| Count | 9361 Tasks | 9360 Chunks |
| Type | float32 | numpy.ndarray |
There are 30 such arrays distributed across 10 servers.
The combined array shape is (240, 90, 13, 4320, 4320) and contains 280800 chunks.
The kerchunk reference file in this case also turns out quite large, over 70MB.
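The chunk counts and sizes above follow directly from dividing the array shape by the chunk shape (a quick check against the figures in the table, assuming 4-byte float32 elements):

```python
import math

# Combined array: 30 per-server arrays concatenated along the first axis.
shape = (240, 90, 13, 4320, 4320)
chunks = (1, 1, 1, 4320, 4320)

n_chunks = math.prod(s // c for s, c in zip(shape, chunks))
chunk_bytes = math.prod(chunks) * 4  # float32 = 4 bytes per element
print(n_chunks, chunk_bytes)  # 280800 chunks of ~74.65 MB each
```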
Bottleneck found and fixed in #892.
Note that I recommend Zstd-compressing the JSON file for big outputs, because it always decompresses quickly, but with a great compression ratio for this kind of data. It might lead to faster instantiation of the mapper even on a local disk.
> Bottleneck found and fixed in #892.

Which PR is this supposed to refer to, Martin?
@rabernat I guess it's this one: fsspec/filesystem_spec#892. I'll give it a try.
Yep, sorry, copy-paste failure.
I've tried it with the updated fsspec and can confirm that it is much faster now, comparable with reading zarr directly. Thank you, @martindurant, I appreciate the quick response.