Comments (9)
What do the speeds look like if you add `simple_templates=True` to `fsspec.get_mapper('reference://'...)`? Or if you regenerate the combined metadata with `mz2z.translate(combined_json, template_count=None)`
from kerchunk.
No significant change in speed with

```python
mapper = fsspec.get_mapper('reference://', fo=combined_json, simple_templates=True)
```

It seems to be even slower by 20 ms, but that's within the margin of error, I guess. Same result after regenerating the metadata with `template_count=None`.
I would love to play with optimisation on your dataset and see where the bottlenecks are. A huge advantage of kerchunk/referenceFS is that the code is very small and simple Python, so we can find and fix slow paths, or just suggest better default values when building your references. The existing kerchunk timings tend to be for remote datasets, which is what it was really built for, and never at the ms time-scale.
@martindurant Thank you. The dataset has not been made public yet, but I've come up with a synthetic test that I believe illustrates the same issue:
```python
from kerchunk.zarr import single_zarr
from kerchunk.combine import MultiZarrToZarr
import fsspec
import zarr
import numpy as np

combined_json = 'file:///home/idies/workspace/combined.json'
nx = 5000
ny = 5000
chunk_shape = (1, 10, 10)

refs = []
for i in range(1, 3):
    path = f'data/example{i}.zarr'
    root = zarr.open_group(path, mode='w')
    t = root.zeros('t', shape=(1,), chunks=(1,), overwrite=True)
    x = root.zeros('x', shape=(nx,), chunks=(nx,), overwrite=True)
    y = root.zeros('y', shape=(ny,), chunks=(ny,), overwrite=True)
    t[0] = i - 1
    t.attrs['_ARRAY_DIMENSIONS'] = ['t']
    x[:] = np.arange(nx)
    x.attrs['_ARRAY_DIMENSIONS'] = ['x']
    y[:] = np.arange(ny)
    y.attrs['_ARRAY_DIMENSIONS'] = ['y']
    a = root.zeros('U', shape=(1, nx, ny), chunks=chunk_shape, overwrite=True)
    a[...] = np.random.rand(1, nx, ny)
    a.attrs['_ARRAY_DIMENSIONS'] = ['t', 'x', 'y']
    refs.append(single_zarr(path))

mz2z = MultiZarrToZarr(refs, remote_protocol='file', xarray_concat_args={'dim': 't'})
mz2z.translate(combined_json, template_count=None)

mapper = fsspec.get_mapper('reference://', fo=combined_json)
ds1 = zarr.open(mapper, mode='r')
%timeit ds1['U'][0, 0:10, 0:10]
# 494 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ds2 = zarr.open('data/example1.zarr', mode='r')
%timeit ds2['U'][0, 0:10, 0:10]
# 770 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
It takes a few minutes to run and will produce a fairly large reference file, about 40MB. As you can see, the performance with kerchunk in this case is also much slower than with a single zarr array. Reducing the number of chunks improves kerchunk performance and in the trivial case with a single chunk per array it becomes just as fast as reading zarr directly.
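As a back-of-the-envelope check on why the reference file gets so large, the number of entries in the combined JSON scales with the total chunk count across the source arrays (a sketch using the shapes from the synthetic test above; the exact bytes-per-entry figure is an assumption, since it depends on key and path lengths):

```python
import math

# Each synthetic array has shape (1, 5000, 5000) chunked as (1, 10, 10),
# and there are two such arrays (one per time step).
shape = (1, 5000, 5000)
chunks = (1, 10, 10)
chunks_per_array = math.prod(s // c for s, c in zip(shape, chunks))
total_refs = 2 * chunks_per_array
print(chunks_per_array, total_refs)  # 250000 500000
```

At very roughly 80 bytes per JSON entry, half a million references is consistent with the ~40MB file observed.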
If it helps, here is the layout of our zarr arrays in the actual dataset.
| | Array | Chunk |
|---|---|---|
| Bytes | 698.72 GB | 74.65 MB |
| Shape | (8, 90, 13, 4320, 4320) | (1, 1, 1, 4320, 4320) |
| Count | 9361 Tasks | 9360 Chunks |
| Type | float32 | numpy.ndarray |
There are 30 such arrays distributed across 10 servers.
The combined array shape is (240, 90, 13, 4320, 4320) and contains 280800 chunks.
The kerchunk reference file in this case also turns out quite large, over 70MB.
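The chunk counts and sizes above follow directly from dividing the array shape by the chunk shape (a quick check against the figures in the table, assuming 4-byte float32 elements):

```python
import math

# Combined array: 30 per-server arrays concatenated along the first axis.
shape = (240, 90, 13, 4320, 4320)
chunks = (1, 1, 1, 4320, 4320)

n_chunks = math.prod(s // c for s, c in zip(shape, chunks))
chunk_bytes = math.prod(chunks) * 4  # float32 = 4 bytes per element
print(n_chunks, chunk_bytes)  # 280800 chunks of ~74.65 MB each
```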
Bottleneck found and fixed in #892.
Note that I recommend Zstd-compressing the JSON file for big outputs, because it always decompresses quickly, but with a great compression ratio for this kind of data. It might lead to faster instantiation of the mapper even on a local disk.
> Bottleneck found and fixed in #892.

Which PR is this supposed to refer to, Martin?
@rabernat I guess it's this one: fsspec/filesystem_spec#892. I'll give it a try.
Yep, sorry, copy-paste failure.
I've tried it with the updated fsspec and can confirm that it is much faster now, comparable with reading zarr directly. Thank you, @martindurant, I appreciate the quick response.