This repo provides the code for the study on the impact of chunk size on read performance of multi-dimensional data in Zarr format stored in the cloud (AWS S3). Code is provided for rechunking a default Zarr archive, measuring the performance (mainly time and memory consumption) of different chunking strategies, and performance data visualization. This study was conducted as part of the Fall 2021/Spring 2022 internship at NASA Goddard Space Flight Center.
The complete list of required packages is provided in env-eisfire.yml. To create a conda environment with these packages, run: conda env create -f env-eisfire.yml
Note that this code is set up to run on an AWS cluster that uses Slurm.
In this study, we use the GEOS-FP dataset in Zarr format stored in the AWS S3 bucket eis-dh-fire/geos-fp-global/; specifically, the inst.zarr store and the BCEXTTAU variable. The default chunking scheme is 5136 chunks in the time dimension, 1 chunk in longitude, and 1 chunk in latitude.
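To make the link between a chunking scheme (number of chunks per dimension) and the resulting chunk shape concrete, here is a minimal sketch. The dimension lengths and the function name `chunk_shape` are assumptions made for illustration; they are not taken from this repo or from the dataset metadata.

```python
import math

def chunk_shape(dim_sizes, n_chunks):
    """Return the chunk length along each dimension for a given chunk count."""
    return {
        dim: math.ceil(dim_sizes[dim] / n_chunks[dim])
        for dim in dim_sizes
    }

# Hypothetical dimension lengths, for illustration only.
dim_sizes = {"time": 5136, "lat": 721, "lon": 1152}

# Default scheme described above: many chunks in time, one in each spatial dimension.
default_scheme = {"time": 5136, "lat": 1, "lon": 1}
# Example alternative scheme: the spatial dimensions are split as well.
spatial_scheme = {"time": 5136, "lat": 100, "lon": 100}

print(chunk_shape(dim_sizes, default_scheme))
print(chunk_shape(dim_sizes, spatial_scheme))
```

Under these assumed sizes, the default scheme stores one full spatial map per chunk, while the alternative scheme splits each map into many small tiles.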
To rechunk the dataset into a different scheme (e.g., 5136 chunks in time, 100 in longitude, and 100 in latitude), navigate to the directory rechunk/ and modify the main() function in the script run_rechunk.py, setting the variables time, lat, and lon to the desired values (a single value or a list of values for each variable; the script creates all unique combinations of the values). Run the rechunking script with the command: python run_rechunk.py
This automatically launches a cluster job for each combination of variable values.
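The way single values and lists expand into unique combinations can be sketched as follows. This is an illustration only, not the code in run_rechunk.py; the helper name `chunk_combinations` is hypothetical.

```python
from itertools import product

def chunk_combinations(time, lat, lon):
    """Expand single values or lists into unique (time, lat, lon) combinations."""
    def as_list(value):
        # Accept either a single value or a list/tuple of values.
        return list(value) if isinstance(value, (list, tuple)) else [value]

    combos = product(as_list(time), as_list(lat), as_list(lon))
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(combos))

combos = chunk_combinations(time=5136, lat=[1, 100], lon=[1, 100])
for t, la, lo in combos:
    # In the repo, each combination corresponds to one cluster job.
    print(f"time={t}, lat={la}, lon={lo}")
```

With one value for time and two values each for lat and lon, this yields four combinations, so four jobs would be launched.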
Job info, progress, and any errors are stored in the .out and .err files in the sub-directory logs-slurm/. The final output Zarr store is written back to S3 (eis-dh-fire/dieumynguyen_rechunked/geos-fp-global_inst/).
After rechunking the dataset to various chunking schemes and storing the different versions of the dataset on S3, we track how the schemes perform for common data access and analysis operations (e.g., extracting a time series at a location or extracting a map or spatial slice at a datetime). Performance metrics include wall clock time, peak memory usage, the rechunking time, and Zarr store archive size.
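As a rough illustration of how wall clock time and peak memory can be recorded for one operation over several trials, here is a minimal sketch using only the Python standard library. It is not the method used in measure_performance.py, which may rely on different tools; `measure` and `example_operation` are hypothetical names, and `tracemalloc` only sees allocations made through Python's memory allocator.

```python
import time
import tracemalloc

def measure(operation, n_trials=1):
    """Run `operation` n_trials times; return a list of (seconds, peak_bytes)."""
    results = []
    for _ in range(n_trials):
        tracemalloc.start()
        start = time.perf_counter()
        operation()
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append((elapsed, peak))
    return results

def example_operation():
    # Stand-in for a data access such as extracting a time series.
    data = [float(i) for i in range(100_000)]
    return sum(data)

metrics = measure(example_operation, n_trials=3)
for trial, (seconds, peak_bytes) in enumerate(metrics):
    print(f"trial {trial}: {seconds:.4f} s, peak {peak_bytes / 1e6:.2f} MB")
```

Repeating the operation over several trials makes it possible to see run-to-run variation, which can be substantial when reading from object storage.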
Navigate to the directory measure_performance/.
- To obtain archive size data, run sbatch measure_archive_size.sh to submit a cluster job, which runs measure_archive_size.py.
- To obtain rechunking time, run sbatch measure_rechunking_time.sh to submit a cluster job, which runs measure_rechunking_time.py.
- To obtain wall time and peak memory usage for a given data operation, modify the selected operation in the main() function in measure_performance.py. Then, run sbatch measure_performance.sh to submit a cluster job, which runs measure_performance.py.
Job info, progress, and any errors are stored in the .out and .err files in the sub-directory performance-logs-slurm/.
- Archive size data is saved in data/geos-fp-global_inst/archive_sizes.csv.
- Rechunking time data is saved in data/geos-fp-global_inst/rechunking_time.csv.
- Time and memory data for each operation are saved in data/geos-fp-global_inst, with the filename indicating the operation and the number of trials/repetitions (e.g., time_series_metrics_ntrials1.csv).
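The file naming pattern for per-operation metrics can be sketched as below. The pattern is inferred from the single example filename above; the function names and the CSV column names are assumptions for illustration and may not match what the repo actually writes.

```python
import csv
import os
import tempfile

def metrics_filename(operation, n_trials):
    """Build a filename like time_series_metrics_ntrials1.csv."""
    return f"{operation}_metrics_ntrials{n_trials}.csv"

def write_metrics(out_dir, operation, rows):
    """Write one (wall_time_s, peak_memory_bytes) row per trial to a CSV file."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, metrics_filename(operation, len(rows)))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names.
        writer.writerow(["trial", "wall_time_s", "peak_memory_bytes"])
        for trial, (seconds, peak_bytes) in enumerate(rows):
            writer.writerow([trial, seconds, peak_bytes])
    return path

# Write to a temporary directory so the example does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_metrics(tmp, "time_series", [(1.23, 4_000_000)])
    written_name = os.path.basename(out_path)
    with open(out_path, newline="") as f:
        written_rows = list(csv.reader(f))

print(written_name)
```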
Visualization uses the performance data generated in #2.
Navigate to the directory visualization/. Run sbatch visualize.sh to submit a job that runs visualize.py.
The heatmaps and scatterplots shown in the paper are stored in data/geos-fp-global_inst/heatmaps, data/geos-fp-global_inst/normalized_heatmaps, and data/geos-fp-global_inst/scatterplots.
Reference: Nguyen DMT, Cortes JC, Dunn MM, Shiklomanov AN (2022). Impact of Chunk Size on Read Performance of Zarr Data in Cloud-based Object Stores.