scida's Introduction

scida

scida is an out-of-the-box analysis tool for large scientific datasets. It primarily supports the astrophysics community, focusing on cosmological and galaxy formation simulations using particles or unstructured meshes, as well as large observational datasets. This tool uses dask, allowing analysis to scale up from your personal computer to HPC resources and the cloud.

Features

  • Unified, high-level interface to load and analyze large datasets from a variety of sources.
  • Parallel, task-based data processing with dask arrays.
  • Physical unit support via pint.
  • Easily extensible architecture.

Requirements

  • Python 3.9, 3.10, 3.11, 3.12

Documentation

The documentation can be found at https://scida.io.

Install

pip install scida

First Steps

After installing scida, follow the tutorial.
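
As a quick orientation before the tutorial, a minimal session could look like this (a sketch only; the path is a placeholder and the field names assume an AREPO/TNG-style snapshot):

from scida import load

ds = load("/path/to/snapshot")          # placeholder path; aliases of known simulations also work
ds.info()                               # overview of available particle types and fields
gas = ds.data["PartType0"]              # fields are lazy dask arrays
print(gas["Masses"].sum().compute())    # nothing is read from disk until .compute()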

Citation

If you use scida in your research, please cite the following paper:

`Byrohl et al., (2024). scida: scalable analysis for scientific big data. Journal of Open Source Software, 9(94), 6064, https://doi.org/10.21105/joss.06064`

with the following bibtex entry:

@article{scida,
  title     = {scida: scalable analysis for scientific big data},
  author    = {Chris Byrohl and Dylan Nelson},
  journal   = {Journal of Open Source Software},
  year      = {2024},
  volume    = {9},
  number    = {94},
  pages     = {6064},
  publisher = {The Open Journal},
  doi       = {10.21105/joss.06064},
  url       = {https://doi.org/10.21105/joss.06064}
}

Issues

If you encounter any problems, please file an issue along with a detailed description.

License

Distributed under the terms of the MIT license, scida is free and open source software.

scida's People

Contributors: arkordt, cbyrohl, dependabot[bot], dnelson86

scida's Issues

TypeError when Indexing Halo Object with Dask Array

from scida import load

ds = load(basePath+"/snapdir_099")

data = ds.return_data(subhaloID=42)

The above code is taken from the website https://cbyrohl.github.io/scida/halocatalogs/ and it gives the following error:

TypeError                                 Traceback (most recent call last)
Cell In[132], line 4
      1 from scida import load
      2 ds = load(basePath+"/snapdir_099")
----> 4 data = ds.return_data(haloID=42)

File ~/conda-envs/myenv/lib/python3.10/site-packages/scida/interface.py:385, in Selector.__call__.<locals>.newfn(*args, **kwargs)
    382 self.data = FieldContainer()
    383 deepdictkeycopy(self.data_backup, self.data)
--> 385 self.prepare(*args, **kwargs)
    386 if self.keys is None:
    387     raise NotImplementedError(
    388         "Subclass implementation needed for self.keys!"
    389     )

File ~/conda-envs/myenv/lib/python3.10/site-packages/scida/customs/arepo/dataset.py:56, in ArepoSelector.prepare(self, *args, **kwargs)
     54         length = lengths[pnum]
     55         for k, v in self.data_backup[p].items():
---> 56             self.data[p][k] = v[offset : offset + length]
     57 snap.data = self.data

File ~/conda-envs/myenv/lib/python3.10/site-packages/pint/facets/numpy/quantity.py:242, in NumpyQuantity.__getitem__(self, key)
    240     raise
    241 except TypeError:
--> 242     raise TypeError(
    243         "Neither Quantity object nor its magnitude ({})"
    244         "supports indexing".format(self._magnitude)
    245     )

TypeError: Neither Quantity object nor its magnitude (dask.array)supports indexing

scalefac is not being read from snap/catalog metadata

For example

In [112]: snap.ureg['a'].to_base_units()
WARNING:pint.util:Calling the getitem method from a UnitRegistry is deprecated. use `parse_expression` method or use the registry as a callable.
Out[112]: 1.0 <Unit('dimensionless')>

In [113]: snap = load("/virgotng/universe/IllustrisTNG/TNG50-4/output/snapdir_050/", catalog="/virgotng/universe/IllustrisTNG/TNG50-4/output/groups_050/", units=True)

uid -> id or index?

Suggest to rename uid.

Both id and index are commonly already used for both particle and group/subhalo indices. Can we avoid creating a new name?

observational data support/example (GAIA)

We want to add at least two observational data analysis examples for launch. Let's start with:

  • GAIA (/virgotng/mpia/obs/GAIA/) in single HDF5 format

i.e. add a short blurb about this to the "Getting Started" page, create a new tutorial page for this case, and add it to "Supported Datasets" under observational datasets.

support FIRE-2 public simulations

The FIRE-2 simulations were publicly released. We have them in their original format, an example is:

/virgotng/mpia/FIRE-2/core_FIRE-2_runs/m12b_res7100/

we should check for 'out of the box' support, at least for the snapshots.

(0) Unfortunately I don't see any units/metadata. You could get these from Wetzel+2023 (Section 4.2)

(1) These are also "GIZMO" simulation outputs, like SIMBA (original files), but different model.

(2) The halo/rockstar_dm/ directory contains the output of a (modified) rockstar substructure finder. Auto-discovery of these associated catalogs could be added, but not essential. In particular, as snapshots are not group ordered, there are no groupby type operations (easily) possible, so these are just catalogs of numbers.

support .fits data format

Most observational datasets will be in FITS instead of HDF5.

An example for testing is /virgotng/mpia/obs/SDSS/specObj-dr16.fits.

It would be advantageous to support FITS natively, without necessarily requiring e.g. conversion into a better data format.

define gas Temperature as derived field

Needs to ship and be available (for any simulation where it makes sense/we know how to compute it).

In general, a handful of common derived fields should be included, even just to give some examples of how they can be made and look.
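
For illustration, a sketch of what such a derived temperature could look like, assuming standard AREPO/Gadget conventions (InternalEnergy in (km/s)^2, ElectronAbundance relative to hydrogen, hydrogen mass fraction X_H = 0.76) and a units=False load; the assignment at the end mirrors the pattern used in the "linking in third-party/external data files" issue below:

from scida import load

# physical constants in cgs
kB = 1.380649e-16      # Boltzmann constant [erg/K]
mp = 1.672622e-24      # proton mass [g]
XH = 0.76              # assumed hydrogen mass fraction
gamma = 5.0 / 3.0

ds = load("/path/to/snapdir_099", units=False)    # placeholder path
gas = ds.data["PartType0"]

u  = gas["InternalEnergy"] * 1e10                 # (km/s)^2 -> (cm/s)^2
xe = gas["ElectronAbundance"]
mu = 4.0 / (1.0 + 3.0 * XH + 4.0 * XH * xe) * mp  # mean molecular weight [g]
gas["Temperature"] = (gamma - 1.0) * u / kB * mu  # temperature [K], still a lazy dask array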

two-point statistics

It would be useful to support various easy/common two-point statistics, such as the two-point correlation function.

Definitely via external package. Which one(s)?

(load upon request; no overall package dependency)

Question: Tips needed to make the code faster

Hi! This is not a bug report. I have a task for which I have written the code below to the best of my abilities, and I want to know if there is any way to make it run faster.

I want to get the data of the gas particles in a simulation snapshot that lie outside the R200_crit distance from the center of each halo. In other words, I want to separate the gas that exists outside halos from the gas inside them.
The code that I am using to do that for one halo is:

import numpy as np
from scida import load

basePath = "/virgotng/universe/IllustrisTNG/TNG100-3/output/"
ds = load(basePath+"/snapdir_099", units=False)
gas99 = ds.data["gas"]
group = ds["Group"]
groupPos = group["GroupPos"][0]
R200_crit = group["Group_R_Crit200"][0]
# Getting the halo gas data from the subset at 3 R_200crit
# Calculate the vector differences between each gas particle position and the groupPos
delta = gas99["Coordinates"] - groupPos

# Calculate the Euclidean distance for each particle from the groupPos (we use axis=1 as the coordinates are 3D)
distances = np.sqrt(np.sum(delta**2, axis=1))

# Identify the particles within 3 times R200_crit from the groupPos
mask = distances <= 3 * R200_crit

# Select the particles that meet the condition
#particles_within_R200 = gas99[mask.compute()]
particles_within_R200 = {key: gas99[key][mask] for key in gas99.keys()}

On the vera cluster, it took roughly 4.4 seconds for the first halo.
For all the halos in the simulation, it could take a very long time.

This is my code to remove the halo gas:

import numpy as np
from tqdm import tqdm
from scida import load

ds = load(basePath+"/snapdir_099", units=False)
gas99 = ds.data["gas"]
group = ds["Group"]
groupPos = group["GroupPos"]
R200_crit = group["Group_R_Crit200"]

def calculate_mask(i, mask, pos, R200_crit, coords):
    # Calculate the vector differences between each gas particle position and the pos
    delta = coords - pos

    # Calculate the Euclidean distance for each particle from the pos (we use axis=1 as the coordinates are 3D)
    distances = np.sqrt(np.sum(delta**2, axis=1))

    # Identify the particles within R200_crit from the pos
    halo_mask = distances < R200_crit

    # Update the mask to exclude these particles
    updated_mask = mask & ~halo_mask

    return updated_mask

# Start with a mask that includes all particles
mask = np.ones(len(gas99["Coordinates"]), dtype=bool)

# Calculate masks for each groupPos and R200_crit
for i in tqdm(range(len(groupPos)), desc='Calculating masks'):
    mask = calculate_mask(i, mask, groupPos[i], R200_crit[i], gas99["Coordinates"])

# At this point, the mask should include only the particles that are outside all halos
particles_outside_halos = {key: gas99[key][mask] for key in ["CenterOfMass","Coordinates","Masses","Density","GFM_Metallicity","GroupID","ParticleIDs","Velocities"]}

I tried parallelizing the code as well as turning some of the arrays into numpy arrays, but it did not work and somehow was slower.

the "10 analysis questions"

If we asked users "what are the ten most important tasks you want to achieve with the data", we should aim to support the most commonly requested ones.

  • "Quick inspection":
    • Show some of the meta data, the data dictionary structure, etc
  • Plot a histogram of a field / calculate a reduction statistic (see the sketch after this list).

Then there is a set which are specific for cosmological/galaxy simulations:

  • Calculate a radial profile of some quantity centered on an object of interest.
  • Calculate per halo reductions (i.e. summary statistics) from particle-level information.
  • Make an image: projections and slices.
  • Make a phase space diagram, e.g. 2D histogram (weighted by mass, or otherwise) of two particle-level quantities.
  • For N simulations, compare them all easily. For instance, overplot them all on the same stellar mass function.
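
As a sketch of the histogram/reduction item above (units=False; the bin range is an arbitrary choice in code units):

import dask.array as da
from scida import load

basePath = "/virgotng/universe/IllustrisTNG/TNG100-3/output"
ds = load(basePath + "/snapdir_099", units=False)
gas = ds.data["PartType0"]

# mass-weighted histogram of log10 gas density; bins and range are arbitrary here
hist, edges = da.histogram(da.log10(gas["Density"]), bins=100, range=[-12.0, -2.0],
                           weights=gas["Masses"])
hist = hist.compute()   # everything above stays lazy until this point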

docs: one tutorial for 'galaxy sims', one tutorial for 'obs data'

Move the current "Halo and galaxy catalogs" content into the galaxy sims tutorial.

Start a new tutorial for obs data, we can use GAIA as the example.

Further docs updates:

  • Rename "Misc" to "Supported Features" (or similar).

  • Move "units" under "Supported Features".

  • Move "series" and "visualization" also under "Supported Features".

  • Start an "Advanced Features" tab, and post "Processing large datasets" there.

  • Move the animated vis image from the first main page into "visualization". Replace it with a logo placeholder.

Full MTNG support

Preliminary MTNG support

  • unit discovery
  • doc page
  • load snapshot
    • regular "snapshot_XXX" snapshot
    • "snapshot-prevmostboundonly_XXX" snapshot
    • "snapshot-prevmostboundonly-treeorder_XXX" snapshot
  • load catalog
    • regular "fof_subhalo_tab_XXX" catalog
    • "subhalo_prog_XXX" catalog
    • "subhalo_desc_XXX" catalog
    • auxiliary "subhalo_treelink_XXX" tree catalog
  • load merger trees
    • regular "trees"
  • load light cones

document halo reduction operations

An example for map_halo_operation could be to calculate 1D and 2D profiles.

(Needs to be in a "Getting Started" section, and can also have its own page if there is a need for such detail.)
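
For reference, a sketch of what a per-halo 1D mass profile could look like with the grouped interface that appears elsewhere in these issues (units=False; path, bin edges, and field choices are placeholders; the distance construction mirrors the radial-profile issue further below):

import numpy as np
import dask.array as da
from scipy.stats import binned_statistic
from scida import load

ds = load("/path/to/snapdir_099", units=False)    # placeholder path
gas = ds.data["PartType0"]
grp = ds.data["Group"]

# halocentric distance per gas cell (same construction as in the radial-profile issue below)
pos3 = gas["Coordinates"] - grp["GroupPos"][gas["GroupID"]]
dist = da.sqrt(da.sum(pos3**2, axis=1))

bins = np.linspace(0.0, 300.0, 31)                # radial bins; arbitrary choice in code units

def mass_profile(dist, masses):
    # sum of particle masses per radial bin, evaluated independently for each halo
    return binned_statistic(dist, masses, statistic="sum", bins=bins)[0]

profiles = ds.grouped(dict(dist=dist, Masses=gas["Masses"])).apply(mass_profile).evaluate()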

astrophysics code support

From our earlier HedgeDoc:

  • AREPO (aim to fully support public outputs as a standard?)
  • GADGET (2? 3? 4? aim to fully support public G4 outputs as a standard?)
  • GIZMO
  • SWIFT
  • ILTIS photon output

caching speed/decision

If I type this:

 sim = load("/virgotng/universe/IllustrisTNG/TNG50-2/")

I need it to work in less than 5 seconds.

.fields incomplete

I was surprised that the following reports only 4 fields (3 derived/internal and CenterOfMass)?

In [1]: from scida import load
Warning! Using default configuration. Please adjust/replace in '/u/dnelson/.config/scida/config.yaml'.
/u/dnelson/.local/envs/py3/lib/python3.9/site-packages/scida/customs/arepo/dataset.py:643: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def get_shidx(

In [2]: ds = load('TNG50-1')

In [3]: ds[99].data['gas'].fields
WARNING:pint.util:Redefining 'Msun' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'ckpc' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'code_length' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'code_velocity' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'code_mass' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'code_time' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'code_pressure' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Out[3]:
{'CenterOfMass': dask.array<mul, shape=(8737803526, 3), dtype=float32, chunksize=(11184810, 3), chunktype=numpy.ndarray> <Unit('code_length')>,
 'uid': dask.array<arange, shape=(8737803526,), dtype=int64, chunksize=(16777216,), chunktype=numpy.ndarray>,
 'GroupID': dask.array<mul, shape=(8737803526,), dtype=int64, chunksize=(16777216,), chunktype=numpy.ndarray> <Unit('dimensionless')>,
 'SubhaloID': dask.array<mul, shape=(8737803526,), dtype=int64, chunksize=(16777216,), chunktype=numpy.ndarray> <Unit('dimensionless')>}

In [4]: ds[99].data['gas']
Out[4]: FieldContainer[containers=0, fields=4]

cannot load simulation by alias/name

In [33]: sim = load("TNG50-2")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [33], in <cell line: 1>()
----> 1 sim = load("TNG50-2")

File /vera/u/dnelson/astrodask/src/astrodask/convenience.py:238, in load(path, units, unitfile, overwrite, **kwargs)
    231 def load(
    232     path: str,
    233     units: Union[bool, str] = False,
   (...)
    236     **kwargs
    237 ):
--> 238     path = find_path(path, overwrite=overwrite)
    240     if "catalog" in kwargs and kwargs["catalog"] is not None:
    241         if not isinstance(kwargs["catalog"], list):

File /vera/u/dnelson/astrodask/src/astrodask/convenience.py:227, in find_path(path, overwrite)
    225                 break
    226     if not found:
--> 227         raise ValueError("Specified path '%s' unknown." % path)
    228 return path

ValueError: Specified path 'TNG50-2' unknown.
> /vera/u/dnelson/astrodask/src/astrodask/convenience.py(227)find_path()
    225                     break
    226         if not found:
--> 227             raise ValueError("Specified path '%s' unknown." % path)
    228     return path
    229

per [sub]halo particle selection details

The current example under

"Selecting particles/cells which belong to a particular [sub]halo?"

mentions the code:

data = snap.return_data(haloID=42) # the result contains all fields for all particle types for a given halo
data["PartType0"]["Density"] # density for all gas particles in halo

It isn't clear whether disk loads are occurring here. If so, let's rename this section "Loading". Either way, let's mention this explicitly and describe what is occurring.

An extremely common use case for analysis is a loop over objects e.g.

for i in range(n_halos):
    data = snap.return_data(haloID=i)
    result[i] = data["PartTypeX"]["Field"].statistic()

Will this work? Is it efficient? (Is it touching files once per halo, or not?) If so, let us document and suggest this approach. If not, we need to suggest the best way to work in this manner.

Clarification needed on the units

While loading the Coordinates of the gas particles using the Illustris package, the units given on the website are ckpc/h: https://www.tng-project.org/data/docs/specifications/#parttype0
The units for the GroupPos variable are ckpc/h as well.

When looking at the same variables using the scida package, the units for Coordinates are centimeter

coords = gas["Coordinates"]
coords.to_base_units().units
----------------------
</> centimeter

When looking at the GroupPos variable, the units are dimensionless

import dask.array as da
ds = load(basePath+"/snapdir_099", units=True)
data = ds.return_data(haloID=42)
grp = data["Group"]
grp["GroupPos"].units
----------------------
</> dimensionless

Additionally, for the variable Coordinates, the maximum value for the same snapshot (TNG100-3 snap 99) is the same whether the data is read from the Illustris package or the Scida package:

gas_il = il.snapshot.loadHalo(basePath,99,0,'gas',fields="Coordinates")
print("gas_il -  max Coordinate: ", np.max(np.array(gas_il))) #UNITS ckpc/h

ds = load(basePath+"/snapdir_099", units=True)
print("gas_scida - max Coordinate: ", ds.data["Group"]["Coordinates"].max().compute())
-----------------------------------------------------------------------------------------
</> gas_il -  max Coordinate:  74999.99978918549
 gas_scida - max Coordinate:  74999.8984375 dimensionless

document YAML simulation descriptors

Tell people how they can/should create custom YAML files for the simulations they are working with, i.e. beyond those supported 'out of the box'.

grouping and reduction operation on subhalos

We need the same support that now exists for groups, but for subhalos.

For example, a command like:

mass = ds.grouped("Masses", parttype="PartType0").sum().evaluate(compute=True)

except reducing over subhalos. (Each subhalo is a contiguous slice of the particle arrays, just like a group).

simulation dataset support

Desired "dataset support", from our previous HedgeDoc.

Simulations:

  • TNG (/virgotng/universe/IllustrisTNG/)
  • EAGLE ("TNG-like" version on /virgotng/universe/Eagle/)
  • EAGLE (original version)
  • SIMBA ("TNG-like" version on /virgotng/universe/Simba/)
  • SIMBA (original version)
  • Illustris (/virgotng/universe/Illustris/)
  • TNG-Cluster (/virgotng/mpia/TNG-Cluster/L680n8192TNG/)
  • MTNG (/virgotng/mpa/MTNG/)
  • FIRE-2 (public simulations, on /virgotng/mpia/FIRE-2/)
  • FLAMINGO (for testing (private): /virgotng/mpia/FLAMINGO/, can also infer from SWIFT code)
  • Auriga (/vera/ptmp/gc/dnelson/public/auriga_halo6_lvl6/ for example)
  • THESAN (public, /virgotng/mpa/Thesan/Thesan-1/)

idealized Arepo simulation support

We have so far focused mostly on cosmological simulations.

Let us also check minimal support for idealized simulations (e.g. very different units, possibly different field conventions, etc.).

e.g. the isolated galaxy disk in arepo/examples/ (from the public version of the code).

numpy version requirement?

In [1]: from scida import load
Warning! Using default configuration. Please adjust/replace in '/u/dnelson/.config/scida/config.yaml'.
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 from scida import load

File ~/.local/envs/py3/lib/python3.9/site-packages/scida/__init__.py:7
      4 # need to import interfaces to register them in registry for loading
      5 # TODO: automatize implicitly
      6 from scida.convenience import load
----> 7 from scida.customs.arepo.dataset import ArepoSnapshot
      8 from scida.customs.arepo.series import ArepoSimulation
      9 from scida.interfaces.gadgetstyle import GadgetStyleSnapshot

File ~/.local/envs/py3/lib/python3.9/site-packages/scida/customs/arepo/dataset.py:12
     10 from dask import delayed
     11 from numba import jit, njit
---> 12 from numpy._typing import NDArray
     14 from scida.discovertypes import _determine_mixins
     15 from scida.fields import FieldContainer

ModuleNotFoundError: No module named 'numpy._typing'
> /u/dnelson/.local/envs/py3/lib/python3.9/site-packages/scida/customs/arepo/dataset.py(12)<module>()
     10 from dask import delayed
     11 from numba import jit, njit
---> 12 from numpy._typing import NDArray
     13
     14 from scida.discovertypes import _determine_mixins
$ pip list

numpy                         1.21.5

progressbar when creating dataset virtual file

E.g.

In [1]: from astrodask import load
In [2]: ds = load("TNG50-1")

In [3]: snap = ds.get_dataset(z=0.0)

In [4]: snap
Out[4]:

appears to "hang" for a long time. We need a progress bar here if a cache/virtual file is being created.

info() does not exist for series

>>> from scida import load
>>> ds = load('/virgotng/universe/IllustrisTNG/TNG100-3/', units=True) 
>>> ds.info()

as in the tutorial, gives an error:

AttributeError: 'ArepoSimulation' object has no attribute 'info'

linking in third-party/external data files

Imagine you have an external HDF5 file which is not in the snapshot, and not in the catalog.

How do you 'link' it in, so that scida knows about it and it can be combined with existing datasets?

e.g.

ds = load('simname')
ds.data['PartType0']['MyNewField'] = X

provide documentation examples of what X can be.

(1) loaded using our same load architecture
(2) a simple ndarray
(3) a dask array of an HDF5 dataset in an HDF5 file on disk (see the sketch below)
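
A sketch for option (3), using only standard h5py/dask calls (the file name, dataset name, and chunk size below are made up; the chunking assumes a 1-D dataset):

import h5py
import dask.array as da
from scida import load

ds = load('simname')
f = h5py.File("/path/to/external_data.hdf5", "r")     # hypothetical external file
X = da.from_array(f["MyNewField"], chunks=(2**24,))   # lazy dask view of the on-disk dataset
ds.data['PartType0']['MyNewField'] = X                # attach like any other field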

SubhaloID cannot be computed for small chunksizes

from scida import load
series = load("TNG50-1")
ds = series.get_dataset(redshift=3.0)
data = ds.data
print(ds)
data["PartType0"]["SubhaloID"].compute().magnitude

fails on the last line with

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/highlevelgraph.py:550, in HighLevelGraph.get_all_external_keys(self)
    549 try:
--> 550     return self._all_external_keys
    551 except AttributeError:

AttributeError: 'HighLevelGraph' object has no attribute '_all_external_keys'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 data["PartType0"]["SubhaloID"].compute().magnitude

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/pint/facets/dask/__init__.py:32, in check_dask_array.<locals>.wrapper(self, *args, **kwargs)
     29 @functools.wraps(f)
     30 def wrapper(self, *args, **kwargs):
     31     if isinstance(self._magnitude, dask_array.Array):
---> 32         return f(self, *args, **kwargs)
     33     else:
     34         msg = "Method {} only implemented for objects of {}, not {}".format(
     35             f.__name__, dask_array.Array, self._magnitude.__class__
     36         )

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/pint/facets/dask/__init__.py:92, in DaskQuantity.compute(self, **kwargs)
     78 @check_dask_array
     79 def compute(self, **kwargs):
     80     """Compute the Dask array wrapped by pint.PlainQuantity.
     81 
     82     Parameters
   (...)
     90         A pint.PlainQuantity wrapped numpy array.
     91     """
---> 92     (result,) = compute(self, **kwargs)
     93     return result

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/base.py:589, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    581     return args
    583 schedule = get_scheduler(
    584     scheduler=scheduler,
    585     collections=collections,
    586     get=get,
    587 )
--> 589 dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
    590 keys, postcomputes = [], []
    591 for x in collections:

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/base.py:362, in collections_to_dsk(collections, optimize_graph, optimizations, **kwargs)
    360 for opt, val in groups.items():
    361     dsk, keys = _extract_graph_and_keys(val)
--> 362     dsk = opt(dsk, keys, **kwargs)
    364     for opt_inner in optimizations:
    365         dsk = opt_inner(dsk, keys, **kwargs)

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/array/optimization.py:51, in optimize(dsk, keys, fuse_keys, fast_functions, inline_functions_fast_functions, rename_fused_keys, **kwargs)
     49 dsk = optimize_blockwise(dsk, keys=keys)
     50 dsk = fuse_roots(dsk, keys=keys)
---> 51 dsk = dsk.cull(set(keys))
     53 # Perform low-level fusion unless the user has
     54 # specified False explicitly.
     55 if config.get("optimization.fuse.active") is False:

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/highlevelgraph.py:707, in HighLevelGraph.cull(self, keys)
    703 from dask.layers import Blockwise
    705 keys_set = set(flatten(keys))
--> 707 all_ext_keys = self.get_all_external_keys()
    708 ret_layers: dict = {}
    709 ret_key_deps: dict = {}

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/highlevelgraph.py:557, in HighLevelGraph.get_all_external_keys(self)
    552 keys: set = set()
    553 for layer in self.layers.values():
    554     # Note: don't use `keys |= ...`, because the RHS is a
    555     # collections.abc.Set rather than a real set, and this will
    556     # cause a whole new set to be constructed.
--> 557     keys.update(layer.get_output_keys())
    558 self._all_external_keys = keys
    559 return keys

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/blockwise.py:486, in Blockwise.get_output_keys(self)
    480     return {(self.output, *p) for p in self.output_blocks}
    482 # Return all possible output keys (no culling)
    483 return {
    484     (self.output, *p)
    485     for p in itertools.product(
--> 486         *[range(self.dims[i]) for i in self.output_indices]
    487     )
    488 }

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/blockwise.py:486, in <listcomp>(.0)
    480     return {(self.output, *p) for p in self.output_blocks}
    482 # Return all possible output keys (no culling)
    483 return {
    484     (self.output, *p)
    485     for p in itertools.product(
--> 486         *[range(self.dims[i]) for i in self.output_indices]
    487     )
    488 }

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/blockwise.py:446, in Blockwise.dims(self)
    442 """Returns a dictionary mapping between each index specified in
    443 `self.indices` and the number of output blocks for that indice.
    444 """
    445 if not hasattr(self, "_dims"):
--> 446     self._dims = _make_dims(self.indices, self.numblocks, self.new_axes)
    447 return self._dims

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/blockwise.py:1484, in _make_dims(indices, numblocks, new_axes)
   1480 def _make_dims(indices, numblocks, new_axes):
   1481     """Returns a dictionary mapping between each index specified in
   1482     `indices` and the number of output blocks for that indice.
   1483     """
-> 1484     dims = broadcast_dimensions(indices, numblocks)
   1485     for k, v in new_axes.items():
   1486         dims[k] = len(v) if isinstance(v, tuple) else 1

File ~/.cache/pypoetry/virtualenvs/paper-labs-analysis-abIgZPoN-py3.9/lib/python3.9/site-packages/dask/blockwise.py:1475, in broadcast_dimensions(argpairs, numblocks, sentinels, consolidate)
   1472     return toolz.valmap(consolidate, g2)
   1474 if g2 and not set(map(len, g2.values())) == {1}:
-> 1475     raise ValueError("Shapes do not align %s" % g)
   1477 return toolz.valmap(toolz.first, g2)

ValueError: Shapes do not align {'.0': {1, 2, 571}}

However, when increasing the chunksize from 128MiB to 256MiB or more, the calculation succeeds.

unit scalefac not working

For snapshot = 50

In [79]: snap.data['PartType0']['Velocities'].to('km/s')[0].compute()
[########################################] | 100% Completed | 1.20 ss
Out[79]: array([ 109.10534, -176.66605,   98.3127 ], dtype=float32) <Unit('kilometer / second')>

In [80]: snap.data['PartType0']['Velocities'][0].compute()
[########################################] | 100% Completed | 501.46 ms
Out[80]: array([ 109.10534, -176.66605,   98.3127 ], dtype=float32) <Unit('code_velocity')>

zarr virtual/kerchunked datasets

Using kerchunk we can wrap a HDF5 file as a "virtual" zarr dataset. All the original data remains in the HDF5 file(s), the zarr wrapper file simply contains the needed metadata for zarr (or xarray) to directly access the data, bypassing the HDF5 library entirely.

An example for GAIA is here:

/virgotng/mpia/obs/GAIA/gaia_dr3.zarr

(it is actually a json file). It already concatenates the "gaia_dr3.hdf5" and "gaia_dr3_aux.hdf5" files.

It can be read by zarr as:

import fsspec
import zarr

m = fsspec.filesystem("reference", fo="gaia_dr3.zarr").get_mapper("")
ds = zarr.open(m, mode='r')

We could support this by default: detect if a requested load is in fact such a kerchunked JSON file, and in that case pass it to fsspec and then treat m as a zarr file.
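
One possible detection heuristic (a sketch; the exact marker keys differ between kerchunk reference-spec versions):

import json

import fsspec
import zarr

def looks_like_kerchunk(path):
    # kerchunk reference files are JSON: version 1 carries a "refs" mapping,
    # version 0 is a flat mapping containing zarr metadata keys such as ".zgroup"
    try:
        with open(path) as f:
            obj = json.load(f)
    except (ValueError, UnicodeDecodeError, OSError):
        return False
    return isinstance(obj, dict) and ("refs" in obj or ".zgroup" in obj)

path = "gaia_dr3.zarr"
if looks_like_kerchunk(path):
    m = fsspec.filesystem("reference", fo=path).get_mapper("")
    ds = zarr.open(m, mode="r")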

document list of simulations/datasets with 'out of the box' support

I.e. in the documentation we should have a list of common simulations, and observational datasets, which the package can easily work with, either (i) because we have added custom support e.g. the YAML files are there already, or (ii) the 'original' data formats that a public user would download the data as are sufficiently compatible.

pint warnings must be suppressed

snap = load("/virgotng/universe/IllustrisTNG/TNG50-4/output/snapdir_050/", units=True)

WARNING:pint.util:Redefining 'h' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'a' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'h' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
WARNING:pint.util:Redefining 'a' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
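
Until this is handled inside scida, a user-side workaround (a sketch, not the requested fix) is to raise the level of the pint logger that emits these messages:

import logging

# the messages above come through the standard logging module under "pint.util"
logging.getLogger("pint.util").setLevel(logging.ERROR)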

detect corrupt dataset virtual file if interrupted

When creating any cache/virtual files for datasets, if the user interrupts or something crashes, the file will be incomplete and corrupt. We need to detect this (e.g. by adding a "done" attribute) and automatically re-create the file if needed.
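
A sketch of such a marker with h5py (the attribute name is made up):

import h5py

def cache_is_complete(path):
    # treat a missing/False marker, or an unreadable file, as corrupt -> rebuild
    try:
        with h5py.File(path, "r") as f:
            return bool(f.attrs.get("scida_cache_complete", False))
    except OSError:
        return False

# writer side: set the marker only after the last virtual dataset has been written
# with h5py.File(path, "r+") as f:
#     f.attrs["scida_cache_complete"] = True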

document unit system

Let's add a section to the "Getting Started" page about the units and how they work.

If more details are needed, we can create a separate "Units" page.

(I definitely think we need units=True to be the default).

CoordinateName warnings

INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'PartType3'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'PartType3'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'PartType3'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'PartType3'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'Group'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'Group'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'Subhalo'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'Subhalo'
INFO:astrodask.interfaces.mixins.spatial:Did not find CoordinatesName for species 'Subhalo'

Let's change these to DEBUG i.e. hide by default.

Encountering Error in Halo Radial Profile Code - 'numpy.ndarray' object has no attribute '_meta'

Hi! I tried to run the code for getting the radial profile for each halo from the website https://cbyrohl.github.io/scida/halocatalogs/, and ran into some errors. Could you please check? Thanks!

import numpy as np
import dask.array as da
from scipy.stats import binned_statistic
from scida import load

basePath = "/virgotng/universe/IllustrisTNG/TNG100-3/output"

ds = load(basePath+"/snapdir_099", units=True)

gas = ds.data["PartType0"]
vol = gas["Masses"] / gas["Density"]
grp = ds.data["Group"]
pos3 = gas["Coordinates"] - grp["GroupPos"][gas["GroupID"]]
dist = da.sqrt(da.sum((pos3)**2, axis=1)) 

def customfunc(dist, density, volume):
    a = binned_statistic(dist, density, statistic="sum", bins=np.linspace(0, 200, 10))[0]
    b = binned_statistic(dist, volume, statistic="sum", bins=np.linspace(0, 200, 10))[0]
    return a/b

g = ds.grouped(dict(dist=dist, Density=gas["Density"],
                    Volume=vol))
s = g.apply(customfunc).evaluate()

The error that I am getting is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 11
      9 vol = gas["Masses"] / gas["Density"]
     10 grp = ds.data["Group"]
---> 11 pos3 = gas["Coordinates"] - grp["GroupPos"][gas["GroupID"]]
     12 dist = da.sqrt(da.sum((pos3)**2, axis=1)) 
     14 def customfunc(dist, density, volume):

File ~/conda-envs/myenv/lib/python3.10/site-packages/pint/facets/numpy/quantity.py:238, in NumpyQuantity.__getitem__(self, key)
    236 def __getitem__(self, key):
    237     try:
--> 238         return type(self)(self._magnitude[key], self._units)
    239     except PintTypeError:
    240         raise

File ~/conda-envs/myenv/lib/python3.10/site-packages/dask/array/core.py:1993, in Array.__getitem__(self, index)
   1990     return self
   1992 out = "getitem-" + tokenize(self, index2)
-> 1993 dsk, chunks = slice_array(out, self.name, self.chunks, index2, self.itemsize)
   1995 graph = HighLevelGraph.from_collections(out, dsk, dependencies=[self])
   1997 meta = meta_from_array(self._meta, ndim=len(chunks))

File ~/conda-envs/myenv/lib/python3.10/site-packages/dask/array/slicing.py:176, in slice_array(out_name, in_name, blockdims, index, itemsize)
    173 index += (slice(None, None, None),) * missing
...
    825     adjust_chunks={0: 1},  # one row for each block in a
    826 )
    828 # add offsets to take account of the position of each block within the array a

AttributeError: 'numpy.ndarray' object has no attribute '_meta'

PartTypeN.keys() needs to show all

For example

In [9]: snap.data['PartType0'].keys()
Out[9]: dict_keys(['CenterOfMass', 'ElectronAbundance', 'uid'])

needs to show all (by default).

It makes sense that "all" means only those fields which actually exist in the files (derived fields can be skipped), though it would be good to somehow also be able to see a list of derived fields.

"<u6" datatype support

MTNG uses a custom 6-byte-integer data type for IDs, which h5py/numpy does not support. Any reference to obj.dtype (also indirectly via obj.nbytes) will result in:

/lib/python3.9/site-packages/h5py/_hl/dataset.py:545: in dtype
    return self.id.dtype
h5py/h5d.pyx:179: in h5py.h5d.DatasetID.dtype.__get__
    ???
h5py/h5d.pyx:182: in h5py.h5d.DatasetID.dtype.__get__
    ???
h5py/h5t.pyx:434: in h5py.h5t.TypeID.dtype.__get__
    ???
h5py/h5t.pyx:435: in h5py.h5t.TypeID.dtype.__get__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   TypeError: data type '<u6' not understood

h5py/h5t.pyx:954: TypeError

Some related discussion for custom types in h5py/h5py#1822 and h5py/h5py#1825

We might want to cast this to uint64. However, there are two problems to be solved:

1.) How to create virtual hdf5 datasets for custom types?
2.) How to map the custom data type to uint64 in dask? We would need some kind of "stride" operation when mapping the binary blob of 6-byte elements onto 8-byte elements padded with zeros (see the sketch below).
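
A numpy sketch of problem 2, i.e. the zero-padded mapping of raw little-endian 6-byte IDs onto uint64 (per dask chunk this could run inside map_blocks):

import numpy as np

def u6_to_u64(raw_bytes: np.ndarray) -> np.ndarray:
    # raw_bytes: flat uint8 array holding N little-endian 6-byte unsigned integers,
    # e.g. u6_to_u64(np.frombuffer(blob, dtype=np.uint8))
    six = raw_bytes.reshape(-1, 6)
    padded = np.zeros((six.shape[0], 8), dtype=np.uint8)
    padded[:, :6] = six                     # low 6 bytes copied, high 2 bytes stay zero
    return padded.view(np.uint64).ravel()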

support fields with missing/unknown units

It is natural to encounter new fields, e.g. in AREPO outputs, which we have never seen before and for which metadata may not be present. In this case, a clearly indicated 'missing/unknown units' marker needs to be attached.

For Illustris the FluidQuantities field can adopt this for now.
