cloud-drift / clouddrift

CloudDrift accelerates the use of Lagrangian data for atmospheric, oceanic, and climate sciences.

Home Page: https://clouddrift.org/

License: MIT License

Python 100.00%
climate-data climate-science data-structures oceanography python

clouddrift's People

Contributors

kevinsantana11, milancurcic, philippemiron, selipot, vadmbertr

clouddrift's Issues

Implement `RaggedArray.from_awkward()`

Similar to #44, but for instantiating a RaggedArray from an awkward array instance.

Not sure yet whether and to what extent this is useful, but it would at least provide feature parity with xarray.

As in #44, this is already done internally in RaggedArray.from_parquet():

# Internals of RaggedArray.from_parquet(); coords, metadata, data, and
# attrs_variables are dicts initialized earlier in the method.
ds = ak.from_parquet(filename)
attrs_global = ds.layout.parameters["attrs"]
name_coords = ["time", "lon", "lat", "ids"]
for var in name_coords:
    coords[var] = ak.flatten(ds.obs[var]).to_numpy()
    attrs_variables[var] = ds.obs[var].layout.parameters["attrs"]
for var in [v for v in ds.fields if v != "obs"]:
    metadata[var] = ds[var].to_numpy()
    attrs_variables[var] = ds[var].layout.parameters["attrs"]
for var in [v for v in ds.obs.fields if v not in name_coords]:
    data[var] = ak.flatten(ds.obs[var]).to_numpy()
    attrs_variables[var] = ds.obs[var].layout.parameters["attrs"]
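A possible shape for the method, as a sketch: it reuses the extraction above, starting from an existing awkward array instead of one read from Parquet. The constructor's argument order is an assumption here.

import awkward as ak

class RaggedArray:  # sketch: method to add to the existing class
    @classmethod
    def from_awkward(cls, ds: ak.Array):
        # Build the five dictionaries expected by __init__; the argument
        # order in the return statement is a guess.
        coords, metadata, data, attrs_variables = {}, {}, {}, {}
        attrs_global = ds.layout.parameters["attrs"]
        name_coords = ["time", "lon", "lat", "ids"]
        for var in name_coords:
            coords[var] = ak.flatten(ds.obs[var]).to_numpy()
            attrs_variables[var] = ds.obs[var].layout.parameters["attrs"]
        for var in [v for v in ds.fields if v != "obs"]:
            metadata[var] = ds[var].to_numpy()
            attrs_variables[var] = ds[var].layout.parameters["attrs"]
        for var in [v for v in ds.obs.fields if v not in name_coords]:
            data[var] = ak.flatten(ds.obs[var]).to_numpy()
            attrs_variables[var] = ds.obs[var].layout.parameters["attrs"]
        return cls(coords, metadata, data, attrs_global, attrs_variables)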

missing __version__ tag for the package

import clouddrift
print(clouddrift.__version__)

should return the current version. I believe this would also make it easy to set up a GitHub Action that updates the PyPI package when this number changes.
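One common way to provide this, as a sketch rather than a prescription, is to read the installed version from the package metadata in clouddrift/__init__.py:

# Sketch for clouddrift/__init__.py: derive __version__ from the
# installed package metadata (available since Python 3.8).
from importlib.metadata import version

__version__ = version("clouddrift")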

fail to "vectorize" velocity_from_position

I am attempting to apply velocity_from_position to xarray.DataArrays of lon, lat, and time. I have been following a tutorial for a similar situation. With the following ds Dataset:

ds.info()
xarray.Dataset {
dimensions:
	trajectory = 593297 ;
	obs = 1440 ;

variables:
	float64 time(trajectory, obs) ;
	float32 lat(trajectory, obs) ;
	float32 lon(trajectory, obs) ;
	int32 obs(obs) ;
	int64 trajectory(trajectory) ;
}

I can easily do:

u,v = velocity_from_position(ds.lon.isel(trajectory=0),ds.lat.isel(trajectory=0),ds.time.isel(trajectory=0))

or

u2,v2 = xr.apply_ufunc(
    velocity_from_position,
    ds.lon.isel(trajectory=0),
    ds.lat.isel(trajectory=0),
    ds.time.isel(trajectory=0),
    input_core_dims=[["obs"], ["obs"], ["obs"]],
    output_core_dims=[["obs"], ["obs"]],
    dask="allowed",
)

but the following fails:

u2,v2 = xr.apply_ufunc(
    velocity_from_position,  # first the function
    ds.lon.isel(trajectory=slice(0,10)),
    ds.lat.isel(trajectory=slice(0,10)),
    ds.time.isel(trajectory=slice(0,10)),
    input_core_dims=[["obs"], ["obs"], ["obs"]],
    output_core_dims=[["obs"],["obs"]],
    dask="allowed",
    vectorize=True,
)

and the bottom line of the error is

File ~/miniconda3/envs/research/lib/python3.10/site-packages/clouddrift/analysis.py:65, in velocity_from_position(x, y, time, coord_system, difference_scheme)
     57 # Compute dx, dy, and dt
     58 if difference_scheme == "forward":
     59 
     60     # All values except the ending boundary value are computed using the
   (...)
     63 
     64     # Time
---> 65     dt[:-1] = np.diff(time)
     66     dt[-1] = dt[-2]
     68     # Space

ValueError: could not broadcast input array from shape (10,1439) into shape (9,1440)

So I get the error, but I don't understand whether the fix is to call apply_ufunc differently or to make velocity_from_position more flexible.
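For what it's worth, the shapes in the error suggest that np.diff differences along the last axis while dt[:-1] indexes the first, i.e. the function implicitly assumes 1-d input. A hedged sketch of what a 2-d-friendly version of that step could look like, assuming time arrives with shape (trajectory, obs):

import numpy as np

# Inside velocity_from_position: difference time along the last (obs)
# axis and index dt the same way, so 1-d and 2-d inputs both work.
dt = np.empty_like(time)
dt[..., :-1] = np.diff(time, axis=-1)
dt[..., -1] = dt[..., -2]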

Migrate notebooks to clouddrift-examples

The examples can then grow in their own repo and keep the core library repo clean. We can then also separate the dependencies needed for the core library and for examples (#41). We'll link to the examples repo from the docs and the core library README.

environment file issue

Couple of points about the environment file:

  1. Get the following warning when creating the environment:

Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.

  2. Would it be better practice to include version numbers for the packages?

`velocity_from_position`: Handle ragged array

Previous discussion in #68.

@selipot suggested that velocity_from_position should also handle ragged arrays as input. Let's discuss here what these ragged arrays would look like, i.e., is the ragged array in the form of an xarray Dataset as generated by clouddrift, or something else?
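To seed the discussion, a minimal sketch assuming one possible convention: flat 1-d arrays plus a rowsize array giving the number of observations per trajectory. The function and argument names are assumptions.

import numpy as np
from clouddrift.analysis import velocity_from_position

def velocity_from_ragged(lon, lat, time, rowsize):
    # Apply velocity_from_position one trajectory at a time; rowsize[k]
    # is the number of observations in trajectory k.
    u = np.empty_like(lon)
    v = np.empty_like(lat)
    i = 0
    for n in rowsize:
        u[i : i + n], v[i : i + n] = velocity_from_position(
            lon[i : i + n], lat[i : i + n], time[i : i + n]
        )
        i += n
    return u, v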

Should docs be built and deployed on push to clouddrift/*.py files?

As I understand it, the docs are built and deployed when there is a push to files in docs/:

name: Docs
on:
  push:
    branches: [ main ]
    paths:
      - 'docs/**'

However, docstrings are sourced from Python module files and we need a manual re-build if, say, no new module is added (no changes to docs/) but a docstring in a .py file is updated. I've been running the docs workflow manually in such cases.

Is it reasonable to also build and deploy the docs on changes to *.py files? In other words:

name: Docs

on:
  push:
    branches: [ main ]
    paths:
      - 'docs/**'
      - 'clouddrift/*.py'

Building docs fails

See https://github.com/Cloud-Drift/clouddrift/actions/runs/3935783550/jobs/6731763496.

The relevant bit is:

Theme error:
An error happened in rendering the page api.
Reason: UndefinedError("'logo' is undefined")
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.

This error goes away for me locally after I have commented out the html_theme_options in docs/conf.py. However, it doesn't seem to go away in GitHub Actions and I don't understand why.

@philippemiron do you have an idea?

Rename `ragged_array` -> `RaggedArray`

The Python style guide recommends TitleCase for class names.

This is a very minor issue. Newcomers to the library who are familiar with Python may at first glance get the impression that

from clouddrift import ragged_array

is a function and not a class. Since the project is very young and there are few if any users, I think it'd be good to address this now and be consistent with the Python style guide for the public API.

Upgrade to awkward v2

Awkward v2 was released on Dec 9. Since we don't require awkward<2 in the pyproject.toml dependencies field, the awkward import will need to change from

import awkward._v2 as ak

to

import awkward as ak

and require awkward>=2.0.0 in pyproject.toml.

Thanks to Ibis Gonzalez for reporting.

What dependencies are needed?

In pyproject.toml we have

dependencies = [
    "numpy>=1.22.4",
    "pandas>=1.4.2",
    "xarray>=2022.3.0",
    "netcdf4>=1.5.8",
    "pyarrow>=8.0.0",
    "zarr>=2.11.3",
    "numba>=0.53.1",
    "tqdm>=4.64.0",
    "fsspec>=2022.3.0",
    "awkward==1.9.0",
]

From my scanning of the project, we're not using fsspec, pandas, zarr, or numba. Can they be safely removed? We're not using netCDF4 directly, but it is an optional dependency of xarray.

Syntax error when importing clouddrift with Python 3.8

$ ipython
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import clouddrift
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [1], line 1
----> 1 import clouddrift

File ~/Work/clouddrift/venv/lib/python3.8/site-packages/clouddrift/__init__.py:1
----> 1 from clouddrift.dataformat import *

File ~/Work/clouddrift/venv/lib/python3.8/site-packages/clouddrift/dataformat.py:9
      5 from typing import Tuple, Optional
      6 from tqdm import tqdm
----> 9 class ragged_array:
     10     def __init__(
     11         self,
     12         coords: dict,
   (...)
     16         attrs_variables: Optional[dict] = {},
     17     ):
     18         self.coords = coords

File ~/Work/clouddrift/venv/lib/python3.8/site-packages/clouddrift/dataformat.py:29, in ragged_array()
     22     self.attrs_variables = attrs_variables
     23     self.validate_attributes()
     25 @classmethod
     26 def from_files(
     27     cls,
     28     indices: list,
---> 29     preprocess_func: Callable[[int], xr.Dataset],
     30     vars_coords: dict,
     31     vars_meta: list = [],
     32     vars_data: list = [],
     33     rowsize_func: Optional[Callable[[int], int]] = None,
     34 ):
     35     """Generate ragged arrays archive from a list of trajectory files
     36 
     37     Args:
   (...)
     46         obj: ragged array class object
     47     """
     48     # if no method is supplied, get the dimension from the preprocessing function

TypeError: 'ABCMeta' object is not subscriptable

As I understand it, subscripting the abstract classes in collections.abc (e.g. Callable[[int], xr.Dataset]) was not supported until Python 3.9.

We can remove the subscript here, or we can require Python 3.9. I recommend the latter.
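For reference, a sketch of the 3.8-compatible alternative, should we decide to keep supporting it:

# typing.Callable and typing.Optional remain subscriptable on Python 3.8,
# unlike collections.abc.Callable, which supports [] only from 3.9.
from typing import Callable, Optional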

Function to compute velocity from positions

Let's discuss the API for this function.

def velocity_from_positions(
    x: xr.DataArray,
    y: xr.DataArray,
    time: xr.DataArray,
    coords: str = "spherical",  # x is lon and y is lat; can also be "cartesian", where x is easting and y is northing
    order: int = 1,  # can also be 2 for centered difference; we can discuss if higher orders may be desired
) -> Tuple[xr.DataArray, xr.DataArray]:
    ...

I wonder whether we should avoid the name coords, to prevent confusion with Xarray's special coords. Perhaps coord_system or coordinate_system?

Until there is native support for ragged arrays in Xarray, the user should be careful to not run this on multiple consecutive trajectories. It will be easy for this function to detect that (e.g. jump in time and space) and issue a warning.
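A sketch of that safeguard; the threshold of 3x the median time step is an arbitrary assumption:

import warnings
import numpy as np

# Inside velocity_from_positions, after receiving time: flag suspiciously
# large time gaps, which suggest concatenated trajectories.
dt = np.diff(time)
if np.any(dt > 3 * np.median(dt)):
    warnings.warn("Large time jump detected; input may contain multiple trajectories.")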

GDP 6-hourly dataset versioning

Right now preprocessing-6hourly.ipynb generates files such as gdp_6h_v2.00*.nc, yet the version 2.00 label applies only to the hourly dataset. I think the versioning of the 6-hourly dataset is done with a cutoff date. The latest date in the dataset created with that code is 2021-01-04T00:00:00.000000000, so I am guessing the cutoff is December 2020. This needs to be checked with @RickLumpkin and Bertrand (is he on GitHub?).

best way to create a clouddrift kernel for jupyter lab?

What's the best way to create a clouddrift kernel for jupyter lab?

After creating the clouddrift environment with conda I had to

  1. conda activate clouddrift
  2. conda install ipython ipykernel
  3. ipython kernel install --user --name=clouddrift
  4. Then, in another environment with jupyter lab installed, launch jupyter lab and select the clouddrift kernel.

Is this the right way to go about this, i.e., to use jupyter lab or a notebook to run some of the clouddrift examples locally?

remove or stash timeseries.py

Please remove/stash/delete timeseries.py. It only contains a spectrum function, which is incorrect. And it should certainly not be in the docs :)

Should we commit executed Jupyter notebooks?

Currently, the notebooks under examples/ are committed with cleared cells.

Should we commit executed notebooks? An advantage of doing this is that the notebook becomes readable in the browser, with no need to run it. This is for people who want to get a taste of it without having to set it up locally.

A disadvantage is a little more burden on maintenance. If the output of the cells changes (e.g. due to the change of the implementation or the API), the notebooks would need to be updated as well. If there are graphics in the notebooks (and there aren't yet), then the PNG images are encoded as strings which become part of the notebook JSON file. This can significantly increase the size of the git repo, although not in any problematic way.

All this said, I'm in favor of committing executed notebooks to the repo. I'm curious what you think.

GLAD dataset adapter

Part of #53.

Can be adapted from clouddrift-examples/data/glad.py into clouddrift/adapters.py.

Public website for the project

The main structure of the website is there (docs/), but it needs updates before public release.

  • Modify the general description of the project.
  • Include the prototype of the library.

GDP sorted RA

  1. Read the directory files (https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/)
  2. Sort by death_date and/or deployment.

Update GDP `preprocess.py` in clouddrift

Based on the latest version used for the Jupyter notebook, it should be more flexible in querying data over HTTPS/FTP or from the ERDDAP server.

Includes a way to fetch:

  • by specific IDs;
  • only alive drifters;
  • by date;
  • by some attributes.

All of this should be possible by initially reading the metadata file (dirfl_1_5000.dat).
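A minimal sketch of the idea, assuming whitespace-delimited columns with the drifter ID first (the actual layout of dirfl_1_5000.dat should be checked):

import pandas as pd

# Read the metadata file once, then derive the list of drifter IDs to
# fetch; the column layout is an assumption.
meta = pd.read_csv("dirfl_1_5000.dat", sep=r"\s+", header=None)
ids = meta[0].tolist()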

lint

Ok, I was really confused by this linting error.

The files are fine, but there is a circular import... see, for example, what happens when you run:

$ black haversine.py dataformat.py
All done! ✨ 🍰 ✨
2 files left unchanged.
(research) pmiron@m2air ~/Downloads/clouddrift/clouddrift
$ Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/multiprocessing/__init__.py", line 16, in <module>
    from . import context
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/multiprocessing/context.py", line 6, in <module>
    from . import reduction
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/multiprocessing/reduction.py", line 16, in <module>
    import socket
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/socket.py", line 54, in <module>
    import os, sys, io, selectors
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/selectors.py", line 12, in <module>
    import select
  File "/Users/pmiron/Downloads/clouddrift/clouddrift/select.py", line 1, in <module>
    import awkward as ak
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/site-packages/awkward/__init__.py", line 7, in <module>
    import awkward._nplikes
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/site-packages/awkward/_nplikes.py", line 7, in <module>
    import numpy
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/site-packages/numpy/__init__.py", line 140, in <module>
    from . import core
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/site-packages/numpy/core/__init__.py", line 100, in <module>
    from . import _add_newdocs_scalars
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/site-packages/numpy/core/_add_newdocs_scalars.py", line 9, in <module>
    import platform
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/platform.py", line 119, in <module>
    import subprocess
  File "/Users/pmiron/micromamba/envs/research/lib/python3.10/subprocess.py", line 223, in <module>
    _PopenSelector = selectors.SelectSelector
AttributeError: partially initialized module 'selectors' has no attribute 'SelectSelector' (most likely due to a circular import)

Long story short, the issue is the name of the module select.py. If you rename it to anything else, it works. My guess is that black (via multiprocessing) ends up importing our select.py instead of the standard library's select module (https://docs.python.org/3/library/select.html).

The solution is to change the name of the select.py module, but to be honest I don't have any good suggestions for a new name.

conda package

The package is now available on PyPI, but a few steps are required to put this into a conda feedstock. I am waiting for awkward array >=1.9.0 (link) to be available on conda, which would greatly simplify building the package and setting up the environment. This should also fix the binder environment.

Note that the clouddrift library uses awkward array v2, which should officially be out by the end of Q4 2022. For now it is accessed using import awkward._v2 as ak, and much of what we need is only available from >1.8.0. Their conda package is currently on 1.8.0, but it is evolving pretty fast!

speed up processing of numerical datasets

With numerical model output, it is not efficient to loop through the trajectories when we can simply identify the fill value and reshape the data into ragged arrays.
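A sketch of the idea; the function name, argument names, and NaN default are assumptions:

import numpy as np

def to_ragged(arr, fill_value=np.nan):
    # arr has shape (trajectory, obs); fill_value marks unused slots.
    valid = ~np.isnan(arr) if np.isnan(fill_value) else arr != fill_value
    rowsize = valid.sum(axis=1)   # number of observations per trajectory
    return arr[valid], rowsize    # flat ragged data plus row sizes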

How to interpolate a gridded field onto the drifter locations?

I am trying a solution on the branch sst-interp in examples/interp_cci_drifters.ipynb with the CCI SST analysis global dataset, which is hosted on an AWS data repository. To make the interpolation manageable I am looping over trajectories using xarray. There is probably a better solution.
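One candidate for that better solution is xarray's pointwise (vectorized) indexing: passing indexer DataArrays that share a dimension makes interp select points rather than a grid. A sketch, where the dataset, variable, and coordinate names are assumptions:

import xarray as xr

# Pointwise interpolation: the three indexers share the "obs" dimension,
# so interp returns one value per drifter observation instead of a grid.
sst = field["analysed_sst"].interp(
    lon=xr.DataArray(ds.lon.values.ravel(), dims="obs"),
    lat=xr.DataArray(ds.lat.values.ravel(), dims="obs"),
    time=xr.DataArray(ds.time.values.ravel(), dims="obs"),
)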

Extract/Select

  • Include a general function to filter by any attribute, taking as input a dictionary of variables and ranges.
  • To extract a region we could pass a dictionary {'lon': [min, max], 'lat': [min, max], 'time': [min, max]} (see the sketch below).
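A minimal sketch of such a filter, assuming an xarray Dataset with an obs dimension; all names are placeholders:

import numpy as np

def subset(ds, ranges):
    # ranges maps variable names to (min, max); combine the conditions
    # into a single boolean mask along the obs dimension.
    mask = np.ones(ds.dims["obs"], dtype=bool)
    for var, (lo, hi) in ranges.items():
        mask &= (ds[var].values >= lo) & (ds[var].values <= hi)
    return ds.isel(obs=np.where(mask)[0])

# e.g. subset(ds, {"lon": (-98, -78), "lat": (18, 31)})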

Lagrangian observations datasets

What list of datasets should we use as examples for the project?

  • GulfDrifters: A consolidated surface drifter dataset for the Gulf of Mexico (link)
  • Strateole-2: Long-duration balloon flights at the tropical tropopause (link)
  • WHOI gliders (link)
  • AOML gliders (link)

Implement `RaggedArray.from_xarray()`

As discussed with @selipot on 10/20, it's in scope for clouddrift to allow constructing a RaggedArray instance from an xarray.Dataset.

RaggedArray.from_netcdf() already does this internally, i.e.:

with xr.open_dataset(filename) as ds:
    nb_traj = ds.dims["traj"]
    nb_obs = ds.dims["obs"]
    attrs_global = ds.attrs
    for var in ds.coords.keys():
        coords[var] = ds[var].data
        attrs_variables[var] = ds[var].attrs
    for var in ds.data_vars.keys():
        if len(ds[var]) == nb_traj:
            metadata[var] = ds[var].data
        elif len(ds[var]) == nb_obs:
            data[var] = ds[var].data
        else:
            print(
                f"Error: variable '{var}' has unknown dimension size of "
                f"{len(ds[var])}, which is not traj={nb_traj} or obs={nb_obs}."
            )
        attrs_variables[var] = ds[var].attrs

Some assumptions are currently made about the dimension names (assumed to be "traj" and "obs"). The method should allow the user to specify the dimension names to use, in the absence of an established convention.
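A sketch of the proposed classmethod signature; the parameter names for the user-specified dimensions are assumptions:

import xarray as xr

class RaggedArray:  # sketch: method to add to the existing class
    @classmethod
    def from_xarray(cls, ds: xr.Dataset, dim_traj: str = "traj", dim_obs: str = "obs"):
        # Same extraction as from_netcdf() above, but starting from an
        # in-memory Dataset and using the user-supplied dimension names.
        ...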

RaggedArray from numerical

I am following the example dataformat-numerical.ipynb to convert the output of an Ocean Parcels simulation to a ragged array and save it to a NetCDF file, but I do not understand how the time variable is handled and/or whether the units can be specified. The NetCDF file written by Parcels contains the variable time in units of seconds since a pivot date, but the NetCDF file written by clouddrift after converting to a ragged array seems to be in minutes since the origin of the experiment. I dug through dataformat.py to understand but could not figure it out.

`velocity_from_position`: Handle n-d arrays

Previous discussion in #68.

This issue is to discuss whether and how velocity_from_position should handle n-d arrays. Some specific questions:

  • Is 2-d sufficient for most use cases, or is n-d also useful? The obvious second dimension here is trajectories.
  • If not, what are the use cases for n-d?

failed install of clouddrift on HPC triton at UM

Installing the developer version of clouddrift fails even after loading a module to get a recent version of cmake. The error is below. Note that it says clouddrift built successfully, but it is not available in my environment.
...
copying pyarrow/tests/data/parquet/v0.7.1.some-named-index.parquet -> build/lib.linux-ppc64le-cpython-310/pyarrow/tests/data/parquet
running build_ext
creating /tmp/pip-install-dc72ev4z/pyarrow_d29963b5e592436b82282ff9a8c4a3d3/build/cpp
-- Running CMake for PyArrow C++
cmake -DARROW_BUILD_DIR=build -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/pip-install-dc72ev4z/pyarrow_d29963b5e592436b82282ff9a8c4a3d3/build/dist -DPYTHON_EXECUTABLE=/home/selipot/miniconda3/envs/research/bin/python3.10 -DPython3_EXECUTABLE=/home/selipot/miniconda3/envs/research/bin/python3.10 -DPYARROW_CXXFLAGS= -DPYARROW_WITH_DATASET=off -DPYARROW_WITH_PARQUET_ENCRYPTION=off -DPYARROW_WITH_HDFS=off -DPYARROW_WITH_FLIGHT=off /tmp/pip-install-dc72ev4z/pyarrow_d29963b5e592436b82282ff9a8c4a3d3/pyarrow/src
-- The C compiler identification is GNU 8.3.1
-- The CXX compiler identification is GNU 8.3.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/rh/devtoolset-8/root/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rh/devtoolset-8/root/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:63 (find_package):
By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Arrow", but
CMake did not find one.

    Could not find a package configuration file provided by "Arrow" with any of
    the following names:
  
      ArrowConfig.cmake
      arrow-config.cmake
  
    Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
    "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
    provides a separate development package or SDK, be sure it has been
    installed.
  
  
  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-install-dc72ev4z/pyarrow_d29963b5e592436b82282ff9a8c4a3d3/build/cpp/CMakeFiles/CMakeOutput.log".
  error: command '/share/builds/ppcle/spack/opt/spack/linux-rhel7-power9le/gcc-8.3.1/cmake-3.20.2-ghhqkkvhbflxpgxyzumbfrip46j4ga3f/bin/cmake' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyarrow
Successfully built clouddrift
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects
