nvidia / earth2studio

Open-source deep-learning framework for exploring, building and deploying AI weather/climate workflows.

Home Page: https://nvidia.github.io/earth2studio/

License: Apache License 2.0

Makefile 0.20% Python 99.80%
ai climate-science deep-learning weather

earth2studio's Introduction

Earth2Studio is a Python-based package designed to get users up and running with AI weather and climate models fast. Our mission is to enable everyone to build, research and explore AI driven meteorology.

- Earth2Studio Documentation -

Install | User-Guide | Examples | API

Quick start

Install Earth2Studio:

pip install earth2studio

Run a deterministic weather prediction in just a few lines of code:

from earth2studio.models.px import DLWP
from earth2studio.data import GFS
from earth2studio.io import NetCDF4Backend
from earth2studio.run import deterministic as run

model = DLWP.load_model(DLWP.load_default_package())
ds = GFS()
io = NetCDF4Backend("output.nc")

run(["2024-01-01"], 10, model, ds, io)

Features

Earth2Studio provides access to pre-trained AI weather models and inference features through an easy-to-use and extensible Python interface. This package focuses on supplying users the tools to build their own workflows, pipelines, APIs, packages, etc. via modular components including:

  • Collection of pre-trained weather/climate prediction models
  • Collection of pre-trained diagnostic weather models
  • Variety of online and on-prem data sources for initialization, scoring, analysis, etc.
  • IO utilities for exporting predicted data to user friendly formats
  • Suite of perturbation methods for building ensemble predictions
  • Sample workflows and examples for common tasks / use cases
  • Seamless integration into other NVIDIA packages including Modulus

For a more complete list of features, be sure to view the documentation. Don't see what you need? Great news: extension and customization are at the heart of our design.

Contributors

Check out the Contributing document for details about the technical requirements and the userguide for higher level philosophy, structure, and design.

License

Earth2Studio is provided under the Apache License 2.0. Please see the LICENSE file for the full license text.

earth2studio's People

Contributors

akshaysubr, dallasfoster, jleinonen, nickgeneva, sahnimanas, seansblee

earth2studio's Issues

🐛[BUG]: Device ignored in run.py functions

Version

0.1.0

On which installation method(s) does this occur?

No response

Describe the issue

In https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/run.py we have

def deterministic(
    time: list[str] | list[datetime] | list[np.datetime64],
    nsteps: int,
    prognostic: PrognosticModel,
    data: DataSource,
    io: IOBackend,
    device: Optional[torch.device] = None,
) -> IOBackend:

but then

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

so I think the device parameter to the function is ignored. There seems to be the same issue with the diagnostic function.
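
A minimal sketch of the fallback the signature seems to imply: honor an explicit device and only auto-detect when it is None. The helper name resolve_device and the use of plain strings are assumptions made so the snippet runs without PyTorch; in run.py the equivalent would presumably use torch.device and torch.cuda.is_available().

```python
from typing import Optional


def resolve_device(requested: Optional[str], cuda_available: bool) -> str:
    """Return the caller's device if given, otherwise auto-detect.

    Hypothetical helper: strings stand in for torch.device objects and
    `cuda_available` stands in for torch.cuda.is_available().
    """
    if requested is not None:
        return requested
    return "cuda" if cuda_available else "cpu"


print(resolve_device("cpu", cuda_available=True))   # explicit request wins
print(resolve_device(None, cuda_available=True))    # falls back to auto-detect
```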

🐛[BUG]: Data from ARCO and WB2 is not cached

Version

0.3.0a0

On which installation method(s) does this occur?

Source

Describe the issue

Data downloaded from ARCO and from WB2 is not cached. The respective folders ~/.cache/earth2studio/{arco,wb2} are created but remain empty. Below, a minimal example to reproduce the behaviour. For the run, e2studio was installed from source inside a modulus 24:04 container.

from earth2studio.data import ARCO, WB2Climatology, fetch_data
from numpy import datetime64, timedelta64

for data in (ARCO(cache=True), WB2Climatology(cache=True)):
    xx, meta = fetch_data(
        source=data,
        time=[datetime64('2023-01-01')],
        variable=['t2m'],
        lead_time=[0, (timedelta64(6, 'h')).astype('timedelta64[ns]')],
    )

    print(f'{xx.shape=}')
    print(f'{meta.keys()=}')

🚀[FEA]: Change input_coords to function that returns new dictionary

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Improve input_coords handling.

Presently, input_coords is a property, which is:

  1. Inconsistent with the newly updated output coords function
  2. Error-prone, as users can accidentally edit the input coord property of a component in place

Proposed change:

Update the input coords to a function that returns a new dictionary every time it is called:

def input_coords(self) -> CoordSystem:
    """Input coordinate system of prognostic model, time dimension should contain
    time-delta objects

    Returns
    -------
    CoordSystem
        Coordinate system dictionary
    """
    pass
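
The difference between the two patterns can be sketched as follows. The class names and coordinate values are hypothetical, and a plain OrderedDict is assumed as a stand-in for CoordSystem; this is an illustration of the failure mode, not the library's implementation.

```python
from collections import OrderedDict


class PropertyCoords:
    """Current pattern: the property hands out the same dict every time."""

    def __init__(self) -> None:
        self._coords = OrderedDict(batch=[], lead_time=[0], variable=["t2m"])

    @property
    def input_coords(self) -> OrderedDict:
        return self._coords


class FunctionCoords:
    """Proposed pattern: each call builds a fresh dictionary."""

    def input_coords(self) -> OrderedDict:
        return OrderedDict(batch=[], lead_time=[0], variable=["t2m"])


shared = PropertyCoords()
del shared.input_coords["batch"]          # mutates the component's own state
print("batch" in shared.input_coords)     # False: the edit leaked

fresh = FunctionCoords()
coords = fresh.input_coords()
del coords["batch"]                       # only the caller's copy changes
print("batch" in fresh.input_coords())    # True
```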

🐛[BUG]: Remote file systems time out over 5 minutes

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

The default remote file systems (apparently HTTP) have a default timeout of 5 minutes. We should make this much longer or disable it completely for slower internet connections, since the files can be large.

🚀[FEA]: Unify ONNX utils

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Low (would be nice)

Please provide a clear description of problem you would like to solve.

A few of the models use ONNX checkpoints, and each duplicates the same functions for creating ONNX runtimes.

These can be migrated into a shared util file.

🐛[BUG]: SphericalGaussian amplitude tensor not set to correct device

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

Using a noise method like this:

noise_amp = torch.zeros(73, 1, 1)
noise_amp[4] = 0.01 # t2m
perturbation = SphericalGaussian(noise_amp)

This produces an error in a workflow that's running on the GPU. The amplitude tensor should be moved to the correct device depending on the device of x in the call function.
https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/perturbation/spherical.py#L84

Should check other perturbations as well.

🐛[BUG]: Pytest getting killed on SFNO package test with wheel install

Version

main

On which installation method(s) does this occur?

Pip

Describe the issue

Getting the classic uninformative:

test/models/px/test_sfno.py::test_sfno_package[cpu] make: *** [Makefile:40: pytest] Killed

Likely a memory issue. Some research suggests pytest can cause memory usage to compound between tests...
One potential solution is to load the models in a fixture with an appropriate scope; this seems like a tricky problem with these large models.

🐛[BUG]: Coord ValueError when running run.ensemble() with SFNO

Version

0.1.0

On which installation method(s) does this occur?

Pip

Describe the issue

SFNO_value_error

Steps leading to ValueError

  • Importing makani and earth2studio both with version 0.1.0.
  • Importing SFNO with model_sfno = SFNO.load_model(SFNO.load_default_package()), which loads the following checkpoint by default: "ngc://models/nvidia/modulus/[email protected]".
  • Running run.ensemble()
io = ensemble(
    [start_date],
    60,
    2,
    model_sfno,
    CDS(),
    ZarrBackend(file_name=output_file),
    Zero(),
    batch_size=4,
    output_coords= {
        "lat": np.arange(0.0, 50.0, 0.25),
        "lon": np.arange(250.0, 345.0, 0.25),
        "variable": np.array(["msl", "u10m", "v10m","t2m"])
    },
)

Notes

  • Running with FCN imported as model_fcn = FCN.load_model(FCN.load_default_package()) executes successfully.
  • Checking the input and output coords after import returns:
    FCN_SFNO_coords
    Evidently, "lead_time" is at index position 1 after import for both models.
  • During handshake_dim of SFNO it is at index 2, while it is expected at position 1 (cf. the ValueError above).

I suspect that there is either

  1. an issue with coords handling or
  2. a mismatch between dependencies/checkpoints (makani, earth2studio).

🐛[BUG]: Error when running Diagnostic Module: module 'earth2studio.run' has no attribute 'diagnostic'

Version

0.1.0

On which installation method(s) does this occur?

Pip

Describe the issue

image

Steps leading to the error:

  1. Using the Modulus Docker
  2. Installing Makani and Earth2Studio v0.1.0
  3. Downloading the example notebook file in https://nvidia.github.io/earth2studio/examples/02_diagnostic_workflow.html
  4. Trying to run the run.diagnostic
import earth2studio.run as run

nsteps = 8
io = run.diagnostic(
    ["2021-06-01"], nsteps, prognostic_model, diagnostic_model, data, io
)

print(io.root.tree())

Notes

  1. The Deterministic/Prognostic model works.
  2. No Diagnostic function in .local/lib/python3.10/site-packages/earth2studio/run.py

Do I have to install anything else to get the Diagnostic model running? Thank you

🐛[BUG]: ARCO download timeout

Version

Latest from Github

On which installation method(s) does this occur?

Source

Describe the issue

When trying to run inference with the example workflow from #91 (comment), I get the following error while the script is downloading data:

2024-07-17 05:28:22.207 | INFO     | earth2studio.run:ensemble:294 - Running ensemble inference!
2024-07-17 05:28:22.247 | INFO     | earth2studio.run:ensemble:302 - Inference device: cuda
2024-07-17 05:28:23.070 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: u10m at 2022-01-01T12:00:00
2024-07-17 05:28:24.931 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: v10m at 2022-01-01T12:00:00
<cut output for many variables>
2024-07-17 05:30:19.831 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: z300 at 2022-01-01T12:00:00
Fetching ARCO data:  49%|████████████████████████▋                         | 36/73 [01:57<02:04,  3.36s/it]
Traceback (most recent call last):
  File "/root/earth2studio/earth2studio/data/arco.py", line 172, in create_data_array
    async for t, v, data in unordered_generator(  # type: ignore[misc,unused-ignore]
  File "/root/earth2studio/earth2studio/data/utils.py", line 251, in unordered_generator
    async for task in _limit_concurrency(func_map, limit):
  File "/root/earth2studio/earth2studio/data/utils.py", line 286, in _limit_concurrency
    done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/earth2studio-ensemble/ensemble-test.py", line 15, in <module>
    io = ensemble(
  File "/root/earth2studio/earth2studio/run.py", line 308, in ensemble
    x0, coords0 = fetch_data(
  File "/root/earth2studio/earth2studio/data/utils.py", line 70, in fetch_data
    da0 = source(adjust_times, variable)
  File "/root/earth2studio/earth2studio/data/arco.py", line 115, in __call__
    xr_array = asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

It looks like the ARCO data source hardcodes

self.async_timeout = 100

so maybe that is the issue? If so, making the timeout longer and/or user-configurable would probably solve the problem.
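
A user-configurable timeout could look roughly like the sketch below. SlowSource is a hypothetical stand-in for a remote data source (it only sleeps instead of downloading), and the 600 second default is an arbitrary assumption; the asyncio.run plus asyncio.wait_for combination mirrors the calls visible in the traceback.

```python
import asyncio
from typing import Optional


class SlowSource:
    """Hypothetical stand-in for a remote data source."""

    def __init__(self, async_timeout: Optional[float] = 600.0) -> None:
        # None disables the limit entirely (asyncio.wait_for accepts None)
        self.async_timeout = async_timeout

    async def _fetch(self, delay: float) -> str:
        await asyncio.sleep(delay)  # simulates a slow download
        return "data"

    def __call__(self, delay: float) -> str:
        return asyncio.run(asyncio.wait_for(self._fetch(delay), self.async_timeout))


print(SlowSource(async_timeout=None)(0.05))  # no limit: completes
try:
    SlowSource(async_timeout=0.01)(0.2)      # limit shorter than the download
except asyncio.TimeoutError:
    print("timed out")
```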

🐛[BUG]: ACC calculation uses wrong mean

Version

v0.1.0

On which installation method(s) does this occur?

Pip, Source

Describe the issue

The calculation of x_bar for ACC should subtract the mean of x_hat from x_hat, not the mean of x. Same for y_bar/y_hat/y. See here on lines 144 and 148.
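
To make the intended centering concrete, here is an unweighted, pure-Python sketch. It assumes x_hat and y_hat are the anomalies relative to a climatology, which is a reading of the variable names in the issue rather than something confirmed here; any weighting or reduction dimensions used by the library are omitted.

```python
import math
from statistics import fmean
from typing import Sequence


def acc(x: Sequence[float], y: Sequence[float], clim: Sequence[float]) -> float:
    """Unweighted anomaly correlation coefficient over one flattened field."""
    # Anomalies relative to climatology (assumed meaning of x_hat / y_hat)
    x_hat = [xi - ci for xi, ci in zip(x, clim)]
    y_hat = [yi - ci for yi, ci in zip(y, clim)]
    # Center each anomaly with its own mean, not the mean of the raw field
    x_mean, y_mean = fmean(x_hat), fmean(y_hat)
    x_bar = [v - x_mean for v in x_hat]
    y_bar = [v - y_mean for v in y_hat]
    cov = sum(a * b for a, b in zip(x_bar, y_bar))
    norm = math.sqrt(sum(a * a for a in x_bar) * sum(b * b for b in y_bar))
    return cov / norm


forecast = [1.0, 2.0, 3.0, 4.0]
truth = [1.5, 2.5, 2.0, 4.5]
climatology = [2.0, 2.0, 3.0, 3.0]
print(round(acc(forecast, truth, climatology), 3))
```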

🐛[BUG]: LaggedEnsemble overwrites input data

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

In earth2studio.perturbations.lagged, the LaggedEnsemble perturbation strategy overwrites the passed data array:

x[i] = fetch_data(
    source=self.source,
    time=coords["time"] + lag,
    variable=coords["variable"],
    lead_time=coords["lead_time"],
    device=x.device,
)[0]

This can lead to bugs if the input array is reused in another perturbation strategy. We propose the following fix:

y = torch.clone(x)
for i, lag in enumerate(self.lags):
    y[i] = fetch_data(
        source=self.source,
        time=coords["time"] + lag,
        variable=coords["variable"],
        lead_time=coords["lead_time"],
        device=y.device,
    )[0]

🐛[BUG]: Infinite recursion with batch coords if batch is missing from coord system

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

We should check whether batch exists in the input and raise an informative error if not. Alternatively, input coords should be a function that returns a copy of the coord dict.

from earth2studio.models.px import SFNO
model = SFNO(None, None, None)
in_coords = model.input_coords
del in_coords['batch']

out_coords = model.output_coords(in_coords)

  File "/code/earth2studio/earth2studio/models/batch.py", line 330, in _wrapper
    flatten_coords, batched_coords = self._compress_batch(model, input_coords)
  File "/code/earth2studio/earth2studio/models/batch.py", line 272, in _compress_batch
    and next(iter(model.output_coords(model.input_coords))) != "batch"
  File "/code/earth2studio/earth2studio/models/batch.py", line 330, in _wrapper
    flatten_coords, batched_coords = self._compress_batch(model, input_coords)
  File "/code/earth2studio/earth2studio/models/batch.py", line 272, in _compress_batch
    and next(iter(model.output_coords(model.input_coords))) != "batch"
  File "/code/earth2studio/earth2studio/models/batch.py", line 330, in _wrapper
    flatten_coords, batched_coords = self._compress_batch(model, input_coords)
  File "/code/earth2studio/earth2studio/models/batch.py", line 272, in _compress_batch
    and next(iter(model.output_coords(model.input_coords))) != "batch"
  File "/code/earth2studio/earth2studio/models/batch.py", line 330, in _wrapper
    flatten_coords, batched_coords = self._compress_batch(model, input_coords)
  File "/code/earth2studio/earth2studio/models/batch.py", line 272, in _compress_batch
    and next(iter(model.output_coords(model.input_coords))) != "batch"
  File "/code/earth2studio/earth2studio/models/batch.py", line 330, in _wrapper
    flatten_coords, batched_coords = self._compress_batch(model, input_coords)
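
The first option, failing early with an informative error, might be sketched like this. The function name check_batch_dim and the example coordinates are hypothetical, and an OrderedDict is assumed as a stand-in for the coordinate system; where such a check would sit inside the batch wrapper is not shown.

```python
from collections import OrderedDict


def check_batch_dim(coords: OrderedDict) -> None:
    """Raise a clear error instead of recursing when 'batch' is absent."""
    if "batch" not in coords:
        raise ValueError(
            "Coordinate system is missing the 'batch' dimension; "
            f"got dimensions {list(coords)}"
        )


coords = OrderedDict(batch=[], lead_time=[0], variable=["t2m"])
check_batch_dim(coords)          # passes silently
del coords["batch"]
try:
    check_batch_dim(coords)
except ValueError as err:
    print(err)
```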

🐛[BUG]: SFNO/Makani missing some needed deps and fails on vanilla Pytorch container

Version

main

On which installation method(s) does this occur?

Pip, Source

Describe the issue

Presently, the SFNO model requires Makani to be installed.

This is inside nvcr.io/nvidia/pytorch:24.01-py3.

The current dependencies of Makani, and also those of e2studio, are missing some items. Not sure what triggered this. I think something was updated in Modulus version 0.6.0 that removed some dependencies Makani assumed would be installed.

The packages that need to be added are:

pip install ruamel.yaml
pip install torch-harmonics
pip install tensorly
pip install tensorly-torch

🐛[BUG]: Pip install not working

Version

0.1.0

On which installation method(s) does this occur?

Pip

Describe the issue

Pip install presently does not work. We are still working on the final deployment. Install from source in the meantime.

🚀[FEA]: Improve caching experience

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Presently the model cache works fine; it operates in the .cache folder.

However, all files are stored there, so the file names are hashed to prevent conflicts.
This makes it hard to tell which file is which and in some cases, such as ONNX checkpoints with separate weight files, can break things.

I think this can be improved by:

  1. Making name hashing optional; this is allowed in fsspec and should be an option here
  2. Making the models' default cache location its own subfolder
  3. Not hashing model file names by default; since we control the location, we can safely avoid conflicts
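
The three points above can be illustrated with a small path helper. Everything here is hypothetical: the function name cache_path, the namespace argument, the example URL, and the choice of SHA-256 are assumptions for illustration and do not describe how Earth2Studio or fsspec name cached files.

```python
import hashlib
from pathlib import Path, PurePosixPath


def cache_path(root: Path, namespace: str, url: str, hash_names: bool = False) -> Path:
    """Build a cache location under a per-component subfolder.

    With hash_names=False the original file name is kept, so files that
    refer to each other by name (e.g. separate weight files) stay linked.
    """
    name = PurePosixPath(url).name
    if hash_names:
        name = hashlib.sha256(url.encode()).hexdigest()
    return root / namespace / name


url = "https://example.com/checkpoints/model.onnx"  # hypothetical URL
print(cache_path(Path(".cache/earth2studio"), "models", url))
print(cache_path(Path(".cache/earth2studio"), "models", url, hash_names=True))
```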

🚀[FEA]: Overwriting NetCDF4 output files

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Low (would be nice)

Please provide a clear description of problem you would like to solve.

Trying to write to a NetCDF4 file that already exists currently produces the following error:

AssertionError: Warning! <variable> is already in the NetCDF Store.

This is not particularly clear as to the nature of what went wrong. I also think most people would expect that the default behavior is to overwrite the file if it already exists, as most utilities do. I'd suggest two improvements:

  1. Make the error message clearer as to why it occurred (i.e. the file already existed)
  2. Add an option to overwrite any existing file (maybe even make it the default behavior)

Item 2 should be achievable by using write mode w instead of r+ in:

self.root = Dataset(
    file_name,
    "r+",
    format="NETCDF4",
    diskless=diskless,
    persist=persist if diskless else False,
)
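
Both suggestions could be combined in a small mode-selection helper, sketched below with the standard library only. The function name netcdf_open_mode and the overwrite flag are hypothetical; the returned string is meant to be passed as the mode argument of netCDF4's Dataset, where "w" replaces an existing file.

```python
import os
import tempfile


def netcdf_open_mode(file_name: str, overwrite: bool = False) -> str:
    """Pick the Dataset mode, failing with a clear message on conflicts."""
    if not os.path.exists(file_name) or overwrite:
        return "w"  # create a new file, or replace the existing one
    raise FileExistsError(
        f"{file_name} already exists; pass overwrite=True to replace it"
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.nc")
    print(netcdf_open_mode(path))                  # new file
    open(path, "w").close()                        # simulate an earlier run
    print(netcdf_open_mode(path, overwrite=True))  # explicit overwrite
    try:
        netcdf_open_mode(path)
    except FileExistsError as err:
        print(type(err).__name__)
```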

🐛[BUG]: ARCO does not work in notebooks

Version

Current main, commit ee2f30b

On which installation method(s) does this occur?

Source

Describe the issue

Two issues with the ARCO data source:

  1. When using ARCO in a Jupyter notebook, it complains about the event loop already running. Maybe check for an existing event loop and use that instead of calling asyncio.run. Example (in a notebook):
from datetime import datetime
from earth2studio.data import ARCO

arco = ARCO(cache=True, verbose=False)
ds = arco(datetime.fromisoformat("1980-01-01"), ["u10m"])

This will result in:

File /workspace/repos/earth2studio/earth2studio/data/arco.py:122, in ARCO.__call__(self, time, variable)
    119 # Make sure input time is valid
    120 self._validate_time(time)
--> 122 xr_array = asyncio.run(
    123     asyncio.wait_for(self.create_data_array(time, variable), self.async_timeout)
    124 )
    126 # Delete cache if needed
    127 if not self._cache:

File /usr/lib/python3.10/asyncio/runners.py:33, in run(main, debug)
      9 """Execute the coroutine and return the result.
     10 
     11 This function runs the passed coroutine, taking care of
   (...)
     30     asyncio.run(main())
     31 """
     32 if events._get_running_loop() is not None:
---> 33     raise RuntimeError(
     34         "asyncio.run() cannot be called from a running event loop")
     36 if not coroutines.iscoroutine(main):
     37     raise ValueError("a coroutine was expected, got {!r}".format(main))

RuntimeError: asyncio.run() cannot be called from a running event loop
  2. When cache=False, ARCO complains about DistributedManager not being initialized. Same example as above just with cache=False gives the following:
File /workspace/repos/earth2studio/earth2studio/data/arco.py:117, in ARCO.__call__(self, time, variable)
    115 time, variable = prep_data_inputs(time, variable)
    116 # Create cache dir if doesnt exist
--> 117 pathlib.Path(self.cache).mkdir(parents=True, exist_ok=True)
    119 # Make sure input time is valid
    120 self._validate_time(time)

File /workspace/repos/earth2studio/earth2studio/data/arco.py:238, in ARCO.cache(self)
    235 cache_location = os.path.join(datasource_cache_root(), "arco")
    236 if not self._cache:
    237     cache_location = os.path.join(
--> 238         cache_location, f"tmp_{DistributedManager().rank}"
    239     )
    240 return cache_location

File /usr/local/lib/python3.10/dist-packages/modulus/distributed/manager.py:121, in DistributedManager.__init__(self)
    119 def __init__(self):
    120     if not self._is_initialized:
--> 121         raise ModulusUninitializedDistributedManagerWarning()
    122     super().__init__()

ModulusUninitializedDistributedManagerWarning: A DistributedManager object is being instantiated before this singleton class has been initialized. Instantiating a manager before initialization can lead to unexpected results where processes fail to communicate. Initialize the distributed manager via DistributedManager.initialize() before instantiating.

@NickGeneva
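
For the first of the two issues, one possible workaround is sketched below. Note that it does not reuse the existing loop as suggested above; instead it detects a running loop and runs the coroutine in a worker thread with its own loop, which blocks the caller until the result is ready. The helper name run_sync and the fetch coroutine are hypothetical.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion, even if a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: asyncio.run is safe
        return asyncio.run(coro)

    # A loop is already running (e.g. in Jupyter): use a separate thread,
    # where asyncio.run can create its own loop
    result: dict = {}

    def worker() -> None:
        try:
            result["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in result:
        raise result["error"]
    return result["value"]


async def fetch() -> str:
    await asyncio.sleep(0.01)  # simulates an async download
    return "data"


async def notebook_like() -> str:
    return run_sync(fetch())  # called while a loop is running


print(run_sync(fetch()))             # plain script: no loop running
print(asyncio.run(notebook_like()))  # mimics a notebook cell
```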
