aditya-grover / climate-learn
291 stars · 6 watchers · 48 forks · 15.83 MB

Source code for ClimateLearn

License: MIT License

Languages: Python 33.31% · Jupyter Notebook 66.69%
Topics: climate-change, climate-science, deep-learning, machine-learning

climate-learn's People

Contributors

aditya-grover, bulaientang, hritikbansal, jasonjewik, prakhar6sharma, se0ngbin, siddharthnandy, tung-nd


climate-learn's Issues

DataModule error when using 1-pressure level variable

Describe the bug

I tried to use DataModule for geopotential at 500 hPa and also for temperature at 850 hPa, as is done for surface variables, e.g., 2m temperature or total precipitation. However, when using DataModule for one variable at one pressure level (e.g., geopotential_500), it returns an error when trying to load the data (from load_from_nc).

To Reproduce
Steps to reproduce the behavior:

from climate_learn.utils.datetime import Year, Days, Hours
from climate_learn.data import DataModule

data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = DATADIR,
    in_vars = ["geopotential"],
    out_vars = ["geopotential"],
    train_start_year = Year(2015),
    val_start_year = Year(2016),
    test_start_year = Year(2017),
    end_year = Year(2018),
    pred_range = Days(3),
    subsample = Hours(1),
    batch_size = 128,
    num_workers = 1
)

The error I got:

      ----> 5 data_module = DataModule(
            6     dataset = "ERA5",
            7     task = "forecasting",
            8     root_dir = DATADIR,
            9     in_vars = ["geopotential"],
           10     out_vars = ["geopotential"],
           11     train_start_year = Year(2015),
           12     val_start_year = Year(2016),
           13     test_start_year = Year(2017),
           14     end_year = Year(2018),
           15     pred_range = Days(3),
           16     subsample = Hours(1),
           17     batch_size = 128,
           18     num_workers = 1
           19 )
      
      File ~/.conda/envs/pyTT/lib/python3.10/site-packages/climate_learn/data/module.py:112, in DataModule.__init__(self, dataset, task, root_dir, in_vars, out_vars, train_start_year, val_start_year, test_start_year, end_year, root_highres_dir, history, window, pred_range, subsample, batch_size, num_workers, pin_memory)
          109 caller = eval(f"{dataset.upper()}{task_string}")
          111 train_years = range(train_start_year, val_start_year)
      --> 112 self.train_dataset = caller(
          113     root_dir,
          114     root_highres_dir,
          115     in_vars,
          116     out_vars,
          117     history,
          118     window,
          119     pred_range.hours(),
          120     train_years,
          121     subsample.hours(),
          122     "train",
          123 )
          125 val_years = range(val_start_year, test_start_year)
          126 self.val_dataset = caller(
          127     root_dir,
          128     root_highres_dir,
         (...)
          136     "val",
          137 )
      
      File ~/.conda/envs/pyTT/lib/python3.10/site-packages/climate_learn/data/modules/era5_module.py:113, in ERA5Forecasting.__init__(self, root_dir, root_highres_dir, in_vars, out_vars, history, window, pred_range, years, subsample, split)
           99 def __init__(
          100     self,
          101     root_dir,
         (...)
          110     split="train",
          111 ):
          112     print(f"Creating {split} dataset")
      --> 113     super().__init__(root_dir, root_highres_dir, in_vars, years, split)
          115     self.in_vars = list(self.data_dict.keys())
          116     self.out_vars = out_vars
      
      File ~/.conda/envs/pyTT/lib/python3.10/site-packages/climate_learn/data/modules/era5_module.py:28, in ERA5.__init__(self, root_dir, root_highres_dir, variables, years, split)
           25 self.years = years
           26 self.split = split
      ---> 28 self.data_dict = self.load_from_nc(self.root_dir)
           29 if self.root_highres_dir is not None:
           30     self.data_highres_dict = self.load_from_nc(self.root_highres_dir)
      
      File ~/.conda/envs/pyTT/lib/python3.10/site-packages/climate_learn/data/modules/era5_module.py:69, in ERA5.load_from_nc(self, data_dir)
           67 if len(xr_data.shape) == 3:  # 8760, 32, 64
           68     xr_data = xr_data.expand_dims(dim="level", axis=1)
      ---> 69     data_dict[var].append(xr_data)
           70 else:  # pressure level
           71     for level in DEFAULT_PRESSURE_LEVELS:
      
      KeyError: 'geopotential'

I checked era5_module.py in some more detail, and it seems (to me) that there might be a bug in the for loop at line 60:

if len(xr_data.shape) == 3:  # 8760, 32, 64
    xr_data = xr_data.expand_dims(dim="level", axis=1)
    data_dict[var].append(xr_data)

In this case, no level is considered when using a variable at only one pressure level.
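For illustration, a defensive variant of that block (the setdefault is my guess at a workaround, not a confirmed fix):

if len(xr_data.shape) == 3:  # (time, lat, lon)
    xr_data = xr_data.expand_dims(dim="level", axis=1)
    # create the per-variable list on first use, so a variable stored at a
    # single pressure level does not raise a KeyError
    data_dict.setdefault(var, []).append(xr_data)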

Or did I miss something in the use case of DataModule?
Thanks!

inconsistency between tutorial notebook and resnet model

Describe the bug
"upsampling" removed in resnet kwargs but still being passed in NeurIPS2022_CCAI_Tutorial.ipynb

To Reproduce
Run the downscaling section of NeurIPS2022_CCAI_Tutorial.ipynb.

Expected behavior
The model should not accept "upsampling" as a parameter.


Climatology has incorrect shape for forecasting with history

Describe the bug
The climatology dimension is incorrect.

To Reproduce
Run the script here: https://gist.github.com/jasonjewik/c339325e4ae33c85e4cecc1356fdae38.

Expected behavior
I expect a 2D tensor of shape [32, 64], but instead I'm getting a 3D tensor of shape [3, 32, 64].


Environment

  • OS: Ubuntu 22.04
  • Python version: 3.10
  • Environment: only dependency is climate-learn

Additional context
Even if I set history to 1, I still get a shape error.

Refactoring of climate_learn.models.load_model to support clean integration with LightningCLI

Hi,

The current way to do forecasting and downscaling involves creating a DataModule and calling load_model, which are then passed on to the trainer. The former inherits from LightningDataModule, whereas the latter is a function that returns a LightningModule.

For reproducibility purposes and a clear distinction between source code and hyperparameters, I believe load_model should be refactored.

By refactoring, I mean load_model should be converted to a class that inherits from pl.LightningModule, and we can then use something like LightningCLI (a similar use case is shown in the docs).

This way, via YAML only, one can control which class of model, datamodule, etc. should be created for a given experiment.
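For instance, a minimal sketch of the entry point, assuming load_model has been converted into a hypothetical ForecastingModule(pl.LightningModule) class:

from pytorch_lightning.cli import LightningCLI

from climate_learn.data import DataModule
from my_project.models import ForecastingModule  # hypothetical class

if __name__ == "__main__":
    # invoked as: python train.py fit --config config.yaml
    LightningCLI(ForecastingModule, DataModule)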

Some ideas on how to restructure the data handling of the Climate-learn

Hi,

Adding support for #38 in the current code can become quite messy, as the dataset source class and the task class (e.g., forecasting) are somewhat coupled. As a result, adding support for a new dataset source, task, or data-loading strategy would involve creating loads of new classes and writing redundant code.

To solve this, I believe that the source dataset class, the task-specific class, and the data-loading strategy class (e.g., IterableDataset) should be decoupled.

I tried to formalize this as requirements that a good implementation/refactor should follow.

  1. DataModule would act as an interface to the entire data handling. (DataModule inherits from LightningDataModule.)
  2. To add a new task or combination of tasks (multitasking), one shouldn't make changes to the source dataset class (e.g., ERA5).
  3. Adding a new data source (e.g., CMIP6) for an already supported task (e.g., forecasting) should only involve writing the code for reading and preprocessing the new data source's files.
  4. To add support for a new way of loading the dataset (e.g., map-style, streaming, CorgiPile, etc.), one should write code only for the loading strategy, and it should be dataset-agnostic.

Unfortunately, as of now, I can't think of any structure that satisfies all the requirements. Hence, I opened this as an issue, hoping someone would chime in and share their thoughts; a rough partial sketch is below for concreteness.
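A rough sketch of requirements 1-3 as abstract interfaces (all names are hypothetical, and requirement 4, the loading strategy, is not addressed):

from abc import ABC, abstractmethod

import pytorch_lightning as pl

class DataSource(ABC):
    # knows only how to read and preprocess files (e.g., ERA5, CMIP6)
    @abstractmethod
    def load(self, variables, years): ...

class Task(ABC):
    # turns raw arrays into (input, target) samples (e.g., forecasting)
    @abstractmethod
    def make_samples(self, data): ...

class DataModule(pl.LightningDataModule):
    # single interface to the entire data handling (requirement 1)
    def __init__(self, source: DataSource, task: Task):
        super().__init__()
        self.source, self.task = source, task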

ImportError for the command "from climate_learn.data import DataModule"

Describe the bug

I am following the code on page 25 in the ClimateLearn paper (https://arxiv.org/pdf/2307.01909.pdf) and when I try to do the import for DataModule I get the following error:

In [1]: from climate_learn.data import DataModule

ImportError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from climate_learn.data import DataModule

ImportError: cannot import name 'DataModule' from 'climate_learn.data' (/home/user/miniconda3/envs/precip/lib/python3.11/site-packages/climate_learn/data/__init__.py)

Could someone explain why this import is not working? Perhaps it is related to missing information in the file climate_learn/data/__init__.py.

As a side question, are there any examples of downscaling with a CNN that could be included in another QuickStart notebook example, for instance similar to the one used in the paper? If so, that would be incredibly helpful.

Thank you for your help!
Jeremy

Environment

  • OS: linux
  • Python version: 3.11
  • Environment: conda, used pip install for climate-learn

Installation fails

Dear climate-learn Team!
I tried to install your repository, but unfortunately both methods described in your readme file failed.

The first output looks like this:

[screenshot of the first error output]

The second was a bit weirder, after I downloaded your code base and tried to install with the requirements file:

[screenshot of the second error output]

Do you have any ideas what I could do?

Model Refactor

Summary

The models in ClimateLearn are both cumbersome to use (no snappy way to load presets, climatology has to be manually set, baselines are baked into models) and insufficiently flexible (hard to add new models). In this issue, I outline proposed changes to resolve these issues. Ideally, I want the model quickstart to change from this:

dm = DataModule(...)
model_kwargs = {...}
optim_kwargs = {...}
mm = load_model(
    name="resnet",
    task="forecasting",
    model_kwargs=model_kwargs,
    optim_kwargs=optim_kwargs
)
set_climatology(mm, dm)
fit_lin_reg_baseline(mm, dm, reg_hparam=0.0)

To this:

dm = DataModule(...)
mm = load_forecasting_module(dm, preset="rasp-thuerey-2020")
trainer = Trainer(...)
trainer.fit(mm, dm)
trainer.test(mm, dm)

Model Loading

Examples

Here, I show examples for load_forecasting_module. Everything is analogous for load_downscaling_module. The reason why I split the original load_model function by task is to mirror the data module design, where ForecastingArgs is a distinct class from DownscalingArgs.

  1. load_forecasting_module(data_module, preset)

    Loads a preset model module. For example, the following loads the model and optimizers described in this paper.

    load_forecasting_module(dm, preset="rasp-thuerey-2020")

    Presets also exist for baselines.

    load_forecasting_module(dm, preset="climatology")
    load_forecasting_module(dm, preset="persistence")
    load_forecasting_module(dm, preset="linear-regression")
  2. load_forecasting_module(data_module, preset, model_kwargs)

    Loads a preset model module. The user can also pass keyword arguments to modify the model architecture. For example, the following loads Rasp and Thuerey's model, but changes the dropout.

    load_forecasting_module(
        dm,
        preset="rasp-theurey-2020",
        model_kwargs={"dropout": 0.3}
    )
  3. load_forecasting_module(data_module, preset, model_kwargs, optim, optim_kwargs)

Loads a preset model module. The user can also pass keyword arguments to modify the model architecture. They can also specify the name of an optimizer (which is built into ClimateLearn) and keyword arguments for the optimizer. For example,

    load_forecasting_module(
        dm,
        preset="rasp-theurey-2020",
        model_kwargs={"dropout": 0.3},
        optim="adamw",
        optim_kwargs={"betas": (0.9, 0.95)}
    )
  4. load_forecasting_module(data_module, preset, model_kwargs, optimizer)

Loads a preset model module. The user can also pass keyword arguments to modify the model architecture. They can also specify an already instantiated optimizer. For example,

    load_forecasting_module(
        dm,
        preset="rasp-theurey-2020",
        model_kwargs={"dropout": 0.3},
        optimizer=my_cool_optimizer
    )
  5. load_forecasting_module(data_module, model, model_kwargs, optim, optim_kwargs)

    Loads a model module with the given model and optimizer, which are defined in ClimateLearn but can be customized by model_kwargs and optim_kwargs. For example,

    load_forecasting_module(
        dm,
        model="resnet",
        model_kwargs={"n_blocks": 2},
        optim="adamw",
        optim_kwargs={"betas": (0.9, 0.95)}
    )
  6. load_forecasting_module(data_module, model, model_kwargs, optimizer)

    Loads a model module with the given model, which is defined in ClimateLearn but can be customized by model_kwargs. The optimizer is specified separately. For example:

    load_forecasting_module(
data_module,
        model="resnet",
        model_kwargs={"n_blocks": 2},
        optimizer=my_cool_optimizer
    )
  7. load_forecasting_module(data_module, net, optimizer)

    Loads a model module which wraps the user-specified network and optimizer. For example:

    load_forecasting_module(
        data_module,
        net=my_cool_network,
        optimizer=my_cool_optimizer,
    )

Function Signature

load_xxx_module(
    data_module: pl.LightningDataModule,
    preset: Optional[str] = None,
    model: Optional[str] = None,
    model_kwargs: Optional[Dict[str, Any]] = None,
    optim: Optional[str] = None,
    optim_kwargs: Optional[Dict[str, Any]] = None,
    net: Optional[torch.nn.Module] = None,
    optimizer: Optional[Union[torch.optim.Optimizer, Dict[str, Any]]] = None,
    train_loss: Optional[Union[Callable, List[Callable]]] = None,
    val_loss: Optional[Union[Callable, List[Callable]]] = None,
    test_loss: Optional[Union[Callable, List[Callable]]] = None
)

Note that preset and model are aliases for each other. They are kept as two distinct arguments for the sake of clarity. For example, the following two function calls return the same module:

  • load_forecasting_module(dm, preset="rasp-thuerey-2020")
  • load_forecasting_module(dm, model="rasp-thuerey-2020")

But in the first case, it is more obvious that the user wants the model which has been defined in Rasp and Thuerey (2020). If both preset and model are specified, a RuntimeError will be thrown. The same behavior applies when net is passed even though model is specified, or to any other argument conflict.

The optimizer argument can be either a PyTorch optimizer or a dictionary containing two keys: "optimizer" and "lr_scheduler". If it is just a PyTorch optimizer, no scheduler is used for the optimization.
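For concreteness, a hypothetical call using the dictionary form (net is a stand-in module; load_forecasting_module and dm are from the examples above):

import torch
from torch import nn

net = nn.Conv2d(2, 2, kernel_size=3, padding=1)  # stand-in network
opt = torch.optim.AdamW(net.parameters(), lr=5e-4)
mm = load_forecasting_module(
    dm,
    net=net,
    optimizer={
        "optimizer": opt,
        "lr_scheduler": torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50),
    },
)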

I also add arguments for specifying loss functions. If these are left as None, the default loss functions specified in ClimateLearn will be used. However, the user might want this flexibility; for example, someone might be interested in using the AtmoDist loss for downscaling.
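Similarly, a hypothetical custom train loss under this proposal (weighted_mae is an illustrative callable, not a ClimateLearn builtin):

import torch

def weighted_mae(pred, target):
    # toy example of a user-supplied loss callable
    return torch.mean(torch.abs(pred - target))

mm = load_forecasting_module(dm, preset="rasp-thuerey-2020", train_loss=weighted_mae)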

How does this solve existing problems?

  1. The user can easily load presets. I've shown this for Rasp and Thuerey, but we could also include ClimaX, Weyn et al. (2020), and others. Besides just loading the architectures, when possible, we can also load pre-trained models. For example, we could have both "climax", which loads the untrained ClimaX model, and "climax-pretrained", which loads the pre-trained ClimaX model.

  2. Climatology is set automatically. ClimateLearn requires climatology to be set before training. It doesn't make sense to require the user to remember to do this. Here, climatology is set in the load_xxx_module function. I show how this is done below.

  3. Baselines are not baked into models. As pointed out in Issue 83, it doesn't always make sense to run persistence because the data module might not support it. Furthermore, the user might not care to see these baselines. In my proposed changes, we separate out the baselines into their own models. If the user wishes to run climatology, persistence, or linear regression, they can do that the same way as any other model. For example,

    load_forecasting_module(dm, preset="climatology")
  4. New models are easier to add. The user can modify ClimateLearn's presets (e.g., Rasp and Thuerey, ClimaX) and built-in architectures (e.g., ResNet, ViT), and they can define their own network and/or optimizer and pass these to the load_xxx_module function. We can include a page in the documentation about what API is expected for forecasting networks versus downscaling networks.

Setting Climatology

In the load_xxx_module function, we can do the following to set climatology automatically.

def load_forecasting_module(dm, ...):
    # ...
    mm = ForecastingLitModule(...)
    mm.set_climatology(dm.get_climatology("all"))
    # ...

This relies upon pull request 81 being merged, and also a minor change to DataModule.get_climatology.

Baselines

For the persistence baseline, we can do the following to determine if it is available.

def load_forecasting_module(dm, ...):
    # ...
    if preset == "persistence":
        if set(dm.out_vars).issubset(dm.in_vars):
mm = ForecastingLitModule(...)
        else:
            raise RuntimeError()
    # ...

Again, this would require just a minor change so that the input variables and the output variables of the dataset are both available at the DataModule level.

Conclusion

In making these changes, I aim for the following two goals. First, to make it easier to run benchmark models. Second, to make it easier to add a custom model. The flexibility of my proposed API allows for a balance between these two goals.

Efficient loading for ERA5 from netcdf using chunks

Is your feature request related to a problem? Please describe.
Currently, the iter approach first saves files into .npz chunks and then reloads them using NpyReader.

Describe the solution you'd like
Instead of duplicating the data into .npz files, can we read from the netCDF files without loading all of them into memory at the same time?

I don't know how to do this; I'm just creating an issue for now.
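One possible direction is xarray with dask-backed chunks (the file layout and the "z" short name for geopotential are assumptions):

import xarray as xr

# dask-backed open: nothing is read into memory yet
ds = xr.open_mfdataset(
    "root_dir/geopotential/*.nc",
    combine="by_coords",
    chunks={"time": 100},
)
# only this slice is actually loaded into memory
batch = ds["z"].isel(time=slice(0, 128)).load()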

dimension mismatch between ground truth & prediction for downscaling task

Describe the bug
During visualization of the downscaling task, there seems to be a mismatch between the dimensions of pred and gt.

To Reproduce
Simply run NeurIPS2022_CCAI_Tutorial.ipynb or Visualization.ipynb in docs/notebooks on the tutorials-split branch.


out_vars in ERA5Forecasting module works only when out_vars is a subset of in_vars

Describe the bug
When calling __init__ for ERA5Forecasting, the arguments passed to its parent class are (root_dir, root_highres_dir, in_vars, years, split). Because of this, while creating data_dict, only variables that are part of in_vars are loaded from the netCDF files. See lines 45 and 61.

This can result in a potential bug at line 122, where we create the output data, if the output variables are not a subset of the input variables.
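A sketch of one possible fix in ERA5Forecasting.__init__ (signature as described above): load the union of the input and output variables, so every output variable ends up in data_dict.

super().__init__(
    root_dir,
    root_highres_dir,
    list(set(in_vars) | set(out_vars)),  # in_vars alone misses extra out_vars
    years,
    split,
)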

ShardDataset doesn't work for DDP

Describe the bug
ShardDataset doesn't work with DDP, though it works with ddp_spawn: training just hangs before the start of the first epoch.

Training on 3D variables does not work anymore

Hello, I recently tried to load ERA5 using the updated climate-learn package:

era5_data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = era_path,
    in_vars = ["temperature"],
    out_vars = ["temperature"],
    train_start_year = Year(1979),
    val_start_year = Year(2011),
    test_start_year = Year(2013),
    end_year = Year(2014),
    pred_range = Days(5),
    subsample = Hours(6),
    batch_size = 128,
    num_workers = 1
)

Running the above code produces the following error:

KeyError                                  Traceback (most recent call last)
Cell In [16], line 1
----> 1 era5_data_module = DataModule(
      2     dataset = "ERA5",
      3     task = "forecasting",
      4     root_dir = era_path,
      5     in_vars = ["temperature"],
      6     out_vars = ["temperature"],
      7     train_start_year = Year(1979),
      8     val_start_year = Year(2011),
      9     test_start_year = Year(2013),
     10     end_year = Year(2014),
     11     pred_range = Days(5),
     12     subsample = Hours(6),
     13     batch_size = 128,
     14     num_workers = 1
     15 )

File ~/climate-learn/src/climate_learn/data/module.py:58, in DataModule.__init__(self, dataset, task, root_dir, in_vars, out_vars, train_start_year, val_start_year, test_start_year, end_year, root_highres_dir, history, window, pred_range, subsample, batch_size, num_workers, pin_memory)
     55 caller = eval(f"{dataset.upper()}{task_string}")
     57 train_years = range(train_start_year, val_start_year)
---> 58 self.train_dataset = caller(
     59     root_dir,
     60     root_highres_dir,
     61     in_vars,
     62     out_vars,
     63     history,
     64     window,
     65     pred_range.hours(),
     66     train_years,
     67     subsample.hours(),
     68     "train",
     69 )
     71 val_years = range(val_start_year, test_start_year)
     72 self.val_dataset = caller(
     73     root_dir,
     74     root_highres_dir,
   (...)
     82     "val",
     83 )

File ~/climate-learn/src/climate_learn/data/modules/era5_module.py:122, in ERA5Forecasting.__init__(self, root_dir, root_highres_dir, in_vars, out_vars, history, window, pred_range, years, subsample, split)
    119 self.pred_range = pred_range
    121 inp_data = xr.concat([self.data_dict[k] for k in self.in_vars], dim="level")
--> 122 out_data = xr.concat([self.data_dict[k] for k in self.out_vars], dim="level")
    123 self.inp_data = inp_data.to_numpy().astype(np.float32)
    124 self.out_data = out_data.to_numpy().astype(np.float32)

File ~/climate-learn/src/climate_learn/data/modules/era5_module.py:122, in <listcomp>(.0)
    119 self.pred_range = pred_range
    121 inp_data = xr.concat([self.data_dict[k] for k in self.in_vars], dim="level")
--> 122 out_data = xr.concat([self.data_dict[k] for k in self.out_vars], dim="level")
    123 self.inp_data = inp_data.to_numpy().astype(np.float32)
    124 self.out_data = out_data.to_numpy().astype(np.float32)

KeyError: 'temperature'

Same problem happens with other pressure-level variables such as geopotential.

Deterministic randomness in ShardDataset

Is your feature request related to a problem? Please describe.
Currently, ShardDataset implements __iter__() to build batches. Inside __iter__, the order of the data is determined by self.epoch. Unfortunately, self.epoch is incremented only in the child process and not in the parent process; as a result, each epoch uses the same shuffling order.

Describe the solution you'd like
Having access to the trainer to retrieve the epoch number, or setting up communication across child processes inside __iter__().
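One workaround pattern, borrowed from how DistributedSampler handles this (the set_epoch hook is a suggestion, not existing climate-learn API): have the parent process set the epoch before workers are spawned each epoch, and derive the shuffle order from it.

import torch

class ShardDataset(torch.utils.data.IterableDataset):
    def __init__(self, n):
        self.n = n
        self.epoch = 0

    def set_epoch(self, epoch):
        # called in the parent process before each epoch, so freshly
        # spawned workers see the new value
        self.epoch = epoch

    def __iter__(self):
        # epoch-seeded generator: a different but reproducible order per epoch
        g = torch.Generator().manual_seed(self.epoch)
        yield from torch.randperm(self.n, generator=g).tolist()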

Bug in the persistence baseline for forecasting

Describe the bug
The persistence baseline for forecasting assumes that the last input value can be used as the output. But this makes the assumption that the input variables are the same as the output variables. See this.

Table 3 (Downscaling experiments results) reports `RMSE` and not `LatWeightedRMSE`

Describe the bug
Table 3 in Section 4.2 of the paper "ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling" reports RMSE, but Appendix B.4.3 (Climate downscaling metrics) points to Latitude-Weighted RMSE (Eq. 2 in the Appendix). I ran the code locally and confirmed that the numbers reported in the paper are RMSE and not Latitude-Weighted RMSE.

Also, I have a question: why is lat_mse used for training the forecasting module while mse is used for training the downscaling module?

Snapshots

[Table 3 from the paper]

Code snippet

The following snippet shows that load_forecasting_module uses lat_rmse as the test loss, whereas load_downscaling_module uses rmse.

load_forecasting_module = partial(
    load_model_module,
    task="forecasting",
    train_loss="lat_mse",
    val_loss=["lat_rmse", "lat_acc", "lat_mse"],
    test_loss=["lat_rmse", "lat_acc"],
    train_target_transform=None,
    val_target_transform=["denormalize", "denormalize", None],
    test_target_transform=["denormalize", "denormalize"],
)
load_climatebench_module = partial(
    load_model_module,
    task="forecasting",
    train_loss="mse",
    val_loss=["mse"],
    test_loss=["lat_nrmses", "lat_nrmseg", "lat_nrmse"],
    train_target_transform=None,
    val_target_transform=[nn.Identity()],
    test_target_transform=[nn.Identity(), nn.Identity(), nn.Identity()],
)
load_downscaling_module = partial(
    load_model_module,
    task="downscaling",
    train_loss="mse",
    val_loss=["rmse", "pearson", "mean_bias", "mse"],
    test_loss=["rmse", "pearson", "mean_bias"],
    train_target_transform=None,
    val_target_transform=["denormalize", "denormalize", "denormalize", None],
    test_target_transform=["denormalize", "denormalize", "denormalize"],
)
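For reference, a from-scratch sketch of latitude-weighted RMSE as defined by a cosine-latitude weight (Eq. 2 in the paper's appendix; array shapes here are assumptions):

import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    """pred, target: (lat, lon) arrays; lats_deg: (lat,) latitudes in degrees."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()  # normalize the weights to mean 1
    return np.sqrt(np.mean((pred - target) ** 2 * w[:, None]))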

Support for arbitrary optimizer

Is your feature request related to a problem? Please describe.
Not exactly a problem, but it would be nice to have support for optimizers other than just Adam and AdamW.

Describe the solution you'd like
Give the user the flexibility to choose their own optimizer, as long as it inherits from torch.optim.Optimizer.
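One possible shape for this (names are illustrative, not the current climate-learn API):

import torch

def build_optimizer(params, name: str, **kwargs) -> torch.optim.Optimizer:
    # resolve any optimizer class in torch.optim by name,
    # e.g. "SGD", "RMSprop", "AdamW"
    cls = getattr(torch.optim, name, None)
    if not (isinstance(cls, type) and issubclass(cls, torch.optim.Optimizer)):
        raise ValueError(f"{name!r} is not an optimizer in torch.optim")
    return cls(params, **kwargs)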

CMIP6 data processing

Hello.
I am modifying quickstart.ipynb for the CMIP6 case. I downloaded CMIP6 data using download_mpi_esm1_2_hr. When I then process the *.nc files, it throws an error. It would be good to have sample scripts for downloading and processing the other two datasets as well (CMIP6 and PRISM).

I also tried downloading using WeatherBench, and there I get a different error.

cl.data.download_mpi_esm1_2_hr(
    dst="./dataset/cmip6/temperature",
    variable="temperature",
)

cl.data.download_mpi_esm1_2_hr(
    dst="./dataset/cmip6/geopotential",
    variable="geopotential",
)

convert_nc2npz(
    root_dir="./dataset/cmip6",
    save_dir="./dataset/cmip6/processed",
    variables=["temperature", "geopotential"],
    start_train_year=1850,
    start_val_year=2000,
    start_test_year=2005,
    end_year=2015,
    num_shards=16
)

#########################
Error Message (download_mpi_esm1_2_hr):
#########################

ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 convert_nc2npz(
2 root_dir="../dataset/cmip6",
3 save_dir="../dataset/cmip6/processed",
4 variables=["temperature", "geopotential"],
5 start_train_year=1850,
6 start_val_year=2000,
7 start_test_year=2005,
8 end_year=2015,
9 num_shards=16
10 )

File /#########################/climate-learn/src/climate_learn/data/processing/nc2npz.py:189, in convert_nc2npz(root_dir, save_dir, variables, start_train_year, start_val_year, start_test_year, end_year, num_shards)
185 test_years = range(start_test_year, end_year)
187 os.makedirs(save_dir, exist_ok=True)
--> 189 nc2np(root_dir, variables, train_years, save_dir, "train", num_shards)
190 nc2np(root_dir, variables, val_years, save_dir, "val", num_shards)
191 nc2np(root_dir, variables, test_years, save_dir, "test", num_shards)

File /#########################/climate-learn/src/climate_learn/data/processing/nc2npz.py:58, in nc2np(path, variables, years, save_dir, partition, num_shards_per_year)
56 for var in variables:
57 ps = glob.glob(os.path.join(path, var, f"{year}.nc"))
---> 58 ds = xr.open_mfdataset(
59 ps, combine="by_coords", parallel=True
60 ) # dataset for a single variable
61 code = NAME_TO_VAR[var]
63 if len(ds[code].shape) == 3: # surface level variables

File /#########################/lib/python3.9/site-packages/xarray/backends/api.py:1046, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
1041 datasets = [preprocess(ds) for ds in datasets]
1043 if parallel:
1044 # calling compute here will return the datasets/file_objs lists,
1045 # the underlying datasets will still be stored as dask arrays
-> 1046 datasets, closers = dask.compute(datasets, closers)
1048 # Combine all datasets, closing them in case of a ValueError
1049 try:

File /#########################/lib/python3.9/site-packages/dask/base.py:595, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
592 keys.append(x.dask_keys())
593 postcomputes.append(x.dask_postcompute())
--> 595 results = schedule(dsk, keys, **kwargs)
596 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /#########################/lib/python3.9/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
86 elif isinstance(pool, multiprocessing.pool.Pool):
87 pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
90 pool.submit,
91 pool._max_workers,
92 dsk,
93 keys,
94 cache=cache,
95 get_id=_thread_get_id,
96 pack_exception=pack_exception,
97 **kwargs,
98 )
100 # Cleanup pools associated to dead threads
101 with pools_lock:

File /#########################/lib/python3.9/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
509 _execute_task(task, data) # Re-execute locally
510 else:
--> 511 raise_exception(exc, tb)
512 res, worker_id = loads(res_info)
513 state["cache"][key] = res

File /#########################/lib/python3.9/site-packages/dask/local.py:319, in reraise(exc, tb)
317 if exc.traceback is not tb:
318 raise exc.with_traceback(tb)
--> 319 raise exc

File /#########################/lib/python3.9/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
222 try:
223 task, data = loads(task_info)
--> 224 result = _execute_task(task, data)
225 id = get_id()
226 result = dumps((result, id))

File /#########################/lib/python3.9/site-packages/dask/core.py:121, in _execute_task(arg, cache, dsk)
117 func, args = arg[0], arg[1:]
118 # Note: Don't assign the subtask results to a variable. numpy detects
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg

File /#########################/lib/python3.9/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
42 """Apply a function given its positional and keyword arguments.
43
44 Equivalent to func(*args, **kwargs)
(...)
70 >>> dsk = {'task-name': task} # adds the task to a low level Dask task graph
71 """
72 if kwargs:
---> 73 return func(*args, **kwargs)
74 else:
75 return func(*args)

File /#########################/lib/python3.9/site-packages/xarray/backends/api.py:547, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
544 kwargs.update(backend_kwargs)
546 if engine is None:
--> 547 engine = plugins.guess_engine(filename_or_obj)
549 if from_array_kwargs is None:
550 from_array_kwargs = {}

File /#########################/lib/python3.9/site-packages/xarray/backends/plugins.py:197, in guess_engine(store_spec)
189 else:
190 error_msg = (
191 "found the following matches with the input file in xarray's IO "
192 f"backends: {compatible_engines}. But their dependencies may not be installed, see:\n"
193 "https://docs.xarray.dev/en/stable/user-guide/io.html \n"
194 "https://docs.xarray.dev/en/stable/getting-started-guide/installing.html"
195 )
--> 197 raise ValueError(error_msg)

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html


#########################
Error Message (weatherbench):
#########################

File /#########################/climate-learn/src/climate_learn/data/processing/nc2npz.py:189, in convert_nc2npz(root_dir, save_dir, variables, start_train_year, start_val_year, start_test_year, end_year, num_shards)
185 test_years = range(start_test_year, end_year)
187 os.makedirs(save_dir, exist_ok=True)
--> 189 nc2np(root_dir, variables, train_years, save_dir, "train", num_shards)
190 nc2np(root_dir, variables, val_years, save_dir, "val", num_shards)
191 nc2np(root_dir, variables, test_years, save_dir, "test", num_shards)

File /#########################/climate-learn/src/climate_learn/data/processing/nc2npz.py:95, in nc2np(path, variables, years, save_dir, partition, num_shards_per_year)
93 else: # pressure-level variables
94 assert len(ds[code].shape) == 4
---> 95 all_levels = ds["level"][:].to_numpy()
96 all_levels = np.intersect1d(all_levels, DEFAULT_PRESSURE_LEVELS)
97 for level in all_levels:

File /#########################/lib/python3.9/site-packages/xarray/core/dataset.py:1473, in Dataset.getitem(self, key)
1471 return self.isel(**key)
1472 if utils.hashable(key):
-> 1473 return self._construct_dataarray(key)
1474 if utils.iterable_of_hashable(key):
1475 return self._copy_listed(key)

File /#########################/lib/python3.9/site-packages/xarray/core/dataset.py:1384, in Dataset._construct_dataarray(self, name)
1382 variable = self._variables[name]
1383 except KeyError:
-> 1384 _, name, variable = _get_virtual_variable(self._variables, name, self.dims)
1386 needed_dims = set(variable.dims)
1388 coords: dict[Hashable, Variable] = {}

File /#########################/lib/python3.9/site-packages/xarray/core/dataset.py:196, in _get_virtual_variable(variables, key, dim_sizes)
194 split_key = key.split(".", 1)
195 if len(split_key) != 2:
--> 196 raise KeyError(key)
198 ref_name, var_name = split_key
199 ref_var = variables[ref_name]

KeyError: 'level'

Climate-learn should have jupyter as a dependency

The climate_learn.data.DataModule uses tqdm to display progress when loading the dataset.

Unfortunately, it requires Jupyter and ipywidgets as dependencies; otherwise it throws an error.

Screenshot of the error attached below.

[screenshot of the error]

I have also added the snippet of code which I tried to run as a reference.

from climate_learn.utils.datetime import Year, Days, Hours
from climate_learn.data import DataModule

def main():

    data_module = DataModule(
        dataset = "ERA5",
        task = "forecasting",
        root_dir = "/data0/datasets/weatherbench/data/weatherbench/era5/5.625deg/",
        in_vars = ["2m_temperature"],
        out_vars = ["2m_temperature"],
        train_start_year = Year(2005),
        val_start_year = Year(2015),
        test_start_year = Year(2016),
        end_year = Year(2016),
        pred_range = Days(3),
        subsample = Hours(6),
        batch_size = 128,
        num_workers = 1
    )

if __name__ == "__main__":
    main()

This issue is resolved by running pip install jupyter.
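A possible library-side fix that would avoid the hard Jupyter dependency is tqdm.auto, which uses the notebook widget when ipywidgets is available and falls back to the plain console bar otherwise:

from tqdm.auto import tqdm

for _ in tqdm(range(100), desc="Creating train dataset"):
    pass  # placeholder for the per-variable loading loop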

Bug in climatology baseline

Describe the bug
Forecasting does not work with multiple variables as input/output

To Reproduce

data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = path,
    in_vars = ["2m_temperature", "total_cloud_cover"],
    out_vars = ["2m_temperature", "total_cloud_cover"],
    train_start_year = Year(2010), # change
    val_start_year = Year(2015),
    test_start_year = Year(2017),
    end_year = Year(2018),
    pred_range = Days(3),
    subsample = Hours(6),
    batch_size = 32,
    num_workers= 64,
)

model_kwargs = {
    "in_channels": len(data_module.hparams.in_vars),
    "out_channels": len(data_module.hparams.out_vars),
    "n_blocks": 4
}

optim_kwargs = {
    "lr": 1e-4,
    "weight_decay": 1e-5,
    "warmup_epochs": 1,
    "max_epochs": 5,
}

model_module = load_model(name = "resnet", task = "forecasting", model_kwargs = model_kwargs, optim_kwargs = optim_kwargs)

set_climatology(model_module, data_module)
fit_lin_reg_baseline(model_module, data_module, reg_hparam=0.0)

from climate_learn.training import Trainer

trainer = Trainer(
    seed = 0,
    accelerator = "gpu",
    precision = 16,
    max_epochs = 1,
)

trainer.fit(model_module, data_module)

Environment

  • OS: Ubuntu 20.04
  • Python version: Python 3.9.7
  • Environment: all dependencies work otherwise

Additional context
When the climatology baseline is removed from test_step, the code works as expected.

Possible bug in visualizing samples

Describe the bug
climate_learn.utils.visualize() prints the image upside down and with a flipped colormap.

To Reproduce
I am merely running the demo notebook found here - https://colab.research.google.com/drive/1WiNEK1BHsiGzo_bT9Fcm8lea2H_ghNfa

Expected behavior
I believe the map plots produced by visualize() should have the same orientation and colormap as the stock images provided in the above notebook, i.e., northern hemisphere on top, and in the red-blue colormap, red should correspond to higher values and blue to lower.


Environment
N/A
Additional context
N/A
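In the meantime, a workaround sketch for re-plotting a field with the expected orientation and colormap (the (lat, lon) layout and ascending-latitude ordering are assumptions about the data):

import matplotlib.pyplot as plt
import numpy as np

field = np.random.rand(32, 64)  # stand-in for a 5.625-degree global field
plt.imshow(field, origin="lower", cmap="RdBu_r")  # north on top, red = high
plt.colorbar()
plt.show()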

Data folder under the climate-learn has missing docs

Describe the documentation issue
As the title says, the data folder has almost no documentation.

Describe the solution you'd like
Detailed documentation including but not limited to doc-strings, type-hinting, comments for complex pieces of code.

Add these datasets to the Hugging Face hub?

I am interested in using your datasets, but they are all too large to fit on my machine. One way around this would be to use the Hugging Face Datasets library, which allows streaming the dataset. However, it seems that to upload datasets to the Hub, you must first have the files downloaded locally. Is it possible for you to add these datasets to the Hub? Thanks in advance.
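For reference, this is what streaming would look like if these datasets were on the Hub (the repo id below is hypothetical):

from datasets import load_dataset

ds = load_dataset("some-org/era5-5.625deg", streaming=True, split="train")
for example in ds.take(2):  # nothing is downloaded beyond what is consumed
    print(example)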

Repetition in set_denormalization function

The set_denormalization function in climate_learn.models.modules.forecast.py has repeated lines.

In particular, this piece of code is repeated three times:
mean_mean_denorm, mean_std_denorm = -mean / std, 1 / std
self.mean_denormalize = transforms.Normalize(mean_mean_denorm, mean_std_denorm)

std_mean_denorm, std_std_denorm = np.zeros_like(std), 1 / std
self.std_denormalize = transforms.Normalize(std_mean_denorm, std_std_denorm)
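A possible deduplication, assuming the same numpy and torchvision.transforms imports as the surrounding module:

import numpy as np
from torchvision import transforms

def make_denormalizers(mean, std):
    # factor the thrice-repeated block into one helper
    mean_denorm = transforms.Normalize(-mean / std, 1 / std)
    std_denorm = transforms.Normalize(np.zeros_like(std), 1 / std)
    return mean_denorm, std_denorm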

Changing order of variables drastically affects model performance (DataModule)

Describe the bug
As mentioned in the title, changing the order of the variables listed in the argument to the data module has a significant effect on model performance for forecasting.

To Reproduce
Steps to reproduce the behavior:
Instantiate data module as follows:

data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{resolution}/",
    variables=['geopotential', 'u_component_of_wind', 'v_component_of_wind', 'temperature', 'specific_humidity', '2m_temperature'],
    years=years
)

forecasting_args = ForecastingArgs(
    dataset_args=data_args,
    in_vars=['geopotential', 'u_component_of_wind', 'v_component_of_wind', 'temperature', 'specific_humidity', '2m_temperature'],
    out_vars=["temperature_850", "geopotential_500", "2m_temperature"],
    pred_range=3*24
)

data_module_args = DataModuleArgs(
    task_args=forecasting_args,
    train_start_year=1979,
    val_start_year=2015,
    test_start_year=2017,
    end_year=2018
)

data_module = DataModule(
    data_module_args=data_module_args,
    batch_size=128,
    num_workers=1
)

The code snippet was taken from the Model_Training_Evaluation notebook, and the rest of the code is identical to the notebook. Due to the poor performance, I swapped the order of variables for data_args:

data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{resolution}/",
    variables=['temperature', 'geopotential', 'u_component_of_wind', 'v_component_of_wind', 'specific_humidity', '2m_temperature'],
    years=years
)

Expected behavior
The two experiments should (I think) show little difference in performance, but performance was drastically affected.


Additional context
Are my observations expected behavior? I am not sure whether the old DataModule is being completely abandoned in favor of IterDataModule; if so, I will remove this issue.

TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset'

Describe the bug
Got this error: TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset', even though the docs state that it is part of the parameters.

To Reproduce
Steps to reproduce the behavior:
I just ran this:

from climate_learn.utils.datetime import Year, Days, Hours
from climate_learn.data import DataModule

data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = "/content/drive/MyDrive/Climate/.climate_tutorial/data/weatherbench/era5/5.625/",
    in_vars = ["2m_temperature"],
    out_vars = ["2m_temperature"],
    train_start_year = Year(1979),
    val_start_year = Year(2015),
    test_start_year = Year(2017),
    end_year = Year(2018),
    pred_range = Days(3),
    subsample = Hours(6),
    batch_size = 128,
    num_workers = 1
)

Error traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-11-6ff523560500>](https://localhost:8080/#) in <cell line: 4>()
      2 from climate_learn.data import DataModule
      3 
----> 4 data_module = DataModule(
      5     dataset = "ERA5",
      6     task = "forecasting",

TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset'

Expected behavior
Code can run smoothly


Environment

  • OS: M1 Macbook Air
  • Python version: Python 3.10.11 (main, Apr 5 2023, 14:15:10) [GCC 9.4.0]
  • Environment:
Package                       Version
----------------------------- --------------------
absl-py                       1.4.0
aiohttp                       3.8.4
aiosignal                     1.3.1
alabaster                     0.7.13
albumentations                1.2.1
altair                        4.2.2
anyio                         3.6.2
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arviz                         0.15.1
astropy                       5.2.2
astunparse                    1.6.3
async-timeout                 4.0.2
attrs                         23.1.0
audioread                     3.0.0
autograd                      1.5
Babel                         2.12.1
backcall                      0.2.0
beautifulsoup4                4.11.2
bleach                        6.0.0
blis                          0.7.9
blosc2                        2.0.0
bokeh                         2.4.3
branca                        0.6.0
CacheControl                  0.12.11
cached-property               1.5.2
cachetools                    5.3.0
catalogue                     2.0.8
cdsapi                        0.6.1
certifi                       2022.12.7
cffi                          1.15.1
cftime                        1.6.2
chardet                       4.0.0
charset-normalizer            2.0.12
chex                          0.1.7
click                         8.1.3
climate-learn                 0.0.2
cloudpickle                   2.2.1
cmake                         3.25.2
cmdstanpy                     1.1.0
colorcet                      3.0.1
colorlover                    0.3.0
community                     1.0.0b1
confection                    0.0.4
cons                          0.4.5
contextlib2                   0.6.0.post1
contourpy                     1.0.7
convertdate                   2.4.0
cryptography                  40.0.2
cufflinks                     0.17.3
cupy-cuda11x                  11.0.0
cvxopt                        1.3.0
cvxpy                         1.3.1
cycler                        0.11.0
cymem                         2.0.7
Cython                        0.29.34
dask                          2022.12.1
datascience                   0.17.6
db-dtypes                     1.1.1
dbus-python                   1.2.16
debugpy                       1.6.6
decorator                     4.4.2
defusedxml                    0.7.1
distributed                   2022.12.1
dlib                          19.24.1
dm-tree                       0.1.8
docker-pycreds                0.4.0
docutils                      0.16
dopamine-rl                   4.0.6
duckdb                        0.7.1
earthengine-api               0.1.350
easydict                      1.10
ecos                          2.0.12
editdistance                  0.6.2
en-core-web-sm                3.5.0
entrypoints                   0.4
ephem                         4.1.4
et-xmlfile                    1.1.0
etils                         1.2.0
etuples                       0.3.8
exceptiongroup                1.1.1
fastai                        2.7.12
fastcore                      1.5.29
fastdownload                  0.0.7
fastjsonschema                2.16.3
fastprogress                  1.0.3
fastrlock                     0.8.1
filelock                      3.12.0
firebase-admin                5.3.0
Flask                         2.2.4
flatbuffers                   23.3.3
flax                          0.6.9
folium                        0.14.0
fonttools                     4.39.3
frozendict                    2.3.7
frozenlist                    1.3.3
fsspec                        2023.4.0
future                        0.18.3
gast                          0.4.0
GDAL                          3.3.2
gdown                         4.6.6
gensim                        4.3.1
geographiclib                 2.0
geopy                         2.3.0
gin-config                    0.5.0
gitdb                         4.0.10
GitPython                     3.1.31
glob2                         0.7
google                        2.0.3
google-api-core               2.11.0
google-api-python-client      2.84.0
google-auth                   2.17.3
google-auth-httplib2          0.1.0
google-auth-oauthlib          1.0.0
google-cloud-bigquery         3.9.0
google-cloud-bigquery-storage 2.19.1
google-cloud-core             2.3.2
google-cloud-datastore        2.15.1
google-cloud-firestore        2.11.0
google-cloud-language         2.9.1
google-cloud-storage          2.8.0
google-cloud-translate        3.11.1
google-colab                  1.0.0
google-crc32c                 1.5.0
google-pasta                  0.2.0
google-resumable-media        2.5.0
googleapis-common-protos      1.59.0
googledrivedownloader         0.4
graphviz                      0.20.1
greenlet                      2.0.2
grpcio                        1.54.0
grpcio-status                 1.48.2
gspread                       3.4.2
gspread-dataframe             3.0.8
gym                           0.25.2
gym-notices                   0.0.8
h5netcdf                      1.1.0
h5py                          3.8.0
hijri-converter               2.3.1
holidays                      0.23
holoviews                     1.15.4
html5lib                      1.1
httpimport                    1.3.0
httplib2                      0.21.0
huggingface-hub               0.14.1
humanize                      4.6.0
hyperopt                      0.2.7
idna                          3.4
imageio                       2.25.1
imageio-ffmpeg                0.4.8
imagesize                     1.4.1
imbalanced-learn              0.10.1
imgaug                        0.4.0
importlib-metadata            4.13.0
importlib-resources           5.12.0
imutils                       0.5.4
inflect                       6.0.4
iniconfig                     2.0.0
intel-openmp                  2023.1.0
ipykernel                     5.5.6
ipython                       7.34.0
ipython-genutils              0.2.0
ipython-sql                   0.4.1
ipywidgets                    7.7.1
itsdangerous                  2.1.2
jax                           0.4.8
jaxlib                        0.4.7+cuda11.cudnn86
jieba                         0.42.1
Jinja2                        3.1.2
joblib                        1.2.0
jsonpickle                    3.0.1
jsonschema                    4.3.3
jupyter-client                6.1.12
jupyter-console               6.1.0
jupyter_core                  5.3.0
jupyter-server                1.24.0
jupyterlab-pygments           0.2.2
jupyterlab-widgets            3.0.7
kaggle                        1.5.13
keras                         2.12.0
kiwisolver                    1.4.4
korean-lunar-calendar         0.3.1
langcodes                     3.3.0
lazy_loader                   0.2
libclang                      16.0.0
librosa                       0.10.0.post2
lightgbm                      3.3.5
lightning-utilities           0.8.0
lit                           16.0.2
llvmlite                      0.39.1
locket                        1.0.0
logical-unification           0.4.5
LunarCalendar                 0.0.9
lxml                          4.9.2
Markdown                      3.4.3
markdown-it-py                2.2.0
MarkupSafe                    2.1.2
matplotlib                    3.7.1
matplotlib-inline             0.1.6
matplotlib-venn               0.11.9
mdurl                         0.1.2
miniKanren                    1.0.3
missingno                     0.5.2
mistune                       0.8.4
mizani                        0.8.1
mkl                           2019.0
ml-dtypes                     0.1.0
mlxtend                       0.14.0
more-itertools                9.1.0
moviepy                       1.0.3
mpmath                        1.3.0
msgpack                       1.0.5
multidict                     6.0.4
multipledispatch              0.6.0
multitasking                  0.0.11
murmurhash                    1.0.9
music21                       8.1.0
natsort                       8.3.1
nbclient                      0.7.4
nbconvert                     6.5.4
nbformat                      5.8.0
nest-asyncio                  1.5.6
netCDF4                       1.6.3
networkx                      3.1
nibabel                       3.0.2
nltk                          3.8.1
notebook                      6.4.8
numba                         0.56.4
numexpr                       2.8.4
numpy                         1.22.4
oauth2client                  4.1.3
oauthlib                      3.2.2
opencv-contrib-python         4.7.0.72
opencv-python                 4.7.0.72
opencv-python-headless        4.7.0.72
openpyxl                      3.0.10
opt-einsum                    3.3.0
optax                         0.1.5
orbax-checkpoint              0.2.1
osqp                          0.6.2.post8
packaging                     23.1
palettable                    3.3.3
pandas                        1.5.3
pandas-datareader             0.10.0
pandas-gbq                    0.17.9
pandocfilters                 1.5.0
panel                         0.14.4
param                         1.13.0
parso                         0.8.3
partd                         1.4.0
pathlib                       1.0.1
pathtools                     0.1.2
pathy                         0.10.1
patsy                         0.5.3
pep517                        0.13.0
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        8.4.0
pip                           23.0.1
pip-tools                     6.6.2
platformdirs                  3.3.0
plotly                        5.13.1
plotnine                      0.10.1
pluggy                        1.0.0
polars                        0.17.3
pooch                         1.6.0
portpicker                    1.3.9
prefetch-generator            1.0.3
preshed                       3.0.8
prettytable                   0.7.2
proglog                       0.1.10
progressbar2                  4.2.0
prometheus-client             0.16.0
promise                       2.3
prompt-toolkit                3.0.38
prophet                       1.1.2
proto-plus                    1.22.2
protobuf                      3.20.3
psutil                        5.9.5
psycopg2                      2.9.6
ptyprocess                    0.7.0
py-cpuinfo                    9.0.0
py4j                          0.10.9.7
pyarrow                       9.0.0
pyasn1                        0.5.0
pyasn1-modules                0.3.0
pycocotools                   2.0.6
pycparser                     2.21
pyct                          0.5.0
pydantic                      1.10.7
pydata-google-auth            1.7.0
pydot                         1.4.2
pydot-ng                      2.0.0
pydotplus                     2.0.2
PyDrive                       1.3.1
pyerfa                        2.0.0.3
pygame                        2.3.0
Pygments                      2.14.0
PyGObject                     3.36.0
pymc                          5.1.2
PyMeeus                       0.5.12
pymystem3                     0.2.0
PyOpenGL                      3.1.6
pyparsing                     3.0.9
pyrsistent                    0.19.3
PySocks                       1.7.1
pytensor                      2.10.1
pytest                        7.2.2
python-apt                    0.0.0
python-dateutil               2.8.2
python-louvain                0.16
python-slugify                8.0.1
python-utils                  3.5.2
pytorch-lightning             2.0.2
pytz                          2022.7.1
pytz-deprecation-shim         0.1.0.post0
pyviz-comms                   2.2.1
PyWavelets                    1.4.1
PyYAML                        6.0
pyzmq                         23.2.1
qdldl                         0.1.7
qudida                        0.0.4
regex                         2022.10.31
requests                      2.27.1
requests-oauthlib             1.3.1
requests-unixsocket           0.2.0
rich                          13.3.4
rpy2                          3.5.5
rsa                           4.9
scikit-image                  0.19.3
scikit-learn                  1.2.2
scipy                         1.10.1
scs                           3.2.3
seaborn                       0.12.2
Send2Trash                    1.8.0
sentry-sdk                    1.22.1
setproctitle                  1.3.2
setuptools                    67.7.2
shapely                       2.0.1
six                           1.16.0
sklearn-pandas                2.2.0
smart-open                    6.3.0
smmap                         5.0.0
sniffio                       1.3.0
snowballstemmer               2.2.0
sortedcontainers              2.4.0
soundfile                     0.12.1
soupsieve                     2.4.1
soxr                          0.3.5
spacy                         3.5.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.4
Sphinx                        3.5.4
sphinxcontrib-applehelp       1.0.4
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        2.0.1
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.5
SQLAlchemy                    2.0.10
sqlparse                      0.4.4
srsly                         2.4.6
statsmodels                   0.13.5
sympy                         1.11.1
tables                        3.8.0
tabulate                      0.8.10
tblib                         1.7.0
tenacity                      8.2.2
tensorboard                   2.12.2
tensorboard-data-server       0.7.0
tensorboard-plugin-wit        1.8.1
tensorflow                    2.12.0
tensorflow-datasets           4.8.3
tensorflow-estimator          2.12.0
tensorflow-gcs-config         2.12.0
tensorflow-hub                0.13.0
tensorflow-io-gcs-filesystem  0.32.0
tensorflow-metadata           1.13.1
tensorflow-probability        0.19.0
tensorstore                   0.1.36
termcolor                     2.3.0
terminado                     0.17.1
text-unidecode                1.3
textblob                      0.17.1
tf-slim                       1.1.0
thinc                         8.1.9
threadpoolctl                 3.1.0
tifffile                      2023.4.12
timm                          0.6.13
tinycss2                      1.2.1
toml                          0.10.2
tomli                         2.0.1
toolz                         0.12.0
torch                         2.0.0+cu118
torchaudio                    2.0.1+cu118
torchdata                     0.6.0
torchmetrics                  0.11.4
torchsummary                  1.5.1
torchtext                     0.15.1
torchvision                   0.15.1+cu118
tornado                       6.2
tqdm                          4.65.0
traitlets                     5.7.1
triton                        2.0.0
tweepy                        4.13.0
typer                         0.7.0
typing_extensions             4.5.0
tzdata                        2023.3
tzlocal                       4.3
uritemplate                   4.1.1
urllib3                       1.26.15
vega-datasets                 0.9.0
wandb                         0.15.2
wasabi                        1.1.1
wcwidth                       0.2.6
webcolors                     1.13
webencodings                  0.5.1
websocket-client              1.5.1
Werkzeug                      2.3.0
wheel                         0.40.0
widgetsnbextension            3.6.4
wordcloud                     1.8.2.2
wrapt                         1.14.1
xarray                        2022.12.0
xarray-einstats               0.5.1
xgboost                       1.7.5
xlrd                          2.0.1
yarl                          1.9.2
yellowbrick                   1.5
yfinance                      0.2.18
zict                          3.0.0
zipp                          3.15.0

Support for IterableDataset

Hi,

Currently, the ERA5 class inherits from torch.utils.data.Dataset.

For tasks involving many input and output variables, it becomes impossible to load them all into RAM at once. Hence, I was wondering whether there could be support for IterableDataset, which would allow loading only part of the data into RAM at a time.
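
A minimal sketch of what such support could look like, streaming one file at a time (the file layout and variable handling here are assumptions, not the library's actual internals):

import torch
import xarray as xr
from torch.utils.data import IterableDataset

class ERA5Iterable(IterableDataset):
    """Streams ERA5 samples file by file instead of loading all years into RAM."""

    def __init__(self, nc_paths, in_vars):
        self.nc_paths = nc_paths  # e.g., one NetCDF file per year (assumed layout)
        self.in_vars = in_vars

    def __iter__(self):
        for path in self.nc_paths:
            # Only the current file is resident in memory at any time.
            with xr.open_dataset(path) as ds:
                arr = ds[self.in_vars].to_array().values  # (var, time, lat, lon)
                for t in range(arr.shape[1]):
                    yield torch.from_numpy(arr[:, t].copy())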

Forecasting doesn't implement subsampling

The Forecasting class takes subsample as an argument, but ignores it when building inp_data and out_data.
The other way to implement this would be to keep the entire inp_data and out_data and put the subsampling logic in the indexing.

The latter seems straightforward but wastes some extra space. The former would require some effort to handle cases where window and pred_range are not factors of subsample.
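
A rough sketch of the latter (index-based) approach; the names are illustrative, not the actual Forecasting internals:

class SubsampledForecasting:
    """Keeps the full arrays but exposes only every subsample-th starting point."""

    def __init__(self, inp_data, out_data, pred_range, subsample):
        self.inp_data = inp_data
        self.out_data = out_data
        self.pred_range = pred_range
        self.subsample = subsample

    def __len__(self):
        # Only starting points whose target still lies inside the data are valid.
        return (len(self.inp_data) - self.pred_range) // self.subsample

    def __getitem__(self, index):
        # Map the subsampled index back onto the full hourly time axis.
        t = index * self.subsample
        return self.inp_data[t], self.out_data[t + self.pred_range]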

Prefetching for __iter__() in ShardDataset

Is your feature request related to a problem? Please describe.
Currently, ShardDataset loads a chunk and only loads the next one after the current one is exhausted. It would be great to hide this latency by implementing some form of prefetching.

Describe the solution you'd like
I am not sure how best to achieve this, but would love to hear others' thoughts.
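
One possible approach, sketched with a background thread and a bounded queue (load_shard is a stand-in for whatever ShardDataset uses internally to read a shard):

import queue
import threading

def prefetched_shards(shard_paths, load_shard, buffer_size=1):
    # Loads the next shard in a background thread while the
    # current one is being consumed.
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for path in shard_paths:
            q.put(load_shard(path))  # blocks once the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (shard := q.get()) is not sentinel:
        yield from shard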

map_dataset.setup() keeps crashing

Describe the bug
I am trying to run the extreme-events script at src/climate_learn/data/processing/era5_extreme.py, but the map_dataset.setup() call keeps crashing: it uses more than 12 GB of RAM and then crashes.

Please suggest a solution.
Thank you in advance!

Environment

  • OS: Ubuntu
  • Environment: conda (I also tried it on Colab Pro.)

Separation of constants and input variables in the data handling

Is your feature request related to a problem? Please describe.
Currently, the ERA5 module takes the input variables and then, based on the contents of root_dir, treats each one as either a variable or a constant.

Describe the solution you'd like
The ERA5 init arguments should accept constants explicitly and assert that the input variables are not constants.
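
A sketch of the proposed signature (the argument names are hypothetical):

class ERA5:
    def __init__(self, root_dir, variables, constants=(), **kwargs):
        # Variables and constants are declared explicitly rather
        # than inferred from the contents of root_dir.
        overlap = set(variables) & set(constants)
        assert not overlap, f"{overlap} passed as both variable and constant"
        self.root_dir = root_dir
        self.variables = list(variables)
        self.constants = list(constants)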

Save predictions as an `nc` file at test time

Is your feature request related to a problem? Please describe.
The feature request is motivated by the following problems:

  • I was testing climate-learn downscaling against my own models and wanted to see whether there is a pattern in the errors, e.g., whether predictions are more erroneous during certain hours of the day. trainer.test(model, dm) provides a nice summary of metrics but does not save the predictions along with lat, lon, and time.
  • It currently seems difficult to visualize predictions from different models side by side for a particular date.

Describe the solution you'd like
A possible way could be to use a flow similar to the cl.utils.visualize_at_index function.

def visualize_at_index(mm, dm, in_transform, out_transform, variable, src, index=0):
    lat, lon = dm.get_lat_lon()
    extent = [lon.min(), lon.max(), lat.min(), lat.max()]

There could be a function cl.utils.save_nc which may look like:

def save_nc(mm, dm, in_transform, out_transform, variable, src, save_dir):
    ...

This function could save the predictions in exactly the same format as the data nc files, with lat, lon, and time coordinates. It should be able to retrieve lat, lon, and time from the data module dm.
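
A minimal sketch of how the body could assemble the output with xarray, assuming the predictions arrive as a (time, lat, lon) array (how they are obtained from mm and dm, and the get_test_times helper, are hypothetical):

import xarray as xr

def save_nc(preds, dm, variable, save_dir):
    # Write predictions to NetCDF with the same coordinates as the data files.
    lat, lon = dm.get_lat_lon()
    ds = xr.Dataset(
        {variable: (("time", "lat", "lon"), preds)},
        # get_test_times() is hypothetical; the time coordinate would
        # need to come from the data module in some form.
        coords={"time": dm.get_test_times(), "lat": lat, "lon": lon},
    )
    ds.to_netcdf(f"{save_dir}/{variable}_predictions.nc")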

Willingness to work on a PR
I'll be happy to work on a PR to make this happen!

Additional context
This feature may also be useful to climate researchers who want to produce time-lapse videos of predictions with additional geolayers using other libraries, similar to these examples in the geemap library.


Statistical Downscaling of other ERA5 Variables

I am attempting to downscale the ERA5 sea surface temperature variable, following along with your tutorial at NeurIPS 2022 CCAI. I noticed that you used 5-degree and 2-degree resolutions for 2m_temperature; why is this done? It is not very clear. For sea surface temperature I have data at 0.25-degree resolution; do I need a coarser resolution to make this code work for my chosen variable?

downscaling script

Bug Description

I am encountering an issue when attempting to replicate the results for a downscaling problem. During execution, I get the following error:

TypeError: load_model_module() got an unexpected keyword argument 'preset'

This error occurs in the code found at this link:

nearest = cl.load_downscaling_module(data_module=dm, preset="nearest-interpolation")

When I replace the argument "preset" with "architecture", the test produces no results, outputting an empty array: [{}].

Could you provide guidance on how to accurately reproduce the results from the paper?

Error when downloading the high-res data

Hello, I get the following error when trying to download the 2.8125-degree resolution geopotential_500 data with the following code:

cl.data.download_weatherbench(
    f"{root_directory}/geopotential",
    dataset="era5",
    variable="geopotential_500",
    resolution=2.8125 
)

The error message is:

climate_learn/data/download.py:85, in download_weatherbench(dst, dataset, variable, resolution)
     83         file.write(chunk)
     84 if ext == ".zip":
---> 85     with ZipFile(local_fn) as myzip:
     86         myzip.extractall(dst)
     87     os.unlink(local_fn)

BadZipFile: File is not a zip file

Could you help me figure out the problem? Thanks!
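
In case it helps with debugging: BadZipFile often means the server returned an error page instead of the archive. A small check before extracting would make that failure clearer (a sketch, not the actual download.py code):

import zipfile

def extract_if_zip(local_fn, dst):
    # A BadZipFile here usually means the download failed and the file
    # is an HTML error page rather than the expected archive.
    if not zipfile.is_zipfile(local_fn):
        raise RuntimeError(
            f"{local_fn} is not a valid zip archive; "
            "the download may have failed or the URL may be wrong."
        )
    with zipfile.ZipFile(local_fn) as myzip:
        myzip.extractall(dst)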
