
Implementation of DeepMind's Deep Generative Model of Radar (DGMR) https://arxiv.org/abs/2104.00954

License: MIT License


skillful_nowcasting's Introduction

Skillful Nowcasting with Deep Generative Model of Radar (DGMR)


Implementation of DeepMind's Skillful Nowcasting GAN, the Deep Generative Model of Radar (DGMR) (https://arxiv.org/abs/2104.00954), in PyTorch Lightning.

This implementation matches the pseudocode released by DeepMind as closely as possible. Each of the components (Sampler, Context conditioning stack, Latent conditioning stack, Discriminator, and Generator) is a normal PyTorch module. As the model training is a bit complicated, the overall architecture is wrapped in PyTorch Lightning.

The default parameters match what is written in the paper.
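
For reference, a minimal sketch of instantiating the model with those defaults. The values below are restated from the hyperparameter dump that appears in the issues further down, so treat them as assumptions rather than guarantees of the current API:

from dgmr import DGMR

# Paper defaults, as echoed by the config printout shown in the issues below.
model = DGMR(
    forecast_steps=18,     # 18 x 5-minute radar steps = 90 minutes ahead
    input_channels=1,
    output_shape=256,      # 256x256 crops
    latent_channels=768,
    context_channels=384,
    num_samples=6,         # latent samples drawn per input
    gen_lr=5e-5,
    disc_lr=2e-4,
    grid_lambda=20.0,
)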

Installation

Clone the repository, then run

pip install -r requirements.txt
pip install -e .

Alternatively, you can install it with pip install dgmr.

Training Data

The open-sourced UK training dataset has been mirrored to HuggingFace Datasets! This should enable training the original architecture on the original data and reproducing the results from the paper. The full dataset is roughly 1TB in size, and unfortunately streaming it from HF Datasets doesn't seem to work, so it has to be cached locally. We have also added the sample dataset, which can be streamed directly from GCP at no cost.

The dataset can be loaded with

from datasets import load_dataset

dataset = load_dataset("openclimatefix/nimrod-uk-1km")

For now, only the sample dataset supports streaming, as its data files are hosted on GCP rather than HF, so it can be used with:

from datasets import load_dataset

dataset = load_dataset("openclimatefix/nimrod-uk-1km", "sample", streaming=True)

The authors also used US MRMS precipitation radar data as another comparison. While that exact dataset was not released, the MRMS data is publicly available, and we have made it available on HuggingFace Datasets as well. This dataset contains the raw 3500x7000 contiguous-US MRMS data from 2016 through May 2022, is a few hundred GB in size, and will receive sporadic updates with more recent data. It is stored in Zarr format and can be streamed without caching locally through

from datasets import load_dataset

dataset = load_dataset("openclimatefix/mrms", "default_sequence", streaming=True)

This streams the data with 24 timesteps per example, just like the UK DGMR dataset. To get individual MRMS frames instead of a sequence, use:

from datasets import load_dataset

dataset = load_dataset("openclimatefix/mrms", "default", streaming=True)

Pretrained Weights

Pretrained weights are available through HuggingFace Hub; the current weights were trained on the sample dataset. The whole DGMR model, or its individual components, can be loaded as follows:

from dgmr import DGMR, Sampler, Generator, Discriminator, LatentConditioningStack, ContextConditioningStack
model = DGMR.from_pretrained("openclimatefix/dgmr")
sampler = Sampler.from_pretrained("openclimatefix/dgmr-sampler")
discriminator = Discriminator.from_pretrained("openclimatefix/dgmr-discriminator")
latent_stack = LatentConditioningStack.from_pretrained("openclimatefix/dgmr-latent-conditioning-stack")
context_stack = ContextConditioningStack.from_pretrained("openclimatefix/dgmr-context-conditioning-stack")
generator = Generator(conditioning_stack=context_stack, latent_stack=latent_stack, sampler=sampler)

Example Usage

from dgmr import DGMR
import torch.nn.functional as F
import torch

model = DGMR(
    forecast_steps=4,      # number of future frames to predict
    input_channels=1,      # radar fields have a single channel
    output_shape=128,      # spatial size of the input/output frames
    latent_channels=384,
    context_channels=192,
    num_samples=3,         # latent samples drawn per input
)
x = torch.rand((2, 4, 1, 128, 128))  # (batch, time, channels, height, width)
out = model(x)
y = torch.rand((2, 4, 1, 128, 128))  # dummy target frames
loss = F.mse_loss(y, out)
loss.backward()

Citation

@article{ravuri2021skillful,
  author={Suman Ravuri and Karel Lenc and Matthew Willson and Dmitry Kangin and Remi Lam and Piotr Mirowski and Megan Fitzsimons and Maria Athanassiadou and Sheleem Kashem and Sam Madge and Rachel Prudden and Amol Mandhane and Aidan Clark and Andrew Brock and Karen Simonyan and Raia Hadsell and Niall Robinson and Ellen Clancy and Alberto Arribas and Shakir Mohamed},
  title={Skillful Precipitation Nowcasting using Deep Generative Models of Radar},
  journal={Nature},
  volume={597},
  pages={672--677},
  year={2021}
}

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Jacob Bieker 💻
Johan Mathe 💻
Z1YUE 🐛
Nan.Y 💬
Taisanai 💬
cameron 💬
zhrli 💬
Najeeb Kazmi 💬
TQRTQ 💬
Viktor Bordiuzha 💡
agijsberts 💻
Mews ⚠️

This project follows the all-contributors specification. Contributions of any kind welcome!


skillful_nowcasting's Issues

Wrong dimension to split when training Discriminator

It seems that there are some bugs when training the discriminator.

The scores of real inputs and generated inputs are split from "concatenated_outputs" along dimension 1:

score_real, score_generated = torch.split(concatenated_outputs, 1, dim=1)

However, the discriminator concatenates the spatial loss and temporal loss along dimension 1 to form "concatenated_outputs":
return torch.cat([spatial_loss, temporal_loss], dim=1)

Therefore, "score_real" and "score_generated" actually represent the spatial loss and temporal loss instead of what you desire.
I will appreciate it if you can tell me your opinion :)
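
A minimal sketch of the direction this issue points in, under the assumption that real and generated sequences are stacked along the batch dimension before the discriminator: the scores should then be split along dim 0, since dim 1 holds the (spatial, temporal) pair. Shapes here are illustrative, not taken from the repo:

import torch

batch_size = 2
# Assumed discriminator output for [real; generated] stacked along dim 0:
# shape [2 * batch_size, 2, 1], where dim 1 = (spatial score, temporal score).
concatenated_outputs = torch.randn(2 * batch_size, 2, 1)

# Split along the batch dimension (dim 0), not dim 1:
score_real, score_generated = torch.split(concatenated_outputs, batch_size, dim=0)
assert score_real.shape == score_generated.shape == (batch_size, 2, 1)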

Pre-Train Model

This GAN is a bit tricky to train, and so far I've had trouble getting it to learn much; the loss just seems to stay very high. This could be because of the more limited compute I've trained it on compared to DeepMind, but a publicly available pre-trained model would be very helpful.

Match Paper implementation closer

Detailed Description

The paper's diagram of the model is this:

[Figure: Nowcasting_Diagram, the model architecture diagram from the paper]

The implementation in this repo has three known differences.

  1. There is only a single G block between each ConvGRU.
  2. The Zsp input channels have to match the input channels from the conditioning stack, instead of the conditioning stack having half the channels of the latent space.
  3. The number of channels from the 3x3 conditioning stack outputs does not exactly match the paper; the implementation changes the number of channels to match the diagram.

Context

While the current implementation is quite close and almost entirely follows the paper, these differences could be important when training the model.

Sampler gives NaN values

Describe the bug
The Sampler gives NaN values when being used in the generator.

To Reproduce
Steps to reproduce the behavior:
The unit tests show it happening in test_generator and test_sampler.

Expected behavior
No NaN values should exist in the output

Additional context
From the unit tests, none of the individual layers in the Sampler return NaNs, so it is not clear why this happens. It usually seems to occur after the g3 or ConvGRU3 layers.
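
A small debugging sketch (not part of this repo) for narrowing this kind of bug down: register a forward hook on every submodule of the model under test and report where NaNs first appear.

import torch

def register_nan_hooks(model: torch.nn.Module) -> None:
    # Print the class name of any submodule whose output contains NaNs.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            print(f"NaNs appear after {module.__class__.__name__}")
    for module in model.modules():
        module.register_forward_hook(hook)

# Usage sketch: register_nan_hooks(sampler); then run the failing forward pass.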

Trying to execute run.py in train folder renders an error

Hi,
Really great and helpful code!
I was trying to run train.py on the nimrod-uk-1km-test data and encountered the following error: "RuntimeError: Serialization of parametrized modules is only supported through state_dict()." I searched PyTorch's issue tracker and found an earlier commit, so I downgraded torch to v1.12.0, but the error did not go away.
Torch Link: pytorch/pytorch#69413

Can you guys help in debugging this issue? I am planning to use this on another dataset


To Reproduce

Steps to reproduce the behavior:

  1. Install dependencies
  2. Execute train/run.py; the above error shows in the terminal

Add Pre-trained weights from DeepMind and upload to HuggingFace

Detailed Description

DeepMind just sorta released the model here: https://github.com/deepmind/deepmind-research/tree/master/nowcasting
The model definition code still doesn't seem to be available, but we can possibly reverse-engineer it somewhat from the TF-Hub link, see what differences there might be, and work out how to adapt the TF model weights to PyTorch.

Context

This would give us a precipitation nowcasting model, which could be quite helpful for forecasting, as well as a pre-trained model to finetune on cloud predictions, or possibly solar predictions directly.


Problems with running `./train/run.py` and Concerns with dependency versions

Describe the bug

I am trying to run the ./train/run.py.
I have several issues:

  1. I had to update the
trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    gpus=6,
    precision=32,
    # accelerator="tpu", devices=8
)

To

trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    precision=32,
    devices=6,
    accelerator="gpu",
)

This was due to having a newer version of PyTorch than the one the code was originally developed on.

  2. I keep encountering this issue, which I feel comes from a mismatch between the dependency versions on which DGMR was developed:
    I can assume that there's an issue with transferring a batch to the GPU device (I am not sure). Let me know if you have any suggestions on how I can verify this.
(venv) ➜  skillful_nowcasting git:(main) ✗ python3 run.py                                                          
wandb: Currently logged in as: rutkovskii (nowcasting-research). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.12 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.11
wandb: Run data is saved locally in /home/arutkovskii_umass_edu/skillful_nowcasting/wandb/run-20231008_213620-edi3updl
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run glowing-salad-28
wandb: ⭐️ View project at https://wandb.ai/nowcasting-research/dgmr
wandb: 🚀 View run at https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py:398: UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:617: UserWarning: Checkpoint directory /home/arutkovskii_umass_edu/skillful_nowcasting exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name               | Type                     | Params
----------------------------------------------------------------
0 | discriminator_loss | NowcastingLoss           | 0     
1 | grid_regularizer   | GridCellLoss             | 0     
2 | conditioning_stack | ContextConditioningStack | 4.2 M 
3 | latent_stack       | LatentConditioningStack  | 7.2 M 
4 | sampler            | Sampler                  | 42.1 M
5 | generator          | Generator                | 53.6 M
6 | discriminator      | Discriminator            | 44.7 M
----------------------------------------------------------------
98.3 M    Trainable params
0         Non-trainable params
98.3 M    Total params
393.086   Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
Sanity Checking: 0it [00:00, ?it/s]2023-10-08 21:36:26.395303: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-08 21:36:28.016597: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-08 21:36:32.347232: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Repo card metadata block was not found. Setting CardData to empty.
Too many dataloader workers: 6 (max is dataset.n_shards=1). Stopping 5 dataloader workers.
2023-10-08 21:36:39.281432: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
2023-10-08 21:36:39.481086: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-08 21:36:39.481129: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: gpu006
2023-10-08 21:36:39.481138: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: gpu006
2023-10-08 21:36:39.481189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 525.125.6
2023-10-08 21:36:39.481214: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 525.125.6
2023-10-08 21:36:39.481224: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:309] kernel version seems to match DSO: 525.125.6
Traceback (most recent call last):
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 361, in _evaluation_step
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 270, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 332, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 321, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 146, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 571, in transfer_batch_to_device
    return move_data_to_device(batch, device)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/apply_func.py", line 101, in move_data_to_device
    return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 80, in apply_to_collection
    v = apply_to_collection(
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 51, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/apply_func.py", line 95, in batch_to
    data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 239, in <module>
    trainer.fit(model, datamodule)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 67, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1003, in _teardown
    self.strategy.teardown()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 498, in teardown
    self.lightning_module.cpu()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
    return super().cpu()
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 857, in _apply
    self._buffers[key] = fn(buf)
  File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run glowing-salad-28 at: https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
wandb: ️⚡ View job at https://wandb.ai/nowcasting-research/dgmr/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwMzkwMzkwNA==/version_details/v6
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231008_213620-edi3updl/logs
(venv) ➜  skillful_nowcasting git:(main) ✗ 

  3. In order to run ./train/run.py, should I move run.py into the root of the repository?

  4. It looks like TensorFlow is used inside HuggingFace's datasets library, in nimrod-uk-1km.py. Can you also indicate which version of TF is functional?

If you know, please tell me the potential source of the error in the second item above, and, if you can, please provide a requirements.txt that includes the dependency versions from a working setup. What would be the right version of CUDA?

My dependencies (based on the ones in requirements.txt):

torch==2.1.0
antialiased-cnns==0.3
pytorch-msssim==1.0.0
numpy==1.24.3
torchvision==0.16.0
pytorch-lightning==2.0.9.post0
einops==0.7.0
huggingface-hub==0.17.3
tensorflow==2.13.1

CUDA version: 12.1, as bundled with PyTorch, since PyTorch ships with its own CUDA and cuDNN binaries.

Training is fine, but i can not load the trained model

Describe the bug
As I run this code, errors are reported:
import torch
from dgmr import DGMR, Sampler, Generator, Discriminator, LatentConditioningStack, ContextConditioningStack

model = DGMR(
    forecast_steps=20,
    input_channels=1,
    output_shape=256,
    latent_channels=384,
    context_channels=192,
    num_samples=3,
)

model.sampler = Sampler.from_pretrained("openclimatefix/dgmr-sampler")
model.sampler.forecast_steps = 20
model.sampler.output_shape = 256
model.discriminator = Discriminator.from_pretrained("openclimatefix/dgmr-discriminator")
model.latent_stack = LatentConditioningStack.from_pretrained("openclimatefix/dgmr-latent-conditioning-stack")
model.context_stack = ContextConditioningStack.from_pretrained("openclimatefix/dgmr-context-conditioning-stack")
model.generator = Generator(conditioning_stack=model.context_stack, latent_stack=model.latent_stack, sampler=model.sampler)
model=DGMR.load_from_checkpoint(checkpoint_path='best-v3.ckpt', strict=False)
print(model.config)

To Reproduce
Steps to reproduce the behavior:
RuntimeError: Error(s) in loading state_dict for DGMR:
size mismatch for latent_stack.l_block1.conv_1x1.weight: copying a param with shape torch.Size([16, 8, 1, 1]) from checkpoint, the shape in current model is torch.Size([4, 8, 1, 1]).
...
...
...
Process finished with exit code 1

Expected behavior
Training is fine, but I don't know why I cannot load the trained model.

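A hedged workaround sketch, not a confirmed fix: PyTorch Lightning's load_from_checkpoint forwards extra keyword arguments to the model constructor, so passing the same hyperparameters the checkpoint was actually trained with should avoid the size mismatch. The latent_channels value below is an assumption for illustration; use whatever the checkpoint was trained with.

from dgmr import DGMR

model = DGMR.load_from_checkpoint(
    "best-v3.ckpt",
    forecast_steps=20,
    output_shape=256,
    latent_channels=768,   # assumption: must match the trained checkpoint
    context_channels=384,
)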

Loading BOM data into ram

Hello,

I am trying to train this model on Australia's BOM Radar Data, however, I am having trouble loading the data into memory.

I have one year's worth of data in netCDF4 format at 5-minute time steps. Each time step is a separate NC file. The file structure to access the precipitation field for 01/01/2022 at 12:30pm would be: BOM Rain Rate Data 2022 (folder) > 20220101 (folder) > 20220101_123000.nc (the precipitation field is stored as an array of int64 values in mm/h under a variable called 'rain_rate' in the netCDF file). I have tried the netCDF4 and xarray libraries for Python and receive an OOM error.

The problem is that if I were to load all available 2022 data (~300 days), it would require approximately 180 GB of RAM, which I do not have. netCDF must compress the data, as the 2022 data is only ~5 GB on disk.

How would I go about efficiently loading all this data and passing it into the DGMR?

Thanks for your help.
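
One possible approach, sketched under the assumptions stated in the message above (the folder layout, the 'rain_rate' variable name, and xarray with dask installed): open the files lazily so only the requested slices are ever read into RAM.

import xarray as xr

# Lazily open a year of 5-minute netCDF files; nothing is loaded until a
# slice is requested. Requires dask for the chunked, out-of-core arrays.
ds = xr.open_mfdataset(
    "BOM Rain Rate Data 2022/*/*.nc",
    combine="nested",
    concat_dim="time",
    chunks={"time": 24},  # roughly one DGMR-style sequence per chunk
)
# Only these 24 timesteps are actually read from disk:
sample = ds["rain_rate"].isel(time=slice(0, 24)).values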

Lack of application of discriminators in both dgmr.py and pseudocode/train.txt

Hi, I am also interested in the original paper, and I'm glad to see the recent great progress in this repo thanks to the uploaded pseudocode.

From what I read, I'm afraid that both pseudocode/train.txt (from DeepMind) and dgmr.py (in this repo) fail to apply the discriminators before calculating the generator loss.

  • train.txt
def train():
  
  # ...

  gen_sequences = [tf.concat([batch_inputs, x], axis=1) for x in gen_samples]
  # HERE: loss_hinge_gen should accept Tensor containing discriminator scores instead of precipitation themselves.
  gen_disc_loss = loss_hinge_gen(tf.concat(gen_sequences, axis=0))
  • dgmr.py
    def training_step(self, batch, batch_idx):

        # ..

        generated_sequence = [torch.cat([images, x], dim=2) for x in predictions]
        # HERE
        generator_disc_loss = loss_hinge_gen(torch.cat(generated_sequence, dim=0))

Because loss_hinge_gen() applies a whole-Tensor mean internally, the resulting Tensor still appears to have a correct shape, which hides the problem.
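
A minimal runnable sketch of the ordering this issue argues for, with toy stand-ins (toy_discriminator, the tensor shapes, and a simplified loss_hinge_gen) for the real objects in dgmr.py:

import torch

def loss_hinge_gen(score_generated):
    # Hinge generator loss on discriminator scores: -mean(score).
    return -torch.mean(score_generated)

def toy_discriminator(x):
    # Stand-in for the real Discriminator: one scalar score per sequence.
    return x.mean(dim=(1, 2, 3, 4))

images = torch.rand(2, 4, 1, 32, 32)                            # context frames
predictions = [torch.rand(2, 18, 1, 32, 32) for _ in range(2)]  # generated samples

generated_sequence = [torch.cat([images, x], dim=1) for x in predictions]
# The fix the issue suggests: score the sequences with the discriminator
# first, then apply the hinge loss to the scores, not to the fields themselves.
scores = toy_discriminator(torch.cat(generated_sequence, dim=0))
generator_disc_loss = loss_hinge_gen(scores)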

Thanks.

How to load the old model to new

About half a year ago, I trained this model on an A100. Now the default model seems to have fewer parameters to train. My question is: I have checked the generator code and there is no change at all, yet when I load the old model into the new code, it seems different.
I list the differences between the two models below.

Is there any suggestion for transferring to the new model efficiently?

old one:
0 | discriminator_loss | NowcastingLoss | 0
1 | grid_regularizer | GridCellLoss | 0
2 | conditioning_stack | ContextConditioningStack | 4.2 M
3 | latent_stack | LatentConditioningStack | 7.2 M
4 | sampler | Sampler | 42.1 M
5 | generator | Generator | 53.6 M
6 | discriminator | Discriminator | 44.7 M

98.3 M Trainable params
0 Non-trainable params
98.3 M Total params
393.086 Total estimated model params size (MB)

new one:

| Name | Type | Params

0 | discriminator_loss | NowcastingLoss | 0
1 | grid_regularizer | GridCellLoss | 0
2 | conditioning_stack | ContextConditioningStack | 1.1 M
3 | latent_stack | LatentConditioningStack | 1.8 M
4 | sampler | Sampler | 10.5 M
5 | generator | Generator | 13.4 M
6 | discriminator | Discriminator | 44.7 M

58.1 M Trainable params
0 Non-trainable params
58.1 M Total params
232.417 Total estimated model params size (MB)



A problem occurs when I run distributed training

Epoch 0: 0%| | 0/9057 [00:00<?, ?it/s]Traceback (most recent call last):
File "./train/run.py", line 294, in
trainer.fit(model, datamodule)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 90, in advance
outputs = self.manual_loop.run(split_batch, batch_idx)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/manual_loop.py", line 115, in advance
training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
return self.model.training_step(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/dgmr/dgmr.py", line 140, in training_step
self.manual_backward(discriminator_loss)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1351, in manual_backward
self.trainer.strategy.backward(loss, None, None, *args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, *args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1370, in backward
loss.backward(*args, **kwargs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/torch/autograd/init.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Question about model prediction results

Thank you very much for your work. I normalized the data in the preprocessing stage and denormalized the model predictions to get the real values. I trained for 500 epochs, but I still cannot eliminate the grid lines that appear in the model's predictions, as shown in the picture below. I wonder if you have any relevant experience.
[Image: DGMR_predict, a prediction showing grid-line artifacts]

I appreciate your attention and assistance.

Loading model using DGMR.from_pretrained("openclimatefix/dgmr") fails

Describe the bug

When loading the pretrained model from HuggingFace as per the instructions in README.md, I get a KeyError: 'config' exception. The issue is that when following the instructions, **model_kwargs will be empty in NowcastingModelHubMixin._from_pretrained, but it then attempts to pass **model_kwargs['config'] to the class constructor in

model = cls(**model_kwargs["config"])

To Reproduce

>>> from dgmr import DGMR
>>> model = DGMR.from_pretrained("openclimatefix/dgmr")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/venv/dgmr/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/me/venv/dgmr/lib/python3.11/site-packages/huggingface_hub/hub_mixin.py", line 420, in from_pretrained
    instance = cls._from_pretrained(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/venv/dgmr/git/skillful_nowcasting/dgmr/hub.py", line 154, in _from_pretrained
    model = cls(**model_kwargs["config"])
                  ~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'config'

Expected behavior

I expect to have a variable model with the pretrained DGMR model.

Additional context

One can trivially work around the problem by passing an empty config argument when loading the pretrained model:

model = DGMR.from_pretrained("openclimatefix/dgmr", config={})

so maybe this is just a matter of updating README.md.

Still, I'd argue that the cleaner fix would be to pass the entire **model_kwargs to the class constructor, thus

model = cls(**model_kwargs)

That would also coincide with how HuggingFace themselves do it in https://github.com/huggingface/huggingface_hub/blob/855ee997202e003b80bafd3c02dac65146c89cc4/src/huggingface_hub/hub_mixin.py#L633.


pretrained model prediction result

Hi, I just used the pretrained model you provided to evaluate on the sample dataset, but the result doesn't seem to be a reasonable prediction. I also tried to train the model myself, but didn't get good predictions either. Has anyone used the pretrained model to produce good results, or successfully trained the model to do so?

Issue with README instructions

Could you correct the following part of the Pre-trained Weights in the README.md? (openclimagefix -> openclimatefix)

discriminator = Discriminator.from_pretrained("openclimagefix/dgmr-discriminator")

Best regards

some confusions

1. Should the original radar data be normalized to [0, 1]? The paper says the radar data was transformed to rain rate; should the rain rate then be normalized to [0, 1] or [-1, 1]?
2. The output channels of the conditioning stack are not consistent with the code; for example, for the last layer it is 384 vs 768, and the code does not work when the channel count is 384.
3. The predicted radar frames are very similar to the last input frame, but without any cloud movement. I don't know why; it seems the ConvGRU does not work.

Training on other dataset + Error on using run.py

Describe the bug
Hi! Thank you very much for providing this implementation of dgmr. The model and blocks look very organized and straightforward!
However, I have encountered some issues running your code, mostly due to the complexity of the "run.py" code, as it is very complicated to understand the logic (most likely because I do not understand what the dataset looks like).
1: Could you please explain a little bit about how you preprocess the dataset? I hope to run the model on my own dataset, so I need to prepare it to match your preprocessing. (By the way, if I want to visualize any of the rainfall data frames, what should I do?)
2: I have encountered an error when using run.py. The problem is exactly the same as in the following issue (#32 (comment)). I have changed the number of GPUs to 1, and the problem remains. I could not find a solution in the previous issue, as the conversation looks a bit confusing. Do I have to manually download something from the GCP bucket to my machine? If so, what should I download and how should I use it?

To Reproduce
python run.py

Expected behavior
The same error as in #32 (comment) appears.

Thank you again for your great work.

Different image resolutions train/evaluate

Hi,

thank you for the nice repo!

I have a question regarding the image dimensions. In a talk about the paper, I heard that it is possible to train on smaller crops (256x256) and then use larger image resolutions at test time (such as the entire UK/US domain). How does this work in practice? How would I train the model on smaller images and then produce large maps once trained?

Thank you! :)
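
A sketch of the idea, assuming the stacks are fully convolutional so parameter shapes do not depend on the spatial size; whether a model trained this way stays skillful at the larger resolution is exactly the question being asked here:

from dgmr import DGMR

# Train on small crops...
train_model = DGMR(forecast_steps=18, input_channels=1, output_shape=256)
# ... train train_model on 256x256 crops here ...

# ...then load the same weights into a model configured for larger fields.
# Assumption: state_dicts match because parameter shapes depend only on
# channel counts, not on the spatial size.
eval_model = DGMR(forecast_steps=18, input_channels=1, output_shape=512)
eval_model.load_state_dict(train_model.state_dict())
eval_model.eval()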

Training

Hello,
Your implementation looks amazing. Could you please let me know how many iterations (generator steps in total) were required for the model to fully converge, and how that translates to wall-clock time on your specific hardware (how many days)?

try to run train.py but get the bug

Describe the bug

"FileNotFoundError: Couldn't find a dataset script at D:\CODES\skillful_nowcasting-main\train\openclimatefix\nimrod-uk-1km\nimrod-uk-1km.py or any data file in the same directory. Couldn't find 'openclimatefix/nimrod-uk-1km' on the Hugging Face Hub either: FileNotFoundError: Dataset 'openclimatefix/nimrod-uk-1km' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login"


Add carbon costs and model card to HF

Detailed Description

HF has support for recording the carbon cost of training a model in the model card. A model card still needs to be generated for this model, and it should definitely include the carbon costs too.

Context

OCF works on lowering carbon emissions, so adding this information to our public models helps keep us transparent.


how to train using multiple gpus?

Hi, dear authors, thank you for sharing your cool model. I am trying to train the model on my own datasets to figure out how well it works.
The problem is that although it can run with 1 GPU (a single A100), the GPU memory soon fills up and training goes very slowly. So I changed to 2 GPUs, but an error is reported and I cannot fix it. Could you please give me some suggestions on how to solve it?

Thanks a lot

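A minimal sketch of a multi-GPU Trainer configuration for recent PyTorch Lightning versions; the model and datamodule stand in for the objects built in train/run.py. Note that DGMR uses manual optimization, so the DDP quirks reported in the distributed-training issue above may still apply.

from dgmr import DGMR
from pytorch_lightning import Trainer

model = DGMR()  # defaults; replace with your configuration
trainer = Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    precision=32,
)
# With the datamodule (or dataloaders) built as in train/run.py:
# trainer.fit(model, datamodule)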

GPU-utils rate: 0~50%

I have run this code after changing some parameters.

1. Setting the channel numbers smaller:

context_channels =384//4
forecast_steps = 20
forenum = 20
input_channels = 3 #use my dataset

2. Hardware parameters

CPU 9cores Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
GPU 3090*1 memory 256GB
PyTorch 1.10.0
Python 3.8
Cuda 11.3

3. Training parameters

    traindataset = MyDataset(datapath=opt.datapath,dataname= "./train/TestAname.npy",presize=opt.prenum,foresize=opt.forenum,inputchannel=opt.input_channels)        
    tran_loader = DataLoader(traindataset, batch_size=opt.batchsize, num_workers=9,pin_memory=True,shuffle=True,prefetch_factor=8,persistent_workers=True)

4. Issue

The GPU utilization rate stays between 0% and 50%. When I use torch.profiler.profile to analyze the model, the output information is in the attached analyze.md. Any good advice, please?

pretrained model prediction result

Hello, I recently used the pretrained model (openclimatefix/dgmr-discriminator) to evaluate the sample dataset. However, the resulting images appear to be grayscale, unlike those depicted in the paper. Below are the output images. How can I adjust the result to match the expected appearance shown in the paper, like the colorful ones?
[Images: predicted_frame_2 and predicted_frame_3]

Dimension issue when training

Describe the bug
Dimensions seem incompatible while training.

To Reproduce
Here is a small code for a random train/val dataloader:

from pytorch_lightning import Trainer
import dgmr.dgmr as dgmr
import torch

class DS(torch.utils.data.Dataset):
    def __init__(self, bs=32):
        self.ds = torch.rand((bs, 24, 1, 256, 256))

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        return (self.ds[idx, 0:4, :, :], self.ds[idx, 4:22, :, :])


train_loader = torch.utils.data.DataLoader(DS(), batch_size=2)
val_loader = torch.utils.data.DataLoader(DS(), batch_size=2)

trainer = Trainer(gpus=1)
model = dgmr.DGMR()

trainer.fit(model, train_loader, val_loader)

Expected behavior
I would expect the training to start

Additional context
This is the output I'm getting:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
"beta1":            0.0
"beta2":            0.999
"context_channels": 384
"conv_type":        standard
"disc_lr":          0.0002
"forecast_steps":   18
"gen_lr":           5e-05
"grid_lambda":      20.0
"input_channels":   1
"latent_channels":  768
"num_samples":      6
"output_shape":     256
"visualize":        False
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:120: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name               | Type                     | Params
----------------------------------------------------------------
0 | discriminator_loss | NowcastingLoss           | 0     
1 | grid_regularizer   | GridCellLoss             | 0     
2 | conditioning_stack | ContextConditioningStack | 4.2 M 
3 | latent_stack       | LatentConditioningStack  | 7.2 M 
4 | sampler            | Sampler                  | 42.1 M
5 | generator          | Generator                | 53.6 M
6 | discriminator      | Discriminator            | 44.7 M
----------------------------------------------------------------
98.3 M    Trainable params
0         Non-trainable params
98.3 M    Total params
393.086   Total estimated model params size (MB)
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py:133: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py:433: UserWarning: The number of training samples (16) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
Epoch 0:   0%|                                                                                                                                                                                                            | 0/16 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:1806: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
torch.Size([2, 4, 1, 256, 256])
torch.Size([2, 18, 1, 256, 256])
Traceback (most recent call last):
  File "train.py", line 22, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 90, in advance
    outputs = self.manual_loop.run(split_batch, batch_idx)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/manual_loop.py", line 111, in advance
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/ubuntu/skillful_nowcasting/dgmr/dgmr.py", line 130, in training_step
    generated_sequence = torch.cat([images, predictions], dim=2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 4 but got size 18 for tensor number 1 in the list.

What also gave me pause is the following comment:

# Cat along time dimension [B, C, T, H, W]

While the encoder seems to expect (B, T, C, H, W), at least that is what I am seeing in the test file.
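
For what it's worth, a small sketch of reconciling the two layouts mentioned here, under the assumption that the tests are right and the encoder wants (B, T, C, H, W):

import torch

# If a batch arrives as [B, C, T, H, W] but the conditioning stack expects
# [B, T, C, H, W] (as the tests suggest), a permute reconciles the two.
x = torch.rand(2, 1, 4, 256, 256)            # [B, C, T, H, W]
x = x.permute(0, 2, 1, 3, 4).contiguous()    # -> [B, T, C, H, W]
assert x.shape == (2, 4, 1, 256, 256)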

TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op'

Describe the bug
Hi,
Thank you for sharing your implementation of DGMR. I'm new to deep learning, but I'm very interested in it and am learning to use it in atmospheric science.
When I run run.py under the train directory, I get the following message:
...
...
98.3 M Trainable params
0 Non-trainable params
98.3 M Total params
393.086 Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s]2022-07-21 01:24:47.641350: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641881: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641954: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.644718: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.646172: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.656873: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Traceback (most recent call last):
File "run.py", line 205, in
trainer.fit(model, datamodule)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in next
return self.fetching_function()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in next
data = self._next_data()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/_utils.py", line 454, in reraise
raise self.exc_type(message=msg)
TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op'
...
...
Please see the attached run.log file for full log message.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the train directory;
  2. Edit run.py. Since I'm using a CPU, I changed the accelerator to "cpu":

trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    # gpus=6,
    precision=32,
    accelerator="cpu",
)

  3. Run "python run.py"

Expected behavior
I'm not sure whether I have done this the right way to train the model on the radar data from the paper, or how to use multiple CPUs. The README.md makes it very clear how to install the model and run it in a simple way. It would be nice to have a very small sample of train/val/test radar data with the code, or a link to download such data manually, since that would be very helpful for seeing what the data really looks like and understanding the model.

Additional context
I attached the entire log file "run.log" and the packages I used just in case.
run.log
pip_list.txt

Big GPU footprint while training

Describe the bug
Not really a bug per se, more of a question/clarification request. If there is a better avenue (a Discord server?) for discussing these issues, I apologize for using the wrong channel.

To Reproduce
Feel free to run this test:
https://github.com/openclimatefix/skillful_nowcasting/blob/main/tests/test_model.py#L305

Training takes about 40 GB of GPU memory with a batch size of 1. I had to upgrade to an A100 to be able to run it decently.

Just curious if this is expected behavior or if you recommend another approach.
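
Not an answer from the authors, but a hedged sketch of the obvious levers if the footprint is too large at the defaults: memory grows with forecast_steps and with num_samples (six latent samples per input at the paper defaults), so shrinking them is the first thing to try before changing hardware.

from dgmr import DGMR

# A sketch, not a verified recommendation; values are illustrative.
model = DGMR(
    forecast_steps=8,   # down from the default 18
    num_samples=3,      # down from the default 6
    output_shape=256,
)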

Issue with README instructions

Describe the bug
README instructions fail when initializing generator

To Reproduce

In [1]: from dgmr import DGMR, Sampler, Generator, Discriminator, LatentConditioningStack, ContextConditioningStack
   ...: model = DGMR().from_pretrained("openclimatefix/dgmr")
   ...: sampler = Sampler().from_pretrained("openclimatefix/dgmr-sampler")
   ...: generator = Generator().from_pretrained("openclimatefix/dgmr-generator")
   ...: discriminator = Discriminator().from_pretrained("openclimagefix/dgmr-discriminator")
   ...: latent_stack = LatentConditioningStack().from_pretrained("openclimatefix/dgmr-latent-conditioning-stack")
   ...: context_stack = ContextConditioningStack().from_pretrained("openclimatefix/dgmr-context-conditioning-stack")
"beta1":            0.0
"beta2":            0.999
"context_channels": 384
"conv_type":        standard
"disc_lr":          0.0002
"forecast_steps":   18
"gen_lr":           5e-05
"grid_lambda":      20.0
"input_channels":   1
"latent_channels":  768
"num_samples":      6
"output_shape":     256
"visualize":        False
"beta1":            0.0
"beta2":            0.999
"context_channels": 384
"conv_type":        standard
"disc_lr":          0.0002
"forecast_steps":   18
"gen_lr":           5e-05
"grid_lambda":      20.0
"input_channels":   1
"latent_channels":  768
"num_samples":      6
"output_shape":     256
"visualize":        False
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-05b2ce154150> in <module>
      2 model = DGMR().from_pretrained("openclimatefix/dgmr")
      3 sampler = Sampler().from_pretrained("openclimatefix/dgmr-sampler")
----> 4 generator = Generator().from_pretrained("openclimatefix/dgmr-generator")
      5 discriminator = Discriminator().from_pretrained("openclimagefix/dgmr-discriminator")
      6 latent_stack = LatentConditioningStack().from_pretrained("openclimatefix/dgmr-latent-conditioning-stack")

TypeError: __init__() missing 3 required positional arguments: 'conditioning_stack', 'latent_stack', and 'sampler'

Expected behavior
I would expect the models to load.

Additional context
N/A

Fix CI testing

The current CI tests don't work. Would someone be able to fix this?
