trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    gpus=6,
    precision=32,
    # accelerator="tpu", devices=8
)
trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    precision=32,
    devices=6,
    accelerator="gpu",
)
This change was needed because I have a newer version of PyTorch than the one the code was originally developed on.
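For reference, the rename between the two Trainer calls above can be sketched as a small helper. This is only an illustration: `migrate_trainer_kwargs` is a hypothetical name, not part of PyTorch Lightning, and it assumes `gpus` is a positive integer and the only legacy argument involved.

```python
def migrate_trainer_kwargs(kwargs):
    """Return a copy of Trainer kwargs with legacy `gpus=N` rewritten.

    Hypothetical helper for illustration; it only handles the `gpus` argument
    and assumes its value is a positive integer.
    """
    new_kwargs = dict(kwargs)  # leave the caller's dict untouched
    if "gpus" in new_kwargs:
        gpus = new_kwargs.pop("gpus")
        # Newer Lightning releases take accelerator/devices instead of gpus.
        new_kwargs.setdefault("accelerator", "gpu")
        new_kwargs.setdefault("devices", gpus)
    return new_kwargs


old_kwargs = {"max_epochs": 1000, "gpus": 6, "precision": 32}
new_kwargs = migrate_trainer_kwargs(old_kwargs)
print(new_kwargs)
```

The resulting dictionary could then be passed as `Trainer(**new_kwargs)` together with the logger and callbacks.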
(venv) ➜ skillful_nowcasting git:(main) ✗ python3 run.py
wandb: Currently logged in as: rutkovskii (nowcasting-research). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.12 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.11
wandb: Run data is saved locally in /home/arutkovskii_umass_edu/skillful_nowcasting/wandb/run-20231008_213620-edi3updl
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run glowing-salad-28
wandb: ⭐️ View project at https://wandb.ai/nowcasting-research/dgmr
wandb: 🚀 View run at https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py:398: UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:617: UserWarning: Checkpoint directory /home/arutkovskii_umass_edu/skillful_nowcasting exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
| Name | Type | Params
----------------------------------------------------------------
0 | discriminator_loss | NowcastingLoss | 0
1 | grid_regularizer | GridCellLoss | 0
2 | conditioning_stack | ContextConditioningStack | 4.2 M
3 | latent_stack | LatentConditioningStack | 7.2 M
4 | sampler | Sampler | 42.1 M
5 | generator | Generator | 53.6 M
6 | discriminator | Discriminator | 44.7 M
----------------------------------------------------------------
98.3 M Trainable params
0 Non-trainable params
98.3 M Total params
393.086 Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
Sanity Checking: 0it [00:00, ?it/s]2023-10-08 21:36:26.395303: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-08 21:36:28.016597: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-08 21:36:32.347232: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Repo card metadata block was not found. Setting CardData to empty.
Too many dataloader workers: 6 (max is dataset.n_shards=1). Stopping 5 dataloader workers.
2023-10-08 21:36:39.281432: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
2023-10-08 21:36:39.481086: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-08 21:36:39.481129: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: gpu006
2023-10-08 21:36:39.481138: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: gpu006
2023-10-08 21:36:39.481189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 525.125.6
2023-10-08 21:36:39.481214: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 525.125.6
2023-10-08 21:36:39.481224: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:309] kernel version seems to match DSO: 525.125.6
Traceback (most recent call last):
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
self._run_sanity_check()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
val_loop.run()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 361, in _evaluation_step
batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 270, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 332, in _apply_batch_transfer_handler
batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 321, in _call_batch_hook
return trainer_method(trainer, hook_name, *args)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 146, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 571, in transfer_batch_to_device
return move_data_to_device(batch, device)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/apply_func.py", line 101, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 80, in apply_to_collection
v = apply_to_collection(
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 51, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/apply_func.py", line 95, in batch_to
data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 239, in <module>
trainer.fit(model, datamodule)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 67, in _call_and_handle_interrupt
trainer._teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1003, in _teardown
self.strategy.teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 498, in teardown
self.lightning_module.cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
return super().cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in cpu
return self._apply(lambda t: t.cpu())
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 857, in _apply
self._buffers[key] = fn(buf)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run glowing-salad-28 at: https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
wandb: ️⚡ View job at https://wandb.ai/nowcasting-research/dgmr/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwMzkwMzkwNA==/version_details/v6
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231008_213620-edi3updl/logs
(venv) ➜ skillful_nowcasting git:(main) ✗
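The error message itself suggests passing CUDA_LAUNCH_BLOCKING=1, since the reported stack trace may not point at the failing kernel. A minimal sketch of doing that from inside the script, assuming these lines are placed at the very top of run.py before torch or pytorch_lightning is imported:

```python
import os

# Must be set before anything in this process initializes CUDA,
# so keep it above the torch / pytorch_lightning imports.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable on the command line (CUDA_LAUNCH_BLOCKING=1 python3 run.py) should have the same effect. This only makes the traceback more trustworthy; it is not a fix for the underlying failure.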
If you know, please tell me the potential source of the error in the second comment and, if you can, please provide a requirements.txt
that includes the dependency versions from the working code. Which version of CUDA would that be?
torch==2.1.0
antialiased-cnns==0.3
pytorch-msssim==1.0.0
numpy==1.24.3
torchvision==0.16.0
pytorch-lightning==2.0.9.post0
einops==0.7.0
huggingface-hub==0.17.3
tensorflow==2.13.1
CUDA version: 12.1, the version bundled with PyTorch, since PyTorch ships with its own CUDA and cuDNN binaries.
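To compare environments, the versions listed above can be read directly from the active virtual environment. The following is a rough sketch, not part of the repository: `installed_versions` is a hypothetical helper, and the package list is just a subset of the names above.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = ["torch", "torchvision", "pytorch-lightning", "tensorflow", "numpy"]
report = installed_versions(packages)
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")

try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    # torch.version.cuda is the CUDA version the wheel was built against
    # (None for CPU-only builds); it is not the driver version.
    print("torch CUDA build:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    print("CUDA available:", torch.cuda.is_available())
```

Running this in both the working and the failing environment would show which pinned versions differ.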