Hello there.
I've been trying to run the training for a few hours. My specs are:
Nvidia RTX 2070 (8 GB VRAM), 32 GB of RAM, and a Ryzen 3700X.
Fedora 36 with the proprietary Nvidia driver (510.68.02, CUDA 11.6).
start training
Traceback (most recent call last):
File "/home/djouze/dev/HydraNet-WikiSQL/main.py", line 70, in <module>
cur_loss = model.train_on_batch(batch)
File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 41, in train_on_batch
batch_loss = torch.mean(self.model(**batch)["loss"])
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/dev/HydraNet-WikiSQL/modeling/torch_model.py", line 114, in forward
bert_output, pooled_output = self.base_model(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 815, in forward
encoder_outputs = self.encoder(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 508, in forward
layer_outputs = layer_module(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 395, in forward
self_attention_outputs = self.attention(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 323, in forward
self_outputs = self.self(
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 253, in forward
attention_probs = self.dropout(attention_probs)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/home/djouze/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 1279, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 7.79 GiB total capacity; 6.05 GiB already allocated; 158.19 MiB free; 6.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
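Following the hint in the error message, I also tried setting `max_split_size_mb` through `PYTORCH_CUDA_ALLOC_CONF` before CUDA is initialized (the 128 MiB value below is just a starting point, not a recommendation from the project):

```python
import os

# Must be set before torch initializes CUDA (i.e. before the first
# CUDA allocation); limits allocator block splitting to reduce
# fragmentation. 128 MiB is an arbitrary starting value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

The same can be done with `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the shell before launching `main.py`, but it didn't change the outcome for me.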
My last attempt used the Docker image: after training started, my system memory consumption climbed to 31 GB, and the container seems to have crashed without producing any output.
Could you help me figure out why this is happening? I assume the training can run on a single-GPU setup like mine, although it would presumably take longer (correct me if I'm wrong).
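If the issue is simply that the default batch doesn't fit in 8 GB, one workaround I'm considering is gradient accumulation: split each batch into micro-batches and only step the optimizer after accumulating their gradients, keeping the effective batch size the same. A minimal sketch in plain PyTorch (the model, sizes, and `accum_steps` below are made up for illustration, not taken from HydraNet's code):

```python
import torch

# Toy stand-ins for the real model and batch.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4  # assumption: 4 micro-batches per optimizer step

full_x = torch.randn(32, 16)  # one "full" batch that would OOM on the GPU
full_y = torch.randn(32, 1)

optimizer.zero_grad()
for x, y in zip(full_x.chunk(accum_steps), full_y.chunk(accum_steps)):
    # Scale the loss so the accumulated gradient matches the full batch.
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
optimizer.step()
```

Is something like this (or just lowering the batch size) the intended way to run this repo on a single 8 GB GPU?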