tum-di-lab-graph-scaling / ocp
This project forked from fair-chem/fairchem
https://opencatalystproject.org/
License: MIT License
During training with ZeRO stage 3 enabled in the DeepSpeed config, the following warning and error occur:
[WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch_geometric.data.batch.DataBatch'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Traceback (most recent call last):
File "/home/dstoll/ocp/main.py", line 126, in <module>
Runner()(config)
File "/home/dstoll/ocp/main.py", line 66, in __call__
self.task.run()
File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 56, in run
raise e
File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dstoll/ocp/ocpmodels/trainers/forces_trainer.py", line 329, in train
self._backward(loss)
File "/home/dstoll/ocp/ocpmodels/trainers/base_trainer.py", line 716, in _backward
self.model.backward(loss)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1726, in backward
self.optimizer.backward(loss)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2536, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: The expanded size of the tensor (256) must match the existing size (0) at non-singleton dimension 1. Target sizes: [73085, 256]. Tensor sizes: [0]
I tried to get DeepSpeed running on the s2ef task with the cgcnn model (using my latest commit on the deepspeed
branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimization) works.
However, using the following DeepSpeed config file:
{
"train_batch_size": 32,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0005
}
},
"fp16": {
"enabled": true
},
"zero_optimization": false
}
where the fp16 optimization is enabled, and running the job on one GPU for now as follows:
(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json
results in the following error:
Traceback (most recent call last):
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
Runner()(config)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
self.task.run()
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 56, in run
raise e
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 330, in train
out = self._forward(batch)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 432, in _forward
out_energy, out_forces = self.model(batch_list)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1616, in forward
loss = self.module(*inputs, **kwargs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/common/data_parallel.py", line 59, in forward
return self.module(batch_list[0].to(f"cuda:{self.device_ids[0]}"))
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 165, in forward
energy = self._forward(data)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/common/utils.py", line 121, in cls_method
return f(self, *args, **kwargs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 154, in _forward
mol_feats = self._convolve(data)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 185, in _convolve
node_feats = self.embedding_fc(data.x)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Half but found Float
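A possible reading of this error, offered as an assumption rather than a confirmed diagnosis: with fp16 enabled the model weights are held in half precision, while the node features in the batch (`data.x`) are still float32 when they reach `embedding_fc`. The sketch below shows one way to cast only the floating-point tensors of a batch to half precision and leave integer tensors such as edge indices alone. The helper name `cast_floating_tensors` and the plain dict standing in for the graph batch are hypothetical and not part of the ocp code base.

```python
import torch


def cast_floating_tensors(tensors, dtype=torch.float16):
    """Return a copy of `tensors` with floating-point entries cast to `dtype`.

    Integer tensors (for example edge indices) are returned unchanged.
    """
    return {
        name: value.to(dtype) if value.is_floating_point() else value
        for name, value in tensors.items()
    }


# Hypothetical stand-in for the attributes of a graph batch.
batch = {
    "x": torch.rand(5, 8),                         # float32 node features
    "edge_index": torch.tensor([[0, 1], [1, 2]]),  # int64 connectivity
}
half_batch = cast_floating_tensors(batch)
print({name: value.dtype for name, value in half_batch.items()})
```

Whether the cast belongs in the trainer, the data loader, or the model's forward pass is a design choice this sketch leaves open.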
I tried to get DeepSpeed running on the s2ef task with the cgcnn model (using my latest commit on the deepspeed
branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimization) works.
However, using the following DeepSpeed config file:
{
"train_batch_size": 32,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0005
}
},
"fp16": {
"enabled": false
},
"zero_optimization": true
}
where the ZeRO optimization is enabled, and running the job on one GPU for now as follows:
(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json
results in the following error:
Traceback (most recent call last):
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
Runner()(config)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
self.task.run()
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
self._backward(loss)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
self.optimizer.step()
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1660, in step
self.check_overflow()
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1919, in check_overflow
self._check_overflow(partition_gradients)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1820, in _check_overflow
self.overflow = self.has_overflow(partition_gradients)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1839, in has_overflow
overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1832, in has_overflow_partitioned_grads_serial
for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
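The config in this report sets `zero_optimization` to a bare boolean, whereas the DeepSpeed configuration documentation describes it as an object with a `stage` key. The sketch below generates the same settings in that object form. The helper `build_ds_config` and its parameters are hypothetical, and the stage value 2 is an assumption: the report does not say which stage was intended, only that the traceback passes through `stage_1_and_2.py`. This is not claimed to resolve the `KeyError` above; it only makes the selected stage explicit.

```python
import json


def build_ds_config(zero_stage=0, fp16=False, micro_batch_per_gpu=32,
                    grad_accum_steps=1, world_size=1):
    """Build a DeepSpeed config dict mirroring the settings in the report."""
    return {
        # DeepSpeed expects: micro batch per GPU * accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "optimizer": {"type": "Adam", "params": {"lr": 0.0005}},
        "fp16": {"enabled": fp16},
        # Object form with an explicit stage instead of a bare boolean.
        "zero_optimization": {"stage": zero_stage},
    }


config = build_ds_config(zero_stage=2)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved as the file passed to `--deepspeed-config`.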