I was training a LLaVA model using DeepSpeed ZeRO-3. What I want to do is continually tra


[BUG] Zero3 causes AttributeError: 'NoneType' object has no attribute 'numel' in continual training [CLOSED]

thkimYonsei commented on August 24, 2024


Comments (4)

Xirid commented on August 24, 2024

I have the same issue and worked around it by reloading the whole model on every iteration.
However, I now get an OOM error because memory is not freed after the second epoch, but I guess that is a different issue.
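The workaround described above (building a fresh model for every training stage instead of reusing one) could be sketched roughly as follows. This is only an illustration of the loop structure: `train_stages`, `build_model`, and `train_fn` are hypothetical names that do not come from the thread, and the sketch does not touch DeepSpeed itself.

```python
from typing import Any, Callable, List, Sequence


def train_stages(
    datasets: Sequence[Any],
    build_model: Callable[[int], Any],
    train_fn: Callable[[Any, Any], Any],
) -> List[Any]:
    """Run one training stage per dataset, rebuilding the model each time.

    build_model(stage) is a hypothetical factory; in practice it might load
    the checkpoint saved by the previous stage. train_fn(model, dataset) is a
    hypothetical callable that would wrap trainer construction and training.
    """
    results = []
    for stage, dataset in enumerate(datasets):
        # Build a fresh model so no state from the previous stage is reused.
        model = build_model(stage)
        results.append(train_fn(model, dataset))
        # Drop the local reference so the old model can be garbage collected.
        del model
    return results
```

Dropping the local reference does not by itself guarantee that GPU memory is returned, which may be related to the OOM mentioned above.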


tjruwase commented on August 24, 2024

@Xirid, please try the following API to free engine memory:
https://deepspeed.readthedocs.io/en/latest/zero3.html#gpu-memory-management
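A minimal sketch of how that call might be wrapped between training iterations, assuming the engine object exposes `empty_partition_cache()` as described in the linked documentation. The helper name `release_engine` is hypothetical, and the thread does not confirm that this resolves the `AttributeError`.

```python
import gc


def release_engine(engine) -> bool:
    """Try to release memory held by a ZeRO-3 engine (hypothetical helper).

    Returns True if engine.empty_partition_cache() was available and called.
    The caller still has to drop its own references to the engine and model.
    """
    called = False
    if hasattr(engine, "empty_partition_cache"):
        # Free GPU memory consumed by model parameters.
        engine.empty_partition_cache()
        called = True

    # Collect unreachable Python objects before asking CUDA to release memory.
    gc.collect()

    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA cache to clear.
        return called

    if torch.cuda.is_available():
        # Return cached, unused blocks from PyTorch's allocator to the driver.
        torch.cuda.empty_cache()
    return called
```

A caller would typically run `release_engine(ds_engine)` after the training loop and then delete its own `ds_engine` and model variables before starting the next iteration.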


apToll commented on August 24, 2024

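I ran into this as well: inside a for loop I create a new dataset and a new trainer, then call trainer.train().
In the first iteration of the loop, training works fine. In the second iteration, however, it fails with: AttributeError: 'NoneType' object has no attribute 'numel'
I tried freeing GPU memory, but it did not help. This is how I freed it:
# Free GPU memory consumed by model parameters
ds_engine.empty_partition_cache()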


tjruwase commented on August 24, 2024

Closing this issue due to lack of response. Please reopen if needed.

