Comments (10)

wangyuxinwhy commented on July 29, 2024

There are currently two solutions.

Option 1: use FSDP instead of DDP, with the following accelerate config:

# single-node-fsdp.yml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: NO_PREFETCH
  fsdp_offload_params: false
  fsdp_sharding_strategy: 3
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: BertLayer
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Run the training:

accelerate launch --config_file single-node-fsdp.yml finetune_jsonl.py

Option 2: keep using DDP but pass the find_unused_parameters argument (this one is a bit more involved).

You first need to update to the latest uniem code (this capability was added after seeing this issue):

git clone https://github.com/wangyuxinwhy/uniem.git
cd uniem
pip install -e .

Modify the finetune_jsonl.py script:

import pandas as pd
from accelerate import DistributedDataParallelKwargs
from uniem.finetuner import FineTuner

# Read the jsonl file
df = pd.read_json('example_data/riddle.jsonl', lines=True)
# Rename the columns to the names uniem expects
df = df.rename(columns={'instruction': 'text', 'output': 'text_pos'})
# Use m3e-small as the model to fine-tune
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=df.to_dict('records'))
finetuner.run(
    epochs=1,
    output_dir='finetuned-model-riddle',
    accelerator_kwargs={
        # Forwarded to accelerate so DDP is created with find_unused_parameters=True
        'kwargs_handlers': [DistributedDataParallelKwargs(find_unused_parameters=True)]
    }
)

Run the training:

accelerate launch --config_file single-node-ddp.yml finetune_jsonl.py
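For context, here is a minimal sketch of what that accelerator_kwargs option boils down to on the accelerate side, assuming uniem simply forwards these kwargs to accelerate.Accelerator (both names below are part of the accelerate API, not uniem's):

# Minimal sketch, not uniem's actual internals: the accelerator_kwargs above
# are assumed to be forwarded into an Accelerator constructed like this.
from accelerate import Accelerator, DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# With find_unused_parameters=True, DDP tolerates parameters that receive no
# gradient in a forward pass, which is what triggers the original error.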

I recommend the FSDP solution.

qianzhang2018 commented on July 29, 2024

Thank you for the reply and the solutions!
I tried both of them, and each seems to run into problems.
Option 1:
It fails immediately with:
ValueError: Expected embedder.encoder.encoder.layer.0 to NOT be FullyShardedDataParallel if using an auto_wrap_policy
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 27652 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27651) of binary: /home/aipf/work/zq_vec/env/bin/python
Option 2:
After updating your library, running accelerate launch finetune_jsonl.py on the riddle data goes through fine and the model is saved.
I then swapped the example data for my own (built the same way as riddle.jsonl), which is fairly large. Running it the same way, it first searches for the most suitable batch size; I could see both GPUs' memory fill up, but the training progress bar never appeared, as if it had hung. After I pressed Ctrl+C it printed these warnings:
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26566 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26567 closing signal SIGINT
^CWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26567 closing signal SIGTERM

qianzhang2018 commented on July 29, 2024

Option 2 works fine once I manually specify a smaller batch_size.
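For reference, a minimal sketch of what that looks like, assuming FineTuner.run accepts a batch_size argument so the automatic batch-size search is skipped (the value 8 is just an example):

import pandas as pd
from accelerate import DistributedDataParallelKwargs
from uniem.finetuner import FineTuner

df = pd.read_json('example_data/riddle.jsonl', lines=True)
df = df.rename(columns={'instruction': 'text', 'output': 'text_pos'})

finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=df.to_dict('records'))
finetuner.run(
    epochs=1,
    batch_size=8,  # assumption: an explicit batch size bypasses the auto batch-size search
    output_dir='finetuned-model-riddle',
    accelerator_kwargs={
        'kwargs_handlers': [DistributedDataParallelKwargs(find_unused_parameters=True)]
    },
)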

wangyuxinwhy commented on July 29, 2024

Thanks for the feedback. I tested Option 1 myself and it works, so I'm not sure what exactly is going wrong on your side. For Option 2, auto_batch may have a bug that makes it incompatible with DDP; I'll need to look into how to fix it.

qianzhang2018 commented on July 29, 2024

Option 1 is fine on a small dataset, but still has problems on a large one.
The issue is again the batch_size: Option 1 also works once I specify the batch_size myself.

Thanks again for your open-source code and the solutions!

wangyuxinwhy commented on July 29, 2024

Got it, thanks for the feedback.

windar427 commented on July 29, 2024

Using the FSDP finetuner:
TypeError: FineTuner.run() got an unexpected keyword argument 'accelerator_kwargs'

wangyuxinwhy commented on July 29, 2024

Using the FSDP finetuner: TypeError: FineTuner.run() got an unexpected keyword argument 'accelerator_kwargs'

If you're using FSDP, you can simply remove the accelerator_kwargs argument.
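For example, the FSDP variant of the script from earlier in this thread would then look roughly like this (a sketch with accelerator_kwargs dropped; everything else is unchanged):

import pandas as pd
from uniem.finetuner import FineTuner

df = pd.read_json('example_data/riddle.jsonl', lines=True)
df = df.rename(columns={'instruction': 'text', 'output': 'text_pos'})

# No accelerator_kwargs here: find_unused_parameters is a DDP-specific knob,
# so with FSDP the plain run() call is enough.
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=df.to_dict('records'))
finetuner.run(epochs=1, output_dir='finetuned-model-riddle')

Launch it with the FSDP config as before: accelerate launch --config_file single-node-fsdp.yml finetune_jsonl.py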

1006076811 commented on July 29, 2024

Using the FSDP finetuner: TypeError: FineTuner.run() got an unexpected keyword argument 'accelerator_kwargs'

If you're using FSDP, you can simply remove the accelerator_kwargs argument.

Hi, with Option 1 and batch_size set to 8, training runs successfully on four 3090s, but the loss is always NaN. What could be the cause?

wangyuxinwhy commented on July 29, 2024

You can first check whether the loss is also NaN in non-distributed training. A NaN loss can have many causes and has to be debugged step by step; I don't have a lot of experience with it either, so I can't give a direct fix. When I've hit this before, I've always narrowed it down step by step by bisection.
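As a starting point, a minimal sketch of that sanity check: run the same fine-tuning on a small slice of the data in a single process (plain python, without accelerate launch) and see whether the loss is already NaN outside of distributed training. The slice size and output directory are placeholders, and batch_size is assumed to be an accepted argument:

import pandas as pd
from uniem.finetuner import FineTuner

df = pd.read_json('example_data/riddle.jsonl', lines=True)
df = df.rename(columns={'instruction': 'text', 'output': 'text_pos'})

# Small slice so the check is quick; batch_size=8 mirrors the distributed
# setup that produced NaN.
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=df.head(256).to_dict('records'))
finetuner.run(epochs=1, batch_size=8, output_dir='debug-nan-check')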
