alibaba / pai-megatron-patch Goto Github PK

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

License: Apache License 2.0

Python 98.63% Shell 1.37%

pai-megatron-patch's Introduction

Quick Start

	Megatron-LM-Dense	Megatron-Core-Dense	Megatron-Core-MoE	MegaBlocks-MoE
LLama3	ReadMe	ReadMe	N/A	N/A
LLama2	ReadMe	ReadMe	N/A	N/A
Mistral	ReadMe	ReadMe	ReadMe	N/A
Qwen1.5	ReadMe	ReadMe	ReadMe	ReadMe

Introduction

English | 简体中文

Pai-Megatron-Patch (https://github.com/alibaba/Pai-Megatron-Patch) is a deep learning training toolkit built for developers to train and predict LLMs & VLMs by using Megatron framework easily. With the continuous development of LLMs, the model structure and scale are rapidly evolving. Although these models can be conveniently manufactured using Transformers or DeepSpeed training framework, the training efficiency is comparably low. This phenomenon becomes even severer when the model scale exceeds 10 billion. The primary objective of Pai-Megatron-Patch is to effectively utilize the computational power of GPUs for LLM. This tool allows convenient training of commonly used LLM with all the accelerating techniques provided by Megatron-LM.

What's New:

Support training qwen1.5-moe models by using Megatron-Core. [🔥🔥 2024.05.13]
Support training llama3 models by using Megatron-LM and Megatron-Core. [🔥🔥 2024.04.21]
Support training qwen1.5 models by using Megatron-Core. [🔥🔥 2024.03.20]
Support training qwen1.5 models by using Megatron-LM. [🔥🔥 2024.02.28]
Support training mixtral-8x7b moe model by using Megatron-Core. [🔥🔥 2024.01.26]
Support training qwen-vl multimodel by using Megatron-LM. [🔥🔥 2023.12.15]
Support training LLava multimodel by using Megatron-LM. [🔥🔥 2023.12.01]
Support training deepseek model by using Megatron-LM. [🔥🔥 2023.11.24]
Support training qwen-72B model by using Megatron-LM. [🔥🔥 2023.11.23]
Support training Mistral-7B, Yi-6B and Codellama-34B [🔥🔥 2023.11.16]
Upgrade Megatron-LM for Llama2, qwen and baichuan2 to use transformer engine and fp8. [🔥🔥 2023.10.19]
Support training qwen-14B and baichuan2-13B model by using Megatron-LM. [🔥🔥 2023.10.08]

Highlights

Pai-Megatron-Patch is developed by the Alibaba Cloud Machine Learning Platform (PAI) algorithm team. The tool aims to assist developers in quickly getting started with Lingjun products and completing the entire development pipeline for LLM, including efficient distributed training, supervised fine-tuning, and offline model inference or verification. It has several merits as follows:

Support for multiple commonly used LLM such as llama, llama-2, codellama, deepseek, baichuan, qwen, Falcon, GLM, Starcoder, Bloom, chatglm, etc.
Support for model weight conversion: Mapping operator namespaces between Huggingface, Megatron, and Transformer Engine.
Support for FP8 training acceleration in Flash Attention 2.0 and Transformer Engine modes, ensuring training convergence.
Rich and user-friendly usage examples, offering best practices for the entire workflow of LLM pre-training, fine-tuning, evaluation, and inference, as well as reinforcement learning.

Framework

The design philosophy of Pai-Megatron-Patch is to avoid invasive modifications to the source code of Megatron-LM. In other words, it does not add new modules directly to Megatron-LM. Instead, the functions that need expansion and improvement are presented in the form of patch. This decoupling ensures that users can continue to embrace the best practices of LLM without being affected by upgrades of Megatron-LM.

Pai-Megatron-Patch includes key components for building LLM training, such as model library, tokenizers, model convertors, reinforcement learning , offline text generation, usages examples, and toolkits. The model library provides popular LLMs implemented in Megatron, such as baichuan, bloom, chatglm, falcon, galactica, glm, llama, qwen, and starcoder. More Megatron-based implementations of LLMs will be added as needed in the future. Additionally, the patch provides bidirectional conversion between Huggingface and Megatron model weights. This allows users to easily utilize Huggingface pretrained models for continued pre-training or fine-tuning in Megatron, as well as evaluating model quality using Huggingface's evaluation/inference pipelines on trained Megatron models.

In the reinforcement learning section, the patch offers PPO training workflows, enabling users to perform reinforcement learning with SFT models and RM models. Finally, the patch provides numerous usage examples to help users quickly start LLMs training and offline inference. For specific usage processes within Alibaba Cloud Lingjun products, please refer to the following link: PAI-Lingjun Intelligent Computing Service LLM solution.

Technical Reports

Contact

Use Dingtalk to scan blow QR code

License

This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.

pai-megatron-patch's People

Contributors

Stargazers

Watchers

pai-megatron-patch's Issues

Needs requirements.txt

Hi, would you release requirements.txt for this project?
I have meet Megatron and flash-attn version problem.

either train-iters or train-samples should be provided

将llama2-7b-hf转为fp8精度的megatron权重，然后加载权重进行微调，报如题错误，run_finetune_megatron_llama.sh确实没有指定该参数

megatron转huggingface报错

我将huggingface格式的qwen1.5-0.5B模型转为megatron格式之后，在megatron框架下进行训练。在将训练得到的权重转为huggingface时出现了以下错误：

2024-03-18 11:09:50.106230: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-18 11:09:50.109118: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 11:09:50.142500: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-18 11:09:50.142556: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-18 11:09:50.142599: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-18 11:09:50.149803: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-18 11:09:50.920300: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading Megatron-LM checkpoint arguments from: /mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/Qwen-1.5-0.5B_1/release/mp_rank_00/model_optim_rng.pt
Zarr-based strategies will not be registered because of missing packages
cp: target '/mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/models--Qwen--Qwen1.5-0.5B-mg2hf41-aftertrain1_test/' is not a directory
cp: cannot stat '/mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/Qwen-1.5-0.5B_1/tokenizer.model': No such file or directory
Converting
Converting embeddings
Converting transformer layers
Converting pipeline parallel rank 0
model.layers.0 input_norm weight
model.layers.0 self_attention.query_key_value weight
model.layers.0 self_attention.query_key_value bias
model.layers.0 self_attention query
Traceback (most recent call last):
  File "/mnt/raid/xujianjun/LLM_train/wjk/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen/checkpoint_reshaping_and_interoperability_qwen1.5.py", line 994, in <module>
    main()
  File "/mnt/raid/xujianjun/LLM_train/wjk/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen/checkpoint_reshaping_and_interoperability_qwen1.5.py", line 988, in main
    convert_checkpoint_from_megatron_to_transformers(args)
  File "/mnt/raid/xujianjun/LLM_train/wjk/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen/checkpoint_reshaping_and_interoperability_qwen1.5.py", line 843, in convert_checkpoint_from_megatron_to_transformers
    params = val.to(dtype)
             ^^^^^^
AttributeError: 'NoneType' object has no attribute 'to'

然后我将报错的层用‘error’为前缀打印出来，结果如下：

2024-03-18 11:10:46.456280: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-18 11:10:46.459179: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 11:10:46.492705: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-18 11:10:46.492756: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-18 11:10:46.492795: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-18 11:10:46.499963: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-18 11:10:47.273311: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading Megatron-LM checkpoint arguments from: /mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/Qwen-1.5-0.5B_1/release/mp_rank_00/model_optim_rng.pt
Zarr-based strategies will not be registered because of missing packages
cp: target '/mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/models--Qwen--Qwen1.5-0.5B-mg2hf41-aftertrain1_test/' is not a directory
cp: cannot stat '/mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/Qwen-1.5-0.5B_1/tokenizer.model': No such file or directory
Converting
Converting embeddings
Converting transformer layers
Converting pipeline parallel rank 0
model.layers.0 input_norm weight
model.layers.0 self_attention.query_key_value weight
model.layers.0 self_attention.query_key_value bias
model.layers.0 self_attention query
error: self_attention.query
model.layers.0 self_attention.dense weight
model.layers.0 self_attention dense
error: self_attention.dense
model.layers.0 post_attention_norm weight
model.layers.0 mlp.dense_h_to_4h weight
model.layers.0 mlp dense
error: mlp.dense
model.layers.0 mlp.dense_4h_to_h weight
model.layers.0 mlp dense
error: mlp.dense
model.layers.1 input_norm weight
model.layers.1 self_attention.query_key_value weight
model.layers.1 self_attention.query_key_value bias
model.layers.1 self_attention query
error: self_attention.query
model.layers.1 self_attention.dense weight
model.layers.1 self_attention dense
error: self_attention.dense
model.layers.1 post_attention_norm weight
model.layers.1 mlp.dense_h_to_4h weight
model.layers.1 mlp dense
error: mlp.dense
model.layers.1 mlp.dense_4h_to_h weight
model.layers.1 mlp dense
error: mlp.dense
model.layers.2 input_norm weight
model.layers.2 self_attention.query_key_value weight
model.layers.2 self_attention.query_key_value bias
model.layers.2 self_attention query
error: self_attention.query
model.layers.2 self_attention.dense weight
model.layers.2 self_attention dense
error: self_attention.dense
model.layers.2 post_attention_norm weight
model.layers.2 mlp.dense_h_to_4h weight
model.layers.2 mlp dense
error: mlp.dense
model.layers.2 mlp.dense_4h_to_h weight
model.layers.2 mlp dense
error: mlp.dense
model.layers.3 input_norm weight
model.layers.3 self_attention.query_key_value weight
model.layers.3 self_attention.query_key_value bias
model.layers.3 self_attention query
error: self_attention.query
model.layers.3 self_attention.dense weight
model.layers.3 self_attention dense
error: self_attention.dense
model.layers.3 post_attention_norm weight
model.layers.3 mlp.dense_h_to_4h weight
model.layers.3 mlp dense
error: mlp.dense
model.layers.3 mlp.dense_4h_to_h weight
model.layers.3 mlp dense
error: mlp.dense
model.layers.4 input_norm weight
model.layers.4 self_attention.query_key_value weight
model.layers.4 self_attention.query_key_value bias
model.layers.4 self_attention query
error: self_attention.query
model.layers.4 self_attention.dense weight
model.layers.4 self_attention dense
error: self_attention.dense
model.layers.4 post_attention_norm weight
model.layers.4 mlp.dense_h_to_4h weight
model.layers.4 mlp dense
error: mlp.dense
model.layers.4 mlp.dense_4h_to_h weight
model.layers.4 mlp dense
error: mlp.dense
model.layers.5 input_norm weight
model.layers.5 self_attention.query_key_value weight
model.layers.5 self_attention.query_key_value bias
model.layers.5 self_attention query
error: self_attention.query
model.layers.5 self_attention.dense weight
model.layers.5 self_attention dense
error: self_attention.dense
model.layers.5 post_attention_norm weight
model.layers.5 mlp.dense_h_to_4h weight
model.layers.5 mlp dense
error: mlp.dense
model.layers.5 mlp.dense_4h_to_h weight
model.layers.5 mlp dense
error: mlp.dense
model.layers.6 input_norm weight
model.layers.6 self_attention.query_key_value weight
model.layers.6 self_attention.query_key_value bias
model.layers.6 self_attention query
error: self_attention.query
model.layers.6 self_attention.dense weight
model.layers.6 self_attention dense
error: self_attention.dense
model.layers.6 post_attention_norm weight
model.layers.6 mlp.dense_h_to_4h weight
model.layers.6 mlp dense
error: mlp.dense
model.layers.6 mlp.dense_4h_to_h weight
model.layers.6 mlp dense
error: mlp.dense
model.layers.7 input_norm weight
model.layers.7 self_attention.query_key_value weight
model.layers.7 self_attention.query_key_value bias
model.layers.7 self_attention query
error: self_attention.query
model.layers.7 self_attention.dense weight
model.layers.7 self_attention dense
error: self_attention.dense
model.layers.7 post_attention_norm weight
model.layers.7 mlp.dense_h_to_4h weight
model.layers.7 mlp dense
error: mlp.dense
model.layers.7 mlp.dense_4h_to_h weight
model.layers.7 mlp dense
error: mlp.dense
model.layers.8 input_norm weight
model.layers.8 self_attention.query_key_value weight
model.layers.8 self_attention.query_key_value bias
model.layers.8 self_attention query
error: self_attention.query
model.layers.8 self_attention.dense weight
model.layers.8 self_attention dense
error: self_attention.dense
model.layers.8 post_attention_norm weight
model.layers.8 mlp.dense_h_to_4h weight
model.layers.8 mlp dense
error: mlp.dense
model.layers.8 mlp.dense_4h_to_h weight
model.layers.8 mlp dense
error: mlp.dense
model.layers.9 input_norm weight
model.layers.9 self_attention.query_key_value weight
model.layers.9 self_attention.query_key_value bias
model.layers.9 self_attention query
error: self_attention.query
model.layers.9 self_attention.dense weight
model.layers.9 self_attention dense
error: self_attention.dense
model.layers.9 post_attention_norm weight
model.layers.9 mlp.dense_h_to_4h weight
model.layers.9 mlp dense
error: mlp.dense
model.layers.9 mlp.dense_4h_to_h weight
model.layers.9 mlp dense
error: mlp.dense
model.layers.10 input_norm weight
model.layers.10 self_attention.query_key_value weight
model.layers.10 self_attention.query_key_value bias
model.layers.10 self_attention query
error: self_attention.query
model.layers.10 self_attention.dense weight
model.layers.10 self_attention dense
error: self_attention.dense
model.layers.10 post_attention_norm weight
model.layers.10 mlp.dense_h_to_4h weight
model.layers.10 mlp dense
error: mlp.dense
model.layers.10 mlp.dense_4h_to_h weight
model.layers.10 mlp dense
error: mlp.dense
model.layers.11 input_norm weight
model.layers.11 self_attention.query_key_value weight
model.layers.11 self_attention.query_key_value bias
model.layers.11 self_attention query
error: self_attention.query
model.layers.11 self_attention.dense weight
model.layers.11 self_attention dense
error: self_attention.dense
model.layers.11 post_attention_norm weight
model.layers.11 mlp.dense_h_to_4h weight
model.layers.11 mlp dense
error: mlp.dense
model.layers.11 mlp.dense_4h_to_h weight
model.layers.11 mlp dense
error: mlp.dense
model.layers.12 input_norm weight
model.layers.12 self_attention.query_key_value weight
model.layers.12 self_attention.query_key_value bias
model.layers.12 self_attention query
error: self_attention.query
model.layers.12 self_attention.dense weight
model.layers.12 self_attention dense
error: self_attention.dense
model.layers.12 post_attention_norm weight
model.layers.12 mlp.dense_h_to_4h weight
model.layers.12 mlp dense
error: mlp.dense
model.layers.12 mlp.dense_4h_to_h weight
model.layers.12 mlp dense
error: mlp.dense
model.layers.13 input_norm weight
model.layers.13 self_attention.query_key_value weight
model.layers.13 self_attention.query_key_value bias
model.layers.13 self_attention query
error: self_attention.query
model.layers.13 self_attention.dense weight
model.layers.13 self_attention dense
error: self_attention.dense
model.layers.13 post_attention_norm weight
model.layers.13 mlp.dense_h_to_4h weight
model.layers.13 mlp dense
error: mlp.dense
model.layers.13 mlp.dense_4h_to_h weight
model.layers.13 mlp dense
error: mlp.dense
model.layers.14 input_norm weight
model.layers.14 self_attention.query_key_value weight
model.layers.14 self_attention.query_key_value bias
model.layers.14 self_attention query
error: self_attention.query
model.layers.14 self_attention.dense weight
model.layers.14 self_attention dense
error: self_attention.dense
model.layers.14 post_attention_norm weight
model.layers.14 mlp.dense_h_to_4h weight
model.layers.14 mlp dense
error: mlp.dense
model.layers.14 mlp.dense_4h_to_h weight
model.layers.14 mlp dense
error: mlp.dense
model.layers.15 input_norm weight
model.layers.15 self_attention.query_key_value weight
model.layers.15 self_attention.query_key_value bias
model.layers.15 self_attention query
error: self_attention.query
model.layers.15 self_attention.dense weight
model.layers.15 self_attention dense
error: self_attention.dense
model.layers.15 post_attention_norm weight
model.layers.15 mlp.dense_h_to_4h weight
model.layers.15 mlp dense
error: mlp.dense
model.layers.15 mlp.dense_4h_to_h weight
model.layers.15 mlp dense
error: mlp.dense
model.layers.16 input_norm weight
model.layers.16 self_attention.query_key_value weight
model.layers.16 self_attention.query_key_value bias
model.layers.16 self_attention query
error: self_attention.query
model.layers.16 self_attention.dense weight
model.layers.16 self_attention dense
error: self_attention.dense
model.layers.16 post_attention_norm weight
model.layers.16 mlp.dense_h_to_4h weight
model.layers.16 mlp dense
error: mlp.dense
model.layers.16 mlp.dense_4h_to_h weight
model.layers.16 mlp dense
error: mlp.dense
model.layers.17 input_norm weight
model.layers.17 self_attention.query_key_value weight
model.layers.17 self_attention.query_key_value bias
model.layers.17 self_attention query
error: self_attention.query
model.layers.17 self_attention.dense weight
model.layers.17 self_attention dense
error: self_attention.dense
model.layers.17 post_attention_norm weight
model.layers.17 mlp.dense_h_to_4h weight
model.layers.17 mlp dense
error: mlp.dense
model.layers.17 mlp.dense_4h_to_h weight
model.layers.17 mlp dense
error: mlp.dense
model.layers.18 input_norm weight
model.layers.18 self_attention.query_key_value weight
model.layers.18 self_attention.query_key_value bias
model.layers.18 self_attention query
error: self_attention.query
model.layers.18 self_attention.dense weight
model.layers.18 self_attention dense
error: self_attention.dense
model.layers.18 post_attention_norm weight
model.layers.18 mlp.dense_h_to_4h weight
model.layers.18 mlp dense
error: mlp.dense
model.layers.18 mlp.dense_4h_to_h weight
model.layers.18 mlp dense
error: mlp.dense
model.layers.19 input_norm weight
model.layers.19 self_attention.query_key_value weight
model.layers.19 self_attention.query_key_value bias
model.layers.19 self_attention query
error: self_attention.query
model.layers.19 self_attention.dense weight
model.layers.19 self_attention dense
error: self_attention.dense
model.layers.19 post_attention_norm weight
model.layers.19 mlp.dense_h_to_4h weight
model.layers.19 mlp dense
error: mlp.dense
model.layers.19 mlp.dense_4h_to_h weight
model.layers.19 mlp dense
error: mlp.dense
model.layers.20 input_norm weight
model.layers.20 self_attention.query_key_value weight
model.layers.20 self_attention.query_key_value bias
model.layers.20 self_attention query
error: self_attention.query
model.layers.20 self_attention.dense weight
model.layers.20 self_attention dense
error: self_attention.dense
model.layers.20 post_attention_norm weight
model.layers.20 mlp.dense_h_to_4h weight
model.layers.20 mlp dense
error: mlp.dense
model.layers.20 mlp.dense_4h_to_h weight
model.layers.20 mlp dense
error: mlp.dense
model.layers.21 input_norm weight
model.layers.21 self_attention.query_key_value weight
model.layers.21 self_attention.query_key_value bias
model.layers.21 self_attention query
error: self_attention.query
model.layers.21 self_attention.dense weight
model.layers.21 self_attention dense
error: self_attention.dense
model.layers.21 post_attention_norm weight
model.layers.21 mlp.dense_h_to_4h weight
model.layers.21 mlp dense
error: mlp.dense
model.layers.21 mlp.dense_4h_to_h weight
model.layers.21 mlp dense
error: mlp.dense
model.layers.22 input_norm weight
model.layers.22 self_attention.query_key_value weight
model.layers.22 self_attention.query_key_value bias
model.layers.22 self_attention query
error: self_attention.query
model.layers.22 self_attention.dense weight
model.layers.22 self_attention dense
error: self_attention.dense
model.layers.22 post_attention_norm weight
model.layers.22 mlp.dense_h_to_4h weight
model.layers.22 mlp dense
error: mlp.dense
model.layers.22 mlp.dense_4h_to_h weight
model.layers.22 mlp dense
error: mlp.dense
model.layers.23 input_norm weight
model.layers.23 self_attention.query_key_value weight
model.layers.23 self_attention.query_key_value bias
model.layers.23 self_attention query
error: self_attention.query
model.layers.23 self_attention.dense weight
model.layers.23 self_attention dense
error: self_attention.dense
model.layers.23 post_attention_norm weight
model.layers.23 mlp.dense_h_to_4h weight
model.layers.23 mlp dense
error: mlp.dense
model.layers.23 mlp.dense_4h_to_h weight
model.layers.23 mlp dense
error: mlp.dense
Converting final layernorm
Converting LM head
Conversion from Megatron-LM to Transformers is done!
Saving config
Model weights saved in /mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/models--Qwen--Qwen1.5-0.5B-mg2hf41-aftertrain1_test/pytorch_model.bin


NOTE: Conversion is completed! Make sure to copy *.json files from Huggingface dir to /mnt/raid/xujianjun/LLM_train/wjk/Megatron-LM/checkpoints/models--Qwen--Qwen1.5-0.5B-mg2hf41-aftertrain1_test/
Exclude the model.safetensors.index.json file!
0 min 20 sec

训练和转换权重时设置的tp和pp都是4和1

关于超出序列长度截断

请问一下，现在我在这个框架上进行继续预训练，序列长度设为4096，但是有很多数据都超过了这个长度。框架具有自动拼接的功能吗？还是我需要手动把较长的数据分割，较短的数据拼接到一起？

Baichuan2 13B预训练过程的TP相关报错

想请教一下大佬们，我在一台有8张A100的机器上上进行实验，TP设置的是2，PP设置的是1。运行过程中报了如下错误。
同样的转化模型如果用简单的精调格式的json不会报错，但是用mmap格式的会出现报错。我的执行命令是：
sh run_pretrain_megatron_baichuan.sh dsw /mnt/home/gyx/Megatron-LM-f77/ /mnt/home/gyx/Pai-Megatron-Patch-main/ 13B 1 8 1e-5 1e-6 2048 2048 0 bf16 2 1 sel true false true false 100 /mnt/home/gyx/wudao_mmap/wudao_baichuanbpe_content_document /mnt/tenant-home_speed/gyx/model/baichuan2-13b-megatron-tp2-pp1_1031 100000 1000 /mnt/tenant-home_speed/gyx/model/output_baichuan2_megatron_tp2_pp1_1031

我的报错如下：
File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply

File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 245, in forward
return super().apply(*args, **kwargs) # type: ignore[misc]
return super().apply(*args, **kwargs) # type: ignore[misc] File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 245, in forward

File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 245, in forward
return ScatterToSequenceParallelRegion.apply(input)
File "/usr/local/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 245, in forward
return split_along_first_dim(input)
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size
return split_along_first_dim(input)
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size
return split_along_first_dim(input)
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size
return split_along_first_dim(input)
File "/mnt/home/gyx/Megatron-LM-f77/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size

AttributeError: 'torch._C._distributed_c10d.Options' object has no attribute 'config'

环境：
torch 2.0.1+cu118
python 3.10
Pai-megatron-patch 0.5.2
Megatron-LM 23.06_pt_23.09分支

启动脚本：
使用llama2-13B模型进行megatron微调，调用的脚本为
sh run_pretrain_megatron_llama.sh \ dsw \ /mnt/nas/gyj/software/Megatron-LM \ /mnt/nas/gyj/Pai-Megatron-Patch-main \ 13B \ 1 \ 8 \ 1e-5 \ 1e-6 \ 2048 \ 2048 \ 0 \ bf16 \ 1 \ 1 \ sel \ true \ true \ true \ 100000 \ ${DATASET_DIR} \ ${SOURCE_MODEL_PATH} \ 10000000000 \ 100000000 \ ${OUT_MODEL_PATH}

问题：
发生如下报错：
Traceback (most recent call last): File "/mnt/nas/gyj/Pai-Megatron-Patch-main/examples/llama2/pretrain_megatron_llama.py", line 96, in <module> pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain pretrain(train_valid_test_datasets_provider, File "/mnt/nas/gyj/Pai-Megatron-Patch-main/megatron_patch/training.py", line 86, in pretrain initialize_megatron(extra_args_provider=extra_args_provider, File "/mnt/nas/gyj/software/Megatron-LM/megatron/initialize.py", line 76, in initialize_megatron initialize_megatron(extra_args_provider=extra_args_provider, File "/mnt/nas/gyj/software/Megatron-LM/megatron/initialize.py", line 76, in initialize_megatron finish_mpu_init() File "/mnt/nas/gyj/software/Megatron-LM/megatron/initialize.py", line 56, in finish_mpu_init _initialize_distributed() File "/mnt/nas/gyj/software/Megatron-LM/megatron/initialize.py", line 185, in _initialize_distributed mpu.initialize_model_parallel(args.tensor_model_parallel_size, File "/mnt/nas/gyj/software/Megatron-LM/megatron/core/parallel_state.py", line 172, in initialize_model_parallel nccl_options.config.cga_cluster_size = int(os.getenv("MEGATRON_CORE_DP_CGA_GROUPS", 4)) AttributeError: 'torch._C._distributed_c10d.Options' object has no attribute 'config'

RLHF is not supported by the Patch in Megatron-LLM but is adapted to deepspeed lib

https://github.com/alibaba/Pai-Megatron-Patch/blob/main/rlhf/README.md

The introduction means we have to convert the ckpt to hf model, then use the demo scripts of deepspeed examples or trl, which is not scalable.
The core advantages of Megatron is scaling model size, however the patch introduction is too tricky.

ModuleNotFoundError: No module named 'megatron.data.gpt_dataset'

您好，我在跑run_finetune_megatron_llama.sh时遇到这个问题，看了一下megatron/data下面确实没有gpt_dataset.py函数

百川2 的max-z loss，为啥代码里只有前向部分，没有反向部分？

我们团队自己实现了一个，可以一起讨论一下：
baichuan-inc/Baichuan2#220

how to install the training envirenments?

TransformerLayer.init() got an unexpected keyword argument 'apply_query_key_layer_scaling'

感谢作者大佬们的开源贡献哈。
基于Llama-2-13b-ms(https://modelscope.cn/models/modelscope/Llama-2-13b-ms/files)模型，适配之后，使用model_convertor.sh转换成功，如下图

再使用run_pretrain_megatron_llama.sh训练，指令如下：
sh run_pretrain_megatron_llama.sh dsw /git_model/Pai-Megatron-Patch 13B 1 8 1e-5 1e-6 1024 80 0 fp16 1 1 sel true false false true 100000 /llama2-datasets/wudao /Llama-2-13b-ms-to-megatron-tp1-pp1 100000000 10000 /output_megatron_llama2

报错，不匹配如下：

building LLamaTokenizer tokenizer ...
Traceback (most recent call last):
File "/git_model/Pai-Megatron-Patch/examples/llama2/pretrain_megatron_llama.py", line 126, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/git_model/Pai-Megatron-Patch/megatron_patch/training.py", line 109, in pretrain
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
File "/git_model/Pai-Megatron-Patch/megatron_patch/training.py", line 327, in setup_model_and_optimizer
model = get_model(model_provider_func, model_type)
File "/git_model/Pai-Megatron-Patch/megatron_patch/training.py", line 256, in get_model
model = model_provider_func(
File "/git_model/Pai-Megatron-Patch/examples/llama2/pretrain_megatron_llama.py", line 38, in model_provider
model = GPTModel(
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/gpt_model.py", line 71, in init
self.language_model, self._language_model_key = get_language_model(
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/language_model.py", line 77, in get_language_model
language_model = TransformerLanguageModel(
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/language_model.py", line 395, in init
self.encoder = ParallelTransformer(
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/transformer.py", line 1630, in init
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/transformer.py", line 1630, in
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/git_model/Pai-Megatron-Patch/megatron_patch/model/llama2/transformer.py", line 1556, in build_layer
return transformer_engine.pytorch.TransformerLayer(
TypeError: TransformerLayer.init() got an unexpected keyword argument 'apply_query_key_layer_scaling'

训练baichuan2 13b 报 KeyError: 'instruction'

building BaichuanTokenizer tokenizer ...
Traceback (most recent call last):
File "/GPU/Pai-Megatron-Patch/examples/baichuan2/pretrain_megatron_baichuan.py", line 125, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/GPU/Pai-Megatron-Patch/megatron_patch/training.py", line 133, in pretrain
= build_train_valid_test_data_iterators(
File "/GPU/Pai-Megatron-Patch/Megatron-LM-main/megatron/training.py", line 1005, in build_train_valid_test_data_iterators
build_train_valid_test_data_loaders(
File "/GPU/Pai-Megatron-Patch/Megatron-LM-main/megatron/training.py", line 963, in build_train_valid_test_data_loaders
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
File "/GPU/Pai-Megatron-Patch/Megatron-LM-main/megatron/training.py", line 936, in build_train_valid_test_datasets
return build_train_valid_test_datasets_provider(train_val_test_num_samples)
File "/GPU/Pai-Megatron-Patch/examples/baichuan2/pretrain_megatron_baichuan.py", line 107, in train_valid_test_datasets_provider
build_pretrain_dataset_from_original(args.dataset)
File "GPU/Pai-Megatron-Patch/megatron_patch/data/init.py", line 69, in build_pretrain_dataset_from_original
train_dataset = LLamaRawDataset(args.train_data_path, args.max_padding_length)
File "/Pai-Megatron-Patch/megatron_patch/data/llama.py", line 52, in init
sources = [
File "/Pai-Megatron-Patch/megatron_patch/data/llama.py", line 54, in
else prompt_no_input.format_map(example)
KeyError: 'instruction'

`
之前还有一个 attention_softmax_in_fp32 找不到，我删了代码

No such file or directory: '/mtn/workplace/qwen-ckpts/qwen-14b-hf-to-megatron-tp2-pp1/release/mp_rank_00/model_optim_rng.pt

按照教程将模型从hf格式转换成megaton格式。然后单机多卡预训练。报错：
[Errno 2] No such file or directory: '/mtn/workplace/qwen-ckpts/qwen-14b-hf-to-megatron-tp2-pp1/release/mp_rank_00/model_optim_rng.pt'。该目录下实际文件为model_rng.pt

本地运行镜像，不成功

指令：
sudo docker run pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch-training:2.0-ubuntu20.04-py3.10-cuda11.8-megatron-patch-llm

只输出了下面内容：

== NVIDIA TensorRT ==

NVIDIA Release 22.09 (build 44877791)
NVIDIA TensorRT Version 8.5.0
Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh. To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_opensource.sh -b
See https://github.com/NVIDIA/TensorRT for more information.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

Got a bug during the pretrain of chatglm.

Traceback (most recent call last):
  File "pretrain_megatron_chatglm.py", line 22, in <module>
    from megatron_patch.data.finetune_dataset import ChatGLMDataset
ModuleNotFoundError: No module named 'megatron_patch.data.finetune_dataset'

Got this error while do pretrain on chatglm. Is the chatglm still under development?

支持rwkv

请问一下可以支持rwkv模型训练吗？https://github.com/BlinkDL/RWKV-LM

您好，请问什么时候可以支持baichuan2呀～

比如在megatron中支持max-z loss， normhead等，还有支持baichuan2中的alibi + flash attention 2的结合。
另外，因为我之前也实现了megatron中的llama模型实现，其中huggingface的llama权重中没有model.layers.X.post_attention_layernorm.bias，并且post_attention_layernorm 应该是RMSNorm，而不是LayerNorm ～

Finetune Qwen-72B

请问采用Pai-Megatron-Patch微调Qwen-72B的最小训练资源要求是多少？

Mixtral model convert no "convert_checkpoint_from_megatron_to_transformers" function

hi,
The script Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/mixtral/model_convertor.sh give a parameter "--convert_checkpoint_from_megatron_to_transformers", however, these no "convert_checkpoint_from_megatron_to_transformers" function in checkpoint_reshaping_and_interoperability.py.

Mistral持续预训练的脚本是否支持pp>=2？

我在使用最新版的Pai-Megatron-Patch对Mistral-7b-v0.1进行持续预训练。

当我设置tp=1/pp=2或者tp=1/pp=4或者tp=2/pp=2时，均出现了下述错误：

File "/my_workspace/Pai-Megatron-Patch/megatron_patch/model/mistral/modeling_attn_mask_utils.py", line 198, in _prepare_4d_causal_attention_mask    
attention_mask, input_shape[-1], key_value_length, dtype=inputs_embeds.dtype                                                                   AttributeError: 'NoneType' object has no attribute 'dtype'

我的机器有8块显卡，我在_prepare_4d_causal_attention_mask函数内打印了：print(attention_mask.device，input_shape，inputs_embeds)
得到如下结果:

cuda:0 (1, 2048) 一个Tensor
cuda:1 (1, 2048) 一个Tensor
cuda:2 (1, 2048) 一个Tensor
cuda:3 (1, 2048) 一个Tensor
cuda:0 (1, 2048) None
cuda:1 (1, 2048) None
cuda:2 (1, 2048) None
cuda:3 (1, 2048) None

我自始至终未能看到来自cuda:4~cuda:7的输出。

当我设置tp=2 or 4/pp=1时，不会遇到这个错误。

我希望确认一下，目前Pai-Megatron-Patch是否能支持在pp>=2的情况下训练Mistral-7b？

qwen 7B 增量预训练，模型加载完，卡在 dataloader 部分, seq_length=0

一直卡在数据处理位置，处理时间很长，3m数据十分钟都没处理完，debug到此处的 seq_length=0 ：

Pretrain megatron qwen-7b-tp4-pp1 报错 151851 is not divisible by 4

AssertionError: 151851 is not divisible by 4
Traceback (most recent call last):
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/examples/qwen/pretrain_megatron_qwen.py", line 125, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/training.py", line 109, in pretrain
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/training.py", line 327, in setup_model_and_optimizer
model = get_model(model_provider_func, model_type)
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/training.py", line 256, in get_model
model = model_provider_func(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/examples/qwen/pretrain_megatron_qwen.py", line 38, in model_provider
model = GPTModel(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/model/qwen/gpt_model.py", line 71, in init
self.language_model, self._language_model_key = get_language_model(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/model/qwen/language_model.py", line 77, in get_language_model
language_model = TransformerLanguageModel(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/model/qwen/language_model.py", line 363, in init
self.embedding = Embedding(self.hidden_size,
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/megatron_patch/model/qwen/language_model.py", line 161, in init
self.word_embeddings = tensor_parallel.VocabParallelEmbedding(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/Megatron-LM-main/megatron/core/tensor_parallel/layers.py", line 171, in init
) = VocabUtility.vocab_range_from_global_vocab_size(
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/Megatron-LM-main/megatron/core/tensor_parallel/utils.py", line 110, in vocab_range_from_global_vocab_size
per_partition_vocab_size = divide(global_vocab_size, world_size)
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/Megatron-LM-main/megatron/core/utils.py", line 22, in divide
ensure_divisibility(numerator, denominator)
File "/home/nlp/shi/megatron/Pai-Megatron-Patch/Megatron-LM-main/megatron/core/utils.py", line 16, in ensure_divisibility
assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)

个人觉得应该是 configuration_qwen.py 中 vocab_size 的问题。我看官方的Qwen7B词表为151936，但是即使我修改了文件中的vocab_size 为 151936之后再次尝试，依然是这个报错。请问该如何解决。

qwen/hf2mcore_1.5_v2.py将hf转为mcore格式，当tp大于1时会报错

如题，使用7B，设置tp=2，pp=1，转换时报如下错：
File "/ossfs/workspace/megatron-moe/Pai-Megatron-Patch-0408-train/toolkits/model_checkpoints_convertor/qwen/hf2mcore_1.5_v2.py", line 461, in save_mgmodel
viewed = v.view(args.num_query_groups, -1, head_dim, args.hidden_size)
RuntimeError: shape '[32, -1, 128, 4096]' is invalid for input of size 12288

Will Pai-Megatron-Patch support PEFT and quantization?

Thanks for your hard work and the contribution! I'm wondering if Pai-Megatron-Patch will support PEFT (Parameter Efficient Fine-Tuning) techniques such as LoRA, as well as quantization techniques such as QLoRA, GPT-Q, or AWQ in the near future? It would be of great help. Thanks a lot!

存在一些bug，目前mistral llama13 sft能跑通-已解决

存在一些bug，目前mistral llama13 sft能跑通

finetune qwen1.4-4B with tp=2, failed when load model with embedding_shape.

training qwe1.5-4b with tp=2 failed with embedding-table load error.

I start train job as follow.

convert model

I convert the qwen1.5-4B model to tp=2 and pp=1,with command

/workspace/llm_train/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen# bash hf2megatron_convertor.sh  ../../../Megatron-LM-231007/ /hf_cache/hub/Qwen1.5-4B   /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21/ 2 1 qwen1.5 0 false
Zarr-based strategies will not be registered because of missing packages
Converting
converting embedding layer
converting transformer layers
0 min 20 sec

start training job

then i start the train job with

set -ex
export WORK_DIR=/workspace/llm_train
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen1_5
sh run_finetune_megatron_qwen.sh \
        qs \
        ${WORK_DIR}/Pai-Megatron-Patch \
        4B 1 1e-5 1e-6 80 81 \
        1 \
        bf16 \
        2 1 sel true true true false  \
        /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-train.json \
        /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-valid.json \
        /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21  \
        2 \
        ${WORK_DIR}/output_patch_test

the start job log show

+ cd /workspace/llm_train/Pai-Megatron-Patch/examples/qwen1_5
+ sh run_finetune_megatron_qwen.sh qs /workspace/llm_train/Pai-Megatron-Patch 4B 1 1e-5 1e-6 80 81 1 bf16 2 1 sel true true true false /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-train.json /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-valid.json /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21 2 /workspace/llm_train/output_patch_test /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-train.json /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-valid.json
torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 63954 ../llama2/finetune_megatron_llama.py --load /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21 --save /workspace/llm_train/output_patch_test/checkpoint/qs-finetune-megatron-llama-4B-lr-1e-5-ep-2-bs-1-seqlen-80-pr-bf16--do-true-tp-2-ac-sel-sp-true --train-data /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-train.json --valid-data /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-valid.json --train-data-path /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-train.json --valid-data-path /workspace/llm_train/qwen-datasets/alpaca_zh-qwen-valid.json --num-layers 40 --hidden-size 2560 --num-attention-heads 20 --seq-length 80 --max-position-embeddings 80 --ffn-hidden-size 6912 --keep-last --micro-batch-size 1 --epochs 2 --lr 1e-5 --min-lr 1e-6 --lr-decay-style cosine --weight-decay 0.1 --clip-grad 1.0 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.01 --num-workers 0 --log-interval 1 --eval-interval 1000 --eval-iters 10 --save-interval 1000000 --tensorboard-queue-size 1 --dataset LLama-SFT --tensorboard-dir /workspace/llm_train/output_patch_test/tensorboard/qs-finetune-megatron-llama-4B-lr-1e-5-ep-2-bs-1-seqlen-80-pr-bf16--do-true-tp-2-ac-sel-sp-true_2024.04.17-12.11.14 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --finetune --no-load-optim --no-load-rng --seed 1234 --max-padding-length 81 --extra-vocab-size 1 --patch-tokenizer-type LLamaTokenizer --swiglu --normalization RMSNorm --use-llama2-rotary-position-embeddings --position-embedding-type rope --untie-embeddings-and-output-weights --rotary-base 1000000 --rotary-scale-factor 1 --bf16 --load /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21 --recompute-activations --use-distributed-optimizer --use-flash-attn --sequence-parallel
WARNING:torch.distributed.run:

training error

  storage = bucket.data.storage()._untyped()
 loading release checkpoint from /hf_cache/hub/models--Qwen--Qwen1.5-4B-Chat-hf2mg21
Traceback (most recent call last):
  File "/workspace/llm_train/Pai-Megatron-Patch/examples/qwen1_5/../llama2/finetune_megatron_llama.py", line 88, in <module>
    finetune(train_valid_datasets_provider=train_valid_datasets_provider,
  File "/workspace/llm_train/Pai-Megatron-Patch/megatron_patch/finetune_utils.py", line 310, in finetune
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/workspace/llm_train/Pai-Megatron-Patch/Megatron-LM-231007/megatron/training.py", line 382, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
  File "/workspace/llm_train/Pai-Megatron-Patch/Megatron-LM-231007/megatron/checkpointing.py", line 586, in load_checkpoint
    model[0].load_state_dict(state_dict['model'], strict=strict)
  File "/workspace/llm_train/Pai-Megatron-Patch/megatron_patch/model/llama2/gpt_model.py", line 132, in load_state_dict
    self.language_model.load_state_dict(state_dict, strict=strict)
  File "/workspace/llm_train/Pai-Megatron-Patch/megatron_patch/model/llama2/language_model.py", line 603, in load_state_dict
    self.embedding.load_state_dict(state_dict_, strict=strict)
  File "/workspace/llm_train/Pai-Megatron-Patch/megatron_patch/model/llama2/language_model.py", line 285, in load_state_dict
    self.word_embeddings.load_state_dict(state_dict_, strict=strict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2040, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
        size mismatch for weight: copying a param with shape torch.Size([75968, 2560]) from checkpoint, the shape in current model is torch.Size([75822, 2560]).

RuntimeError: The size of tensor a (120) must match the size of tensor b (119) at non-singleton dimension 2

examples/llama2/run_finetune_megatron_llama.sh, 执行报错:
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-hdp/wangxiang27/code/Pai-Megatron-Patch/megatron_patch/model/llama2/rotary_pos_embedding.py", line 91, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (120) must match the size of tensor b (119) at non-singleton dimension 2

配置: dp = 2, pp = 2, tp = 2
#sh run_finetune_megatron_llama.sh dsw ../.. 7B 1 1e-5 1e-6 120 120 0 bf16 2 2 sel true false false false /mnt/llama2-datasets/alpaca_data.json /mnt/llama2-datasets/alpaca_data.json /mnt/llama2-ckpts/Llama-2-7b-hf-to-mg-tp2-pp2/ 2 /mnt/output_patch_test

微调llama时遇到的的bug

您好，我在运行run_finetune_megatron_llama.sh时遇到了下面的bug：

Traceback (most recent call last):
  File "finetune_megatron_llama.py", line 5, in <module>
    from megatron_patch.data.finetune_dataset import LLamaDataset
ModuleNotFoundError: No module named 'megatron_patch.data.finetune_dataset'

好像与issue 119是一样的问题？

微调多轮对话的语料格式是什么

多机多卡相关

目前看到的示例是单机多卡的，想请教一下大佬们，如果想运行多级多卡实验应该怎么配置呢？

代码在本地单机上是否能用

4个Qwen1.5微调代码中的问题

您好，我在学习和测试项目代码的时候，发现有如下的几个地方用项目原始代码跑不通，想请您看一下是否是我的使用方式有问题：

使用https://github.com/alibaba/Pai-Megatron-Patch/blob/main/toolkits/model_checkpoints_convertor/qwen/checkpoint_reshaping_and_interoperability_qwen1.5.py 将HF的模型转换为Megatron模型之后，需要手动将tokenizer的merges.txt复制过来，否则无法正确加载tokenizer
https://github.com/alibaba/Pai-Megatron-Patch/blob/main/megatron_patch/tokenizer/__init__.py#L104 这里读取的vocab_size和模型实际的embed_token维度不同，会导致后续无法加载模型权重。Qwen1.5的tokenizer词数(151643)和模型的embed_tokens维度(152064)对不上
使用pipeline parallel（PP为2）时，第二个GPU上的第一个layer拿到的hidden state会多一维，检查下来发现是hidden_states = self.input_tensor 这一行读取的input_tensor的第一维多了一维。这个问题目前不知道怎么解决
PP改为1后，第3个问题暂时不再出现，但是此时会在megatron_patch/training.py的第344行update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)处报错RuntimeError: Unknown layout，这个问题也不知道怎么解决。

希望上述问题能得到您的回答，谢谢～

使用huggingface + deepspeed stage 3训练codellama 报错，stage2 可以，这是为什么？

assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 1247, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {565}, 'ds_tensor.shape': torch.Size([0])}

Baichuan2 13B pretrain alibi_attn_mask TP=2 pp=4

开发者你们好，我在落地该项目时发现，baichuan2 (num_head = 40) 在TP=2的时候 query_layer的num_head = 20, 而alibi_attn_mask = 40，请问是不是alibi_attn_mask没有进行切分。由于每个头的mask值都不同，所以存在报错。
对应代码部分： baichuan2/transformer.py/CoreAttention/forward 304行左右，attention_scores = attention_scores + attention_mask。（版本 Commit#26）

Pai-Megatron-Patch和Megatron-LM分别应该选择什么版本？

尊敬的开发者们你们好，跑了你们的examples/llama2/run_finetune_megatron_llama.sh脚本，根据你们脚本里的命令示例，选了Megatron-LM23.04版本，但是一直报错说多了一个seq-length参数，请问你们这两个代码库Pai-Megatron-Patch和Megatron-LM分别应该选择什么版本呢？

感谢回复

如何实现基于Iter的Finetune？

如下图所示，咱框架给出的微调示例是基于epoch实现的，这不支持梯度累积功能；请问咱框架中，如何实现基于Iter的Finetune训练呢？

模型转换脚本从megatron转为huggingface格式存在不匹配

使用llama2模型转换脚本，从megatron转为huggingface格式时，显示转换失败：
Traceback (most recent call last):
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/chenxiang22/e2e/llm/llm_train_envir/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/llama/checkpoint_reshaping_and_interoperability.py", line 1102, in
main()
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/chenxiang22/e2e/llm/llm_train_envir/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/llama/checkpoint_reshaping_and_interoperability.py", line 1096, in main
convert_checkpoint_from_megatron_to_transformers(args)
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/chenxiang22/e2e/llm/llm_train_envir/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/llama/checkpoint_reshaping_and_interoperability.py", line 1019, in convert_checkpoint_from_megatron_to_transformers
out_name = megatron_to_transformers[op_name]
KeyError: 'self_attention.query'

采用megatron-deepspeed训练的llama2模型，k，q，v权重矩阵名称为：'self_attention.query.weight': torch.Size([2048, 2048]), 'self_attention.key_value.weight': torch.Size([1024, 2048])
但是看到转换脚本中只对self_attention.query_key_value做了转化，是没有考虑到这种情况吗，有没有其他解决方式？

pretrain_mcore_llama.py 报错

报错信息为
Traceback (most recent call last): File "/mnt/workspace/wangweikuan/Pai-Megatron-Patch/examples/llama2/pretrain_mcore_llama.py", line 271, in <module> pretrain( File "/mnt/cpfs2/wangweikuan/Megatron-LM/megatron/training.py", line 218, in pretrain model, optimizer, opt_param_scheduler = setup_model_and_optimizer( File "/mnt/cpfs2/wangweikuan/Megatron-LM/megatron/training.py", line 484, in setup_model_and_optimizer model = get_model(model_provider_func, model_type) File "/mnt/cpfs2/wangweikuan/Megatron-LM/megatron/training.py", line 368, in get_model model = model_provider_func( File "/mnt/workspace/wangweikuan/Pai-Megatron-Patch/examples/llama2/pretrain_mcore_llama.py", line 67, in model_provider model = GPTModel( File "/mnt/cpfs2/wangweikuan/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 89, in __init__ rotary_interleaved=self.config.rotary_interleaved, AttributeError: 'TransformerConfig' object has no attribute 'rotary_interleaved'

怀疑因为这个文件history上，从moe改过来，有些地方没改正导致的？
from megatron_patch.model.mixtral.transformer_config import TransformerConfig from megatron_patch.model.mixtral.layer_specs import get_gpt_layer_with_transformer_engine_spec
mixtral.transformer_config中是没有rotary_interleaved的，但是megatron/core/transformer/transformer_config是有的

Minimum training example for Qwen 1.5 72B

Hi, would you be able to share an example training script for Qwen 1.5 72B? Is it significantly different to Qwen 1?

Ideally, I'd like both a script for full fine-tuning and for LoRA/QLoRA.

Also, what are the minimum requirements for training a model of this size? I.e. how many A100/A800s would we need to fine-tune Qwen 1.5 72B?

I see that Qwen 1 72B requires 4 A100 as described in #95

Thanks and 谢谢!

有支持Gemma的计划吗

PAI有支持Gemma框架的打算吗

Please update Falcon :)

Training baichuan2 13b, when pp>1, it will nanloss when flash attention is turned off.

It might be an issue with the attention mask.

The startup script is:
sh run_pretrain_megatron_baichuan.sh dsw /workdir/fengyu05/Pai-Megatron-Patch 13B 1 8 1e-5 1e-6 4096 4096 0 bf16 4 2 sel true false true false 100000 /mnt/dolphinfs/hdd_pool/docker/share/llama_efficient/comparison_gpt4_data_en.json /mnt/dolphinfs/hdd_pool/docker/user/fengyu05/model/baichuan-13b-base-hf-to-megatron-tp4-pp2_1 100000000 10000 output_baichuan2

error is:

Traceback (most recent call last):
    assert not total_norm.isnan(), (
  File "/workdir/fengyu05/Pai-Megatron-Patch/examples/baichuan2/pretrain_megatron_baichuan.py", line 101, in <module>
AssertionError: Rank 3: found NaN in local grad norm in backwards pass. Device: 3, node: set-zw04-kubernetes-pc133.mt
    pretrain(train_valid_test_datasets_provider,
  File "/workdir/fengyu05/Pai-Megatron-Patch/megatron_patch/training.py", line 152, in pretrain
    iteration = train(forward_step_func,
  File "/workdir/fengyu05/Pai-Megatron-Patch/megatron_patch/training.py", line 652, in train
    train_step(forward_step_func,
  File "/workdir/fengyu05/Pai-Megatron-Patch/megatron_patch/training.py", line 385, in train_step
    update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workdir/fengyu05/Pai-Megatron-Patch/Megatron-LM-main/megatron/optimizer/optimizer.py", line 329, in step
    grad_norm = self.clip_grad_norm(self.clip_grad,
  File "/workdir/fengyu05/Pai-Megatron-Patch/Megatron-LM-main/megatron/optimizer/optimizer.py", line 111, in clip_grad_norm
    return clip_grad_norm_fp32(
  File "/workdir/fengyu05/Pai-Megatron-Patch/Megatron-LM-main/megatron/optimizer/clip_grads.py", line 98, in clip_grad_norm_fp32
    assert not total_norm.isnan(), (
AssertionError: Rank 1: found NaN in local grad norm in backwards pass. Device: 1, node: set-zw04-kubernetes-pc133.mt

Qwen 72B 中 megatron 和 huggingface 的不一致

在https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen/run_pretrain_megatron_qwen.sh 中，rotary-seq-len-interpolation-factor=4， RotaryEmbedding的base=10000

但是在huggingface 的config 中RotaryEmbedding 的base=1000000

请问那种方式更贴合预训练的配置？

When will support for converting Megatron models to Hugging Face model with pipeline parallelism enabled be available for Qwen1.5 models?

I noticed that the hf2mcore_1.5_v2.py script in Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen doesn't currently support pipeline parallelism. Are there any plans to add this feature?

fp8精度权重转换

您好，想问一下，我使用hf2te.sh转换llama权重之后做了微调，得到的megatron权重怎么再转回为hf的格式呢，有可以推荐的方法吗

模型训练所需显存显著高于MegatronDeepspeed框架

在使用咱们框架训练Yi6B模型时，我遇到了以下的情况：
同样使用4块A800显卡和长度填充到6000token的训练数据，做无数据并行、无模型并行、4张量并行的分布式训练，在MegatronDeepspeed框架micro-batch-size可以设置为8，而在咱们的框架中仅能设置为1，请问这种资源消耗情况正常吗？
有何方法能在尽量保证训练效果的同时减少显存占用吗？

请问训练如何resume？

TensorCore利用率低

你好，
pretrain qwen1.5-7B发现tensorcore利用率只有不到20%，远远低于megatron-LM的50%+，请问有哪些配置可提升利用率吗？

error: unrecognized arguments: --DDP-impl local

Hi, I'm trying to run run_finetune_huggingface_chatglm.sh
but failed with this error
it seems like --DDP-impl is not a megatron arguments

finetune_huggingface_chatglm.py: error: unrecognized arguments: --DDP-impl local
[2023-09-18 08:19:24,158] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 986) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

执行解压缩模型报错：tar -zxf qwen-7b-hf-to-mg-tp1-pp1.tgz

我按照文档操作模型，解压缩后报错。
文档：
https://developer.aliyun.com/article/1389735
指令：
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-ckpts/qwen-7b-hf-to-mg-tp1-pp1.tgz
tar -zxf qwen-7b-hf-to-mg-tp1-pp1.tgz

报错：
tar zxvf qwen-7b-hf-to-mg-tp1-pp1.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now