internlm / internevo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Home Page: https://internevo.readthedocs.io/zh-cn/latest/?badge=latest

License: Apache License 2.0

Shell 0.45% Python 96.89% Makefile 0.16% C++ 1.31% Cuda 1.19%
deepspeed-ulysses gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism

internevo's Introduction

InternEvo

Latest News 🔥

  • 2024/08/29: InternEvo supports streaming datasets in Hugging Face format. Added detailed instructions on the data flow.

  • 2024/04/17: InternEvo supports training models on NPU-910B clusters.

  • 2024/01/17: To delve deeper into the InternLM series of models, please check InternLM in our organization.

Introduction

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs and fine-tuning on a single GPU, while achieving remarkable performance optimizations. InternEvo achieves nearly 90% acceleration efficiency when training on 1024 GPUs.

Based on the InternEvo training framework, we are continually releasing a variety of large language models, including the InternLM-7B series and InternLM-20B series, which significantly outperform numerous renowned open-source LLMs such as LLaMA and other leading models in the field.

Installation

First, install the specified versions of torch, torchvision, torchaudio, and torch-scatter. For example:

pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

Install InternEvo:

pip install InternEvo

Install flash-attention (version v2.2.1):

If you need to use flash-attention to accelerate training, and it is supported in your environment, install as follows:

pip install flash-attn==2.2.1
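If the wheels installed cleanly, a quick sanity check (a minimal sketch, assuming a CUDA machine) is:

import torch, torchvision, flash_attn

print(torch.__version__)          # expect 2.1.0+cu118
print(torchvision.__version__)    # expect 0.16.0+cu118
print(flash_attn.__version__)     # expect 2.2.1
print(torch.cuda.is_available())  # expect True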

For more detailed information about the installation environment or installing from source, please refer to the Install Tutorial.

Quick Start

Train Script

First, prepare the training script as train.py.

For a more detailed explanation, please refer to the Training Tutorial.

Data Preparation

Second, prepare data for training or fine-tuning.

Download a dataset from Hugging Face; take the roneneldan/TinyStories dataset as an example:

huggingface-cli download --repo-type dataset --resume-download "roneneldan/TinyStories" --local-dir "/mnt/petrelfs/hf-TinyStories"

Fetch the tokenizer to a local path. For example, download special_tokens_map.json, tokenizer.model, tokenizer_config.json, tokenization_internlm2.py and tokenization_internlm2_fast.py from https://huggingface.co/internlm/internlm2-7b/tree/main to the local directory /mnt/petrelfs/hf-internlm2-tokenizer.
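Alternatively, the same files can be fetched with the huggingface_hub Python API (a minimal sketch mirroring the paths above; hf_hub_download is the standard helper):

from huggingface_hub import hf_hub_download

files = [
    "special_tokens_map.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "tokenization_internlm2.py",
    "tokenization_internlm2_fast.py",
]
for f in files:
    # download each tokenizer file into the local tokenizer directory
    hf_hub_download(
        repo_id="internlm/internlm2-7b",
        filename=f,
        local_dir="/mnt/petrelfs/hf-internlm2-tokenizer",
    )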

Then modify the configuration file as follows:

TRAIN_FOLDER = "/mnt/petrelfs/hf-TinyStories"
data = dict(
    type="streaming",
    tokenizer_path="/mnt/petrelfs/hf-internlm2-tokenizer",
)

For the preparation of other dataset types, please refer to the Usage Tutorial.

Configuration File

The content of the configuration file follows 7B_sft.py.

For a more detailed introduction, please refer to the Usage Tutorial.

Train Start

Training can be started in a Slurm or torch distributed environment.

On Slurm, using 2 nodes and 16 GPUs, the command is as follows:

$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py

On torch, using 1 node and 8 GPUs, the command is as follows:

$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"

System Architecture

Please refer to the System Architecture document for architecture details.

Feature Zoo

InternEvo Feature Zoo

Data
  • Tokenized
  • Streaming

Parallel
  • ZeRO 1.5
  • 1F1B Pipeline Parallel
  • PyTorch FSDP Training
  • Megatron-LM Tensor Parallel (MTP)
  • Megatron-LM Sequence Parallel (MSP)
  • Flash-Attn Sequence Parallel (FSP)
  • Intern Sequence Parallel (ISP)

Tool
  • Memory Profiling

Common Tips

  • Parallel Computing Loss: link

Contribution

We appreciate all the contributors for their efforts to improve and enhance InternEvo. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.

Acknowledgements

InternEvo codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternEvo and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, flash-attention and ColossalAI.

Citation

@misc{2023internlm,
    title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
    author={InternLM Team},
    howpublished = {\url{https://github.com/InternLM/InternLM}},
    year={2023}
}

internevo's People

Contributors

00index, blankde, del-zhenwu, gaoyang07, harold-lkk, hellock, huangting4201, jiaopl, kimmishi, kkscilife, leeeizhang, li126com, lvhan028, mwiacx, pryest, sallyjunjun, solenoidwgt, sunpengsdu, vansin, x54-729, yhcc, yingtongxiong, ywmditto, zachtzy, zaglc, zehuichen123, zhangxc11, zhjunqin, zigzagcai, zwwwayne


internevo's Issues

[Bug] Error when fine-tuning InternLM on Ascend 910

Describe the bug

Traceback (most recent call last):
  File "/root/miniconda3/envs/internLM/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/root/miniconda3/envs/internLM/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/envs/internLM/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/envs/internLM/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/envs/internLM/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
BrokenPipeError: [Errno 32] Broken pipe

(The same traceback was printed concurrently by multiple pool workers, so the lines were interleaved in the original log.)

Environment

python==3.8
torch==2.0.1

Other information

No response

[Bug] Fine-tuning fails with the MoE config

Describe the bug

Thank you very much for your work!
I ran into a problem when using the code for SFT. It runs fine without the MoE config, but fails after switching to the MoE config file.
Command:

torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_MoE4_sft.py --launcher "torch"

Error message:

Traceback (most recent call last):
  File "train.py", line 324, in <module>
    main(args)
  File "train.py", line 105, in main
    model = initialize_model()
  File "/root/wbq/internlm_moe/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/root/wbq/internlm_moe/InternEvo/internlm/train/pipeline.py", line 167, in initialize_model
    model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(gpc.config.model))
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 584, in build_model_with_moe_cfg
    return _build_generic_model_1d(num_layers=num_layers, num_chunks=num_chunks, **cfg)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 482, in _build_generic_model_1d
    chunk = PackedFlashInternLm1D(**filter_kwargs(PackedFlashInternLm1D.__init__, kwargs)).to(device)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 356, in __init__
    [
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 357, in <listcomp>
    PackedFlashBaseLayer1D(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 94, in __init__
    self.mixer = MHA(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/multi_head_attention.py", line 364, in __init__
    self.rotary_emb = RotaryEmbedding(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/embedding.py", line 287, in __init__
    self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
TypeError: arange() received an invalid combination of arguments - got (int, int, int, dtype=torch.dtype, device=device), but expected one of:
 * (Number end, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, *, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, Number step, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
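For reference, the quoted line from embedding.py runs on stock torch 2.1 when dim and base are plain Python numbers and device is a torch.device or None (a minimal standalone check, not a diagnosis of this issue's root cause):

import torch

dim, base = 128, 10000
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# the line from embedding.py, reproduced standalone
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
print(inv_freq.shape)  # torch.Size([64])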

Environment

torch==2.1.0+cu118
transformers<4.30.0
sentencepiece
numpy
tqdm
psutil
packaging
pre-commit
ninja
gputil
pytest
boto3
botocore
torch-scatter
pyecharts
py-libnuma
pynvml
tensorboard

Other information

1. I only modified the training set and test set paths in ./configs/7B_MoE4_sft.py.

[Feature] Should we remove the other dependencies on flash-attention?

Describe the feature

Should we remove the other dependencies on flash-attention and keep only the core attention-related ops?

If possible, we could then install flash-attention with pip alone, avoiding a lot of compilation.

To determine whether this is feasible, we need to check whether it would significantly reduce training performance.

Will you implement it?

  • I would like to implement this feature and create a PR!

[Feature] define a new config named "use_packed_dataset"

Describe the feature

Currently, we only use "use_flash_attention" to indicate that the dataset is packed. In the future, we need to extend this ability to multiple chips, so we should define a new configuration named use_packed_dataset to control this logic in the training system instead of always relying on "use_flash_attention". The default value would be true.
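A minimal sketch of the proposed switch (the flag name comes from this request and is not yet a shipped option):

data = dict(
    # proposed: decouple packing from the attention implementation
    use_packed_dataset=True,  # proposed default: True
    # previously this behaviour was implied by use_flash_attention
)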

Will you implement it?

  • I would like to implement this feature and create a PR!

[Feature] Relax dependency version constraints

Describe the feature

The repository currently pins strict versions for its pip dependencies, which is inconvenient for broader use. The constraints should be relaxed step by step.

Will you implement it?

  • I would like to implement this feature and contribute the code to InternLM!

[Feature] only overlap sync_grad in pp0 with pipeline parallelism

Describe the feature

Only overlap sync_grad in pp0 with pipeline parallelism.

If the network is poor, sync_grad may be the main performance bottleneck, and the pipeline bubble can be huge if we overlap sync_grad with computation, since pp0 must wait for the communication of the other stages in the current implementation.

Will you implement it?

  • I would like to implement this feature and create a PR!

An error always occurs when building the Docker environment and cannot be resolved

Describe the bug

When running make -f docker.Makefile BASE_OS=ubuntu20.04, an error always occurs at the step [intrenlm-dev 3/3] RUN git submodule update --init --recursive and cannot be resolved.

Environment

ERROR: failed to solve: process "/bin/sh -c git submodule update --init --recursive && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt && cd /InternLM/third_party/flash-attention && /opt/conda/bin/python setup.py install && cd ./csrc && cd fused_dense_lib && /opt/conda/bin/pip install -v . && cd ../xentropy && /opt/conda/bin/pip install -v . && cd ../rotary && /opt/conda/bin/pip install -v . && cd ../layer_norm && /opt/conda/bin/pip install -v . && cd ../../../../ && cd ./third_party/apex && /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ && /opt/conda/bin/pip cache purge && rm -rf ~/.cache/pip" did not complete successfully: exit code: 1
make: *** [docker.Makefile:103: devel-image] Error 1

Other information

Is there any way to solve this problem?
Is the provided Docker build file correct?
As for the step RUN git submodule update --init --recursive: which part of the Dockerfile does it belong to? I would like to comment it out first and install it afterwards.

[Bug] support profiling on NPU

Describe the bug

We need to switch the profiler automatically; currently we force activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]. A sketch of the desired switch follows.
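A minimal sketch (generic PyTorch; the NPU branch is an assumption and depends on the torch_npu build, so only the CUDA case is shown):

import torch

activities = [torch.profiler.ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(torch.profiler.ProfilerActivity.CUDA)
# an NPU backend would append its own ProfilerActivity here instead

with torch.profiler.profile(activities=activities) as prof:
    torch.randn(512, 512) @ torch.randn(512, 512)
print(prof.key_averages().table(row_limit=5))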

Environment

PyTorch2.1

Other information

No response

[Bug] internlm docker image issue

Describe the bug

The container version of internlm has not been built successfully. After pulling the image and entering the container, it was found that it is not possible to directly use the container for training and inference. The environment inside the container also differs significantly from the actual runtime environment needed.

Environment

Same as above.

Other information

No response

[Bug] Training fails with indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Describe the bug

2024-04-19 06:06:37,071 INFO writer.py:60 in init_tb_writer -- Login tensorboard logs to: RUN/7b_internlm2_train/04-19-06.06.02/tensorboards
2024-04-19 06:06:37,761 ERROR train.py:307 in <module> -- Raise exception from c394df8c9997 with rank id: 0
Traceback (most recent call last):
  File "/data/InternEvo/train.py", line 305, in <module>
    main(args)
  File "/data/InternEvo/train.py", line 215, in main
    _, _, loss = trainer.execute_schedule(
  File "/data/InternEvo/internlm/core/trainer.py", line 213, in execute_schedule
    return self._schedule.forward_backward_step(self._engine, data_iter, **kwargs)
  File "/data/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 220, in forward_backward_step
    _output, _loss, _moe_loss = self._train_one_batch(
  File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 125, in _train_one_batch
    output = self._call_engine(engine, data)
  File "/data/InternEvo/internlm/core/scheduler/base_scheduler.py", line 86, in _call_engine
    return engine(**inputs)
  File "/data/InternEvo/internlm/core/engine.py", line 164, in __call__
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/core/naive_amp.py", line 155, in forward
    out = self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/model/modeling_internlm2.py", line 934, in forward
    hidden_states = self.tok_embeddings(input_ids)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/model/modules/embedding.py", line 66, in forward
    output = F.embedding(input_, self.weight, self.padding_idx, *self.embed_args, **self.embed_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97371d5617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f973719098d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9737286518 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f973868a150 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f973868df78 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7f97386a47bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f97386a4ac8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6bf0 (0x7f97a7af3bf0 in /usr/local/gcc-10.2.0/lib64/libstdc++.so.6)
frame #8: + 0x8609 (0x7f97d2bff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f97d29ca133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelect
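This assertion typically fires when some token id is greater than or equal to the embedding table's vocabulary size, which is consistent with the failing F.embedding call above (an assumption, not a confirmed diagnosis). A quick host-side check, as a sketch:

import torch

def check_token_ids(input_ids: torch.Tensor, vocab_size: int) -> None:
    # raise on the host before the kernel can trip the device-side assert
    max_id = int(input_ids.max())
    if max_id >= vocab_size:
        raise ValueError(f"max token id {max_id} >= vocab size {vocab_size}")

check_token_ids(torch.tensor([[1, 5, 99]]), vocab_size=100)  # passes silently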

Environment

Software: official Ubuntu image
Hardware: A800

Other information

No response

Upgrade the CUDA version to support flash-attention on Windows

Describe the feature

The flash-attention builds currently available for Windows depend on cu121 + py310 + torch2.1, while InternEvo depends only on cu118. The two libraries therefore conflict, making training on Windows impossible. Are there plans to upgrade to cu121? Thanks!

Will you implement it?

  • I would like to implement this feature and contribute the code to InternLM!

[Typo] `schedulder` -> `scheduler`

Describe the question.

A typo is found when loading and saving scheduler states:

scheduler_states = llm_load(os.path.join(ckpt_path, "schedulder.pt"))

Maybe it would be better to aggregate all such constants into a single module, so that a global rename only requires editing one file (see the sketch after the link below)?

https://github.com/huggingface/transformers/blob/efdd436663436e78d8ad3213d11325d86578db95/src/transformers/trainer.py#L246-L253
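A minimal sketch of the suggestion (the constant names below are hypothetical, not InternEvo's actual code):

# constants.py: one place to fix a name like "schedulder.pt"
SCHEDULER_STATE_NAME = "scheduler.pt"
OPTIMIZER_STATE_NAME = "optimizer.pt"

# usage elsewhere:
# scheduler_states = llm_load(os.path.join(ckpt_path, SCHEDULER_STATE_NAME))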

[Bug] StopIteration occurs when there is not enough data

Describe the bug

When my data is not enough to run the whole total_step, a StopIteration error is raised. Although a try-except is used outside, next(train_state.batch_sampler_iter) at line 512 still goes out of bounds: train_state.batch_sampler is iterated along with its iterator, so even after train_state.batch_sampler_iter is re-assigned, it still overruns. A sketch of a possible fix follows.
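One possible fix, as a sketch with hypothetical names mirroring the report: keep a pristine copy of the sampler and rebuild both the sampler and its iterator when the data runs out.

import copy

def next_batch(train_state, pristine_sampler):
    try:
        return next(train_state.batch_sampler_iter)
    except StopIteration:
        # the old sampler was consumed together with its iterator, so
        # re-deriving the iterator alone would overrun again
        train_state.batch_sampler = copy.deepcopy(pristine_sampler)
        train_state.batch_sampler_iter = iter(train_state.batch_sampler)
        return next(train_state.batch_sampler_iter)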

Environment

(screenshot)

Other information

It runs after making the changes shown in the attached screenshots.

[QA] Is InternLM2 supported?

Describe the question.

Currently the InternLM2 tutorials only cover the XTuner version. Has the InternEvo version not been released yet, and is there a plan for when it will be? Also, how does InternEvo differ from XTuner?

[QA] Parallel training

Describe the question.

Which parallel training modes are currently supported, and how does this training framework relate to and differ from DeepSpeed, Megatron, and FSDP? Thanks!

[Feature] torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors.

Describe the feature

***/evo_runner/_work/InternEvo/InternEvo/internlm/solver/optimizer/utils.py:389: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
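The deprecated pattern and its replacement, as a short sketch:

import torch

# deprecated: buf = torch.cuda.FloatTensor(4)  # a torch.cuda.*DtypeTensor constructor
device = "cuda" if torch.cuda.is_available() else "cpu"
buf = torch.zeros(4, dtype=torch.float32, device=device)  # recommended form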

Will you implement it?

  • I would like to implement this feature and create a PR!

[Doc] https://arxiv.org/pdf/2401.09149.pdf Typo

📚 The doc issue

Conversely, if a = 1, InternEvo utilizes 34bSH bytes for activation storage. -> Conversely, if a = 0, InternEvo utilizes 34bSH bytes for activation storage.

Suggest a potential alternative/fix

No response

[Feature] Use a consistent way to get the device

Describe the feature

Currently we use several ways to get the device:

  • internlm_accelerator.device()
  • internlm_accelerator.current_device()
  • from internlm.utils.common import get_current_device

We should use one consistent way to perform the get-device operation.

In addition, is the interface internlm_accelerator.device() really necessary?

Will you implement it?

  • I would like to implement this feature and create a PR!

[Feature] random dataset supports to define the seq_length for generation

Describe the feature

In many cases, we need full seq_length samples for performance testing with the random_dataset. Currently we have to use pack_sample_into_one to achieve this goal, which in turn requires flash_attention, since it actually generates a packed dataset. Thus the random_dataset needs the ability to generate full seq_length samples on its own.

Will you implement it?

  • I would like to implement this feature and create a PR!

[Bug] Do not use torch.cuda.current_device() as a device, since it only returns an int

Describe the bug

we have a lot of cases like following:

data = torch.empty(partition_size, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False)

where we directly use device=torch.cuda.current_device(). However, this is not recommended, since torch.cuda.current_device() only returns the device id. Such code happens to run on GPUs, but it may cause problems on NPU. A portable sketch follows.
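A portable sketch: build an explicit torch.device instead of passing the bare index (the helper internlm.utils.common.get_current_device mentioned elsewhere on this page plays this role in the repo):

import torch

# torch.cuda.current_device() returns an int such as 0, not a torch.device
if torch.cuda.is_available():
    device = torch.device(f"cuda:{torch.cuda.current_device()}")
else:
    device = torch.device("cpu")
data = torch.empty(1024, dtype=torch.float16, device=device, requires_grad=False)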

Environment

python3.8 + torch2.1

Other information

No response

[Feature] Partially frozen model support

Describe the feature

The current implementation does not support partial training, i.e., training in which part of the model parameters are frozen.
If I understand it correctly, the assertion around line 584 in hybrid_zero_optim.py requires that all parameters be involved in training.

Really looking forward to this feature being implemented, as there are many scenarios where users only want to fine-tune parts of the model. A generic sketch of the requested behaviour follows.
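What the request boils down to, as a generic PyTorch sketch (not InternEvo API): freeze a subset of parameters and hand the optimizer only the trainable remainder.

import torch

model = torch.nn.Sequential(torch.nn.Embedding(100, 32), torch.nn.Linear(32, 100))
model[0].weight.requires_grad = False  # freeze the embedding
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)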

Is there any plan on this feature?

Will you implement it?

  • I would like to implement this feature and create a PR!

[Feature] CPU synchronization Problem

Describe the feature

Some CPU synchronizations block the GPU kernels, leading to bubbles between them; see the sketch after the list below. This should be optimized in the future.

  1. item() in rotary embedding.
  2. moe_loss construction.
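A minimal illustration of the stall (generic PyTorch, not InternEvo code):

import torch

t = torch.randn(1, device="cuda" if torch.cuda.is_available() else "cpu")
value = t.item()   # copies device-to-host and synchronizes the stream
doubled = t * 2.0  # stays on-device; queued asynchronously, no sync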

Will you implement it?

  • I would like to implement this feature and create a PR!

[Bug] InternEvo installed via pip exposes no __version__

Describe the bug

Running the following code raises an error:

>>> import InternEvo
>>> print(InternEvo.__version__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'InternEvo' has no attribute '__version__'
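A workaround, as a sketch: query the installed distribution's metadata instead of a module attribute.

from importlib.metadata import version  # Python 3.8+

print(version("InternEvo"))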

Environment

python3.10

Other information

No response

[Feature] Support customized model size for training

Describe the feature

hi there,

could you give some suggestions for training small model size, such as 1B or 3B, and related configurations?

thanks a ton!
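As a purely hypothetical starting point (illustrative numbers only; derive the real config from configs/7B_sft.py and the Usage Tutorial), a ~1B variant might scale the model dict down like this:

model = dict(
    num_layers=16,           # all values are illustrative, not validated
    hidden_size=2048,
    num_attention_heads=16,
    mlp_ratio=8 / 3,
)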

Will you implement it?

  • I would like to implement this feature and contribute the code to InternLM!
