GithubHelp

neuronx-distributed's People

Contributors

akhil-aws, amazon-auto, apriyan9295, aws-anantsh, aws-kingrj, aws-maens, aws-mesharma, aws-murandoo, aws-rhsoln, hgt312, meta-project-ci, micwade-aws, yangfei1990

neuronx-distributed's Issues

torch_neuronx.xla_impl.trace._trace Inconsistent with the latest torch_neuronx-2.1.1.2.0.0b0-py3-none-any.whl

The function call to torch_neuronx.xla_impl.trace._trace at https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/trace/trace.py#L103 appears to be inconsistent with the function signature defined in torch_neuronx-2.1.1.2.0.0b0. The torch_neuronx.xla_impl.trace._trace defined there returns only 4 values, but the call site at https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/trace/trace.py#L103 unpacks 5 return values. This causes the example at https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/inference/runner.py to fail. Please investigate. Thank you!
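Until the two packages agree on the return arity, one defensive pattern on the caller side is to unpack by length rather than by a fixed count. The sketch below is plain Python with stand-in trace functions; it is not the actual torch_neuronx API, just an illustration of the compatibility shim idea:

```python
def call_trace_compat(trace_fn, *args, **kwargs):
    """Call a _trace-like function that may return 4 or 5 values.

    Older builds return 4 values; newer ones append a fifth. We pad
    with None so callers can always unpack a fixed arity of 5.
    (The names and arities here are illustrative only.)
    """
    result = tuple(trace_fn(*args, **kwargs))
    if len(result) == 4:
        result = result + (None,)  # pad the missing fifth value
    elif len(result) != 5:
        raise ValueError(f"unexpected arity {len(result)} from trace")
    return result

# Stand-in trace functions simulating the two library versions:
old_trace = lambda: ("hlo", {}, [0], [0])
new_trace = lambda: ("hlo", {}, [0], [0], "extra")

assert len(call_trace_compat(old_trace)) == 5
assert len(call_trace_compat(new_trace)) == 5
```

A real fix would of course pin matching versions of torch-neuronx and neuronx-distributed; the shim only papers over the mismatch.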

Error: "Backward sending grads, but get None"

Hi, I'm encountering an error, Backward sending grads, but get None, raised by bwd_postprocess_task() during model training. It seems that the tensor loses its requires_grad property after passing through tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups) in src/neuronx_distributed/pipeline/comm.py.

This error also happens when I tried the demo Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed) provided by the Neuron documentation.
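The failure mode described above can be modeled schematically without torch_xla: a collective that returns a fresh tensor drops the requires_grad flag, and the pipeline's backward pass then sees None grads. The sketch below uses a tiny stand-in Tensor class, not real torch, purely to illustrate the "copy the flag back onto the output" workaround idea (in real PyTorch this would be requires_grad_(), and whether that is safe for autograd depends on how the pipeline backward is wired):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Minimal stand-in for a tensor; only models the grad flag."""
    data: float
    requires_grad: bool = False

def fake_all_reduce(t: FakeTensor) -> FakeTensor:
    # Stand-in for xm.all_reduce: returns a fresh tensor and, like the
    # reported behavior, does not carry over requires_grad.
    return FakeTensor(data=t.data)

def all_reduce_preserving_grad(t: FakeTensor) -> FakeTensor:
    out = fake_all_reduce(t)
    out.requires_grad = t.requires_grad  # copy the flag from the input
    return out

t = FakeTensor(1.0, requires_grad=True)
assert not fake_all_reduce(t).requires_grad          # flag lost
assert all_reduce_preserving_grad(t).requires_grad   # flag preserved
```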

This is the log and compiler info:
simple.log

```
2024-04-01 06:59:57.748428: W torch_xla/csrc/lowering_context.cpp:71] No custom opname metadata! op_type=xla___op_TransferWithStaticRingTransfer
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
Traceback (most recent call last):
  File "run_simple_model_nxd.py", line 289, in <module>
    _mp_fn(0, args)
  File "run_simple_model_nxd.py", line 225, in _mp_fn
    train_simple_model(args)
  File "run_simple_model_nxd.py", line 188, in train_simple_model
    loss = model.run_train(
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/trainer/model.py", line 25, in run_train
    return self.module.run_train(*args, **kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 542, in run_train
    loss = self._run_train(**kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 561, in _run_train
    self._exec_schedule(self.train_scheduler)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None
```

Package versions:
(package version screenshots attached in the original issue)

Other system details:
instance: Trn1
OS: Ubuntu 20.04

If you need any other information, please let me know. Thanks.

Clean up of old checkpoints is crashing

I'm training a model using the PyTorch Lightning plugin with a limit on the number of kept checkpoints:

ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)

The problem is that when the limit defined in save_top_k is reached, PTL will (at some point) call lightning_fabric.plugins.io.torch_io.remove_checkpoint() https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86, which recursively removes the files under the oldest saved checkpoint:

fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")

But when it then tries to remove an already-removed checkpoint file (I'm using xser), it crashes:

 File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)

As you can see, more than one process is trying to remove the same file. I think it would suffice to run checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
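The suggested fix can be sketched with the standard library alone. The get_global_rank() helper below is hypothetical, not an actual neuronx-distributed or Lightning API; in practice you would query your distributed runtime. The idea is simply: delete only on global rank 0, and tolerate races where another process already removed a file:

```python
import os
import shutil

def get_global_rank() -> int:
    # Hypothetical helper: in a real setup you would ask the distributed
    # runtime; here we read a conventional RANK environment variable.
    return int(os.environ.get("RANK", "0"))

def remove_checkpoint_safely(path: str) -> None:
    """Remove a checkpoint directory only on global rank 0.

    shutil.rmtree(..., ignore_errors=True) swallows the
    FileNotFoundError races like the 'tensor_479.pt' one in the
    traceback above, so a second deleter never crashes.
    """
    if get_global_rank() != 0:
        return
    shutil.rmtree(path, ignore_errors=True)
```

In a Lightning setup this logic would live in a CheckpointIO subclass overriding remove_checkpoint(); the sketch only shows the rank gate and the error tolerance.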

Here is relevant info about my environment:

pip freeze:

neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0

Neuron libraries:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]

Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE

Hi, I found that the error MPMD detected but reload is not supported yet occurs if I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help look into this issue? Thanks so much!

(error message screenshot attached in the original issue)

I attach the related scripts here; you can simply run ./run_simple_model_tp_pp.sh after downloading them.

scripts.zip

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest (2.18)
