aws-neuron / neuronx-distributed
License: MIT No Attribution
The llama inference example needs to be updated because transformers==4.36 added a required layer_idx argument to the LlamaDecoderLayer constructor:
https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/models/llama/modeling_llama.py#L754
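For illustration, a minimal sketch of the kind of change needed, assuming the example constructs the decoder layers itself (the exact construction site in the NxD example may differ):

```python
import torch
from transformers import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Small stand-in config; the real example would use the checkpoint's config.
config = LlamaConfig(num_hidden_layers=2, hidden_size=64,
                     intermediate_size=128, num_attention_heads=4)

# transformers>=4.36: each decoder layer must receive its index, which the
# attention module uses to address the KV cache.
layers = torch.nn.ModuleList(
    [LlamaDecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)]
)
```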
The call to torch_neuronx.xla_impl.trace._trace at https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/trace/trace.py#L103 seems to be inconsistent with the function defined in torch_neuronx-2.1.1.2.0.0b0: the _trace defined there returns only 4 values, while the caller at that line unpacks 5. This breaks the example at https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/inference/runner.py. Please investigate, and thank you!
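A schematic reproduction of the failure mode (the names below are placeholders, not the real torch_neuronx signatures):

```python
# Older torch_neuronx returns four values from _trace (placeholder names):
def _trace():
    return "neff", "metaneff", "flattener", "packer"

# neuronx-distributed main unpacks five, so the call raises at runtime:
# ValueError: not enough values to unpack (expected 5, got 4)
neff, metaneff, flattener, packer, weights = _trace()
```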
Hi, I'm encountering an error, Backward sending grads, but get None, raised by bwd_postprocess_task() during model training. The tensor seems to lose its requires_grad property after passing through this line in src/neuronx_distributed/pipeline/comm.py:

```python
tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups)
```
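For reference, a minimal plain-PyTorch illustration of the mechanism I suspect: an op that runs outside autograd (as a collective typically does) returns a tensor that no longer tracks gradients.

```python
import torch

x = torch.ones(2, requires_grad=True)
with torch.no_grad():    # stand-in for a collective that bypasses autograd
    y = x * 1.0
print(y.requires_grad)   # False: downstream code expecting grads then gets None
```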
This error also happens when I try the demo Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed) from the Neuron documentation.
This is the log and compiler info:
simple.log
```
2024-04-01 06:59:57.748428: W torch_xla/csrc/lowering_context.cpp:71] No custom opname metadata! op_type=xla___op_TransferWithStaticRingTransfer
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
Traceback (most recent call last):
  File "run_simple_model_nxd.py", line 289, in <module>
    _mp_fn(0, args)
  File "run_simple_model_nxd.py", line 225, in _mp_fn
    train_simple_model(args)
  File "run_simple_model_nxd.py", line 188, in train_simple_model
    loss = model.run_train(
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/trainer/model.py", line 25, in run_train
    return self.module.run_train(*args, **kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 542, in run_train
    loss = self._run_train(**kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 561, in _run_train
    self._exec_schedule(self.train_scheduler)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None
```
Other system details:
instance: Trn1
OS: Ubuntu 20.04
If you need any other information, please let me know. Thanks.
I am following the llama2 pre-training code, and I do not understand how to freeze some of the model's parameters.
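For what it's worth, here is the standard plain-PyTorch approach (a minimal sketch with a toy model standing in for llama2; whether this plays well with NxD's parameter sharding and optimizer wrapper is exactly what I'm unsure about):

```python
import torch
from torch import nn

# Toy stand-in; in the real script this would be the NxD-wrapped llama2 model.
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))

# Freeze selected parameters before constructing the optimizer.
for name, param in model.named_parameters():
    if name.startswith("0."):  # e.g. "model.embed_tokens" in the real llama2 module tree
        param.requires_grad_(False)

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```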
I'm training a model using the PyTorch Lightning plugin with a limit on the number of kept checkpoints:

```python
ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)
```
The problem is that when the limit defined in save_top_k is reached, PTL will (at some point) call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:

```python
fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")
```
but then another rank tries to remove the already-removed checkpoint files (I'm using xser), and the run crashes (output from two ranks interleaved):

```
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)
```
As you can see, more than one process is trying to remove the same file. I think fixing this would just be a matter of running the checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1); see the sketch below.
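For example, something along these lines might work (a minimal sketch assuming a recent Lightning layout; the subclass name is mine, on older installs the imports live under pytorch_lightning.plugins.io instead, and whether rank_zero_only picks up the correct global rank under torch-xla would need verifying):

```python
from lightning.fabric.plugins.io.torch_io import TorchCheckpointIO
from lightning.fabric.utilities.rank_zero import rank_zero_only


class RankZeroRemovalCheckpointIO(TorchCheckpointIO):
    """Only global rank 0 deletes old checkpoints, avoiding the race above."""

    @rank_zero_only
    def remove_checkpoint(self, path) -> None:
        super().remove_checkpoint(path)


# Usage: trainer = pl.Trainer(..., plugins=[RankZeroRemovalCheckpointIO()])
```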
Here is relevant info about my environment:
pip freeze:
```
neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0
```
Neuron libraries:
```
aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
```
Hi, I found that the error MPMD detected but reload is not supported yet occurs when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!
I have attached the related scripts; you can simply run ./run_simple_model_tp_pp.sh after downloading them.
Environment information:
EC2 Instance: trn1.32xlarge
OS: Ubuntu 20.04
Neuron PyTorch: latest (2.18)