aws-neuron / aws-neuron-samples
Example code for AWS Neuron SDK developers building inference and training applications
License: Other
What are the samples per second and sequences per second of the BERT model when run on a Trainium 32xlarge instance?
I am getting the following error when importing xla_backend on torch-neuronx 2.1:
Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'libneuronpjrt-path'
File "/home/ubuntu/efs/git/gpt2-fsdp/policies/wrapping.py", line 5, in <module>
import torch_xla.distributed.xla_backend
File "/home/ubuntu/efs/git/gpt2-fsdp/policies/__init__.py", line 2, in <module>
from .wrapping import *
File "/home/ubuntu/efs/git/gpt2-fsdp/train_fsdp.py", line 32, in <module>
import policies
FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'
import torch_xla.distributed.xla_backend
dpkg --list | grep neuron
ii aws-neuronx-collectives 2.20.11.0-c101c322e amd64 neuron_ccom built using CMake
ii aws-neuronx-dkms 2.15.9.0 amd64 aws-neuronx driver in DKMS format.
ii aws-neuronx-oci-hook 2.2.45.0 amd64 neuron_oci_hook built using CMake
ii aws-neuronx-runtime-lib 2.20.11.0-b7d33e68b amd64 neuron_runtime built using CMake
ii aws-neuronx-tools 2.17.0.0 amd64 Neuron profile and debug tools
aws-neuronx-runtime-discovery==2.9
libneuronxla==2.0.755
neuronx-cc==2.12.68.0+4480452af
neuronx-hwm==2.12.0.0+422c9037c
torch-neuronx==2.1.1.2.0.1b0
PJRT_DEVICE=NEURON
trn1.32xlarge
uname -a
Linux ip-172-31-53-77 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
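A version mismatch between the Neuron Python packages is one possible cause of import errors like this one, so it can help to collect the installed versions in one place. The sketch below uses only the standard library; the helper names and the substring list are illustrative assumptions, not part of the Neuron SDK.

```python
from importlib import metadata

# Substrings used to pick out Neuron-related distributions (an assumption
# for illustration; adjust to the packages you care about).
NEURON_HINTS = ("neuron", "torch-xla", "torch_xla")


def filter_neuron_packages(packages):
    """Return the sorted (name, version) pairs whose name looks Neuron-related."""
    return sorted(
        (name, version)
        for name, version in packages
        if any(hint in name.lower() for hint in NEURON_HINTS)
    )


def installed_packages():
    """List (name, version) for every distribution in this environment."""
    return [
        (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip broken installs without a name
    ]


if __name__ == "__main__":
    for name, version in filter_neuron_packages(installed_packages()):
        print(f"{name}=={version}")
```

Comparing this output with the versions listed in the release notes for a single Neuron SDK release is one way to spot a package that was upgraded on its own.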
Running the notebook https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on inf2.48xlarge.
The error occurs in the last block, at line 4:
from transformers_neuronx.llama.model import LlamaForSampling
results in:
>>> from transformers_neuronx.llama.model import LlamaForSampling
2023-Sep-27 06:59:32.0474 22340:22340 ERROR TDRV:tdrv_get_dev_info No neuron device available
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/llama/model.py", line 17, in <module>
from transformers_neuronx import decoder
File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/decoder.py", line 18, in <module>
from transformers_neuronx import compiler
File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 33, in <module>
from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)
huggingface/optimum-neuron#213 suggests updating to the latest version of torch-neuronx, while aws-neuron/transformers-neuronx#33 suggests a specific version, torch-neuronx 1.13.1.1.10.0.
When I tried installing that specific version, it failed with the following error:
python -m pip install torch-neuronx==1.13.1.1.10.0 -U
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting torch-neuronx==1.13.1.1.10.0
Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.10.0-py3-none-any.whl (2.4 MB)
Requirement already satisfied: torch==1.13.* in ./aws_neuron_venv_pytorch/lib/python3.7/site-packages (from torch-neuronx==1.13.1.1.10.0) (1.13.1)
INFO: pip is looking at multiple versions of torch-neuronx to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch-xla==1.13.1+torchneurona (from torch-neuronx) (from versions: 1.0, 1.11.0+torchneuron2, 1.11.0+torchneuron3, 1.12.0+torchneuron3, 1.13.0+torchneuron3, 1.13.0+torchneuron4, 1.13.0+torchneuron5, 1.13.1+torchneuron6, 1.13.1+torchneuron7, 1.13.1+torchneuron8)
ERROR: No matching distribution found for torch-xla==1.13.1+torchneurona
Additional info on the versions available as of now:
pip index versions torch-neuronx
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-neuronx (1.13.1.1.11.0)
Available versions: 1.13.1.1.11.0, 1.13.1.1.10.1, 1.13.1.1.10.0, 1.13.1.1.9.1, 1.13.1.1.9.0, 1.13.1.1.8.0, 1.13.1.1.7.0, 1.13.0.1.6.1, 1.13.0.1.6.0, 1.13.0.1.5.0, 1.13.0.1.4.0, 1.12.0.1.4.0, 1.11.0.1.2.0, 1.11.0.1.1.1, 1.0
INSTALLED: 1.13.1.1.9.1
LATEST: 1.13.1.1.11.0
pip index versions torch-xla
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-xla (1.13.1+torchneuron8)
Available versions: 1.13.1+torchneuron8, 1.13.1+torchneuron7, 1.13.1+torchneuron6, 1.13.0+torchneuron5, 1.13.0+torchneuron4, 1.13.0+torchneuron3, 1.12.0+torchneuron3, 1.11.0+torchneuron3, 1.11.0+torchneuron2, 1.0
INSTALLED: 1.13.1+torchneuron8
LATEST: 1.13.1+torchneuron8
The following packages are installed:
anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
attrs==23.1.0
aws-neuronx-runtime-discovery==2.9
awscli==1.29.54
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
boto3==1.28.54
botocore==1.31.54
cached-property==1.5.2
cachetools==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cloud-tpu-client==0.10
colorama==0.4.4
comm==0.1.4
debugpy==1.7.0
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.10.0
entrypoints==0.4
environment-kernels==1.2.0
exceptiongroup==1.1.3
fastjsonschema==2.18.0
filelock==3.12.2
fsspec==2023.1.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.23.0
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.16.4
idna==3.4
importlib-metadata==6.7.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipykernel==6.16.2
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
islpy==2022.2.1
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.24.0
jupyter_client==7.4.9
jupyter_core==4.12.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.9
libneuronxla==0.5.413
lockfile==0.12.2
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mistune==3.0.1
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.6.0
nbformat==5.8.0
nest-asyncio==1.5.8
networkx==2.6.3
neuronx-cc==2.9.0.16+fa12ba55a
neuronx-hwm==2.9.0.1+f79d59e7b
notebook==6.5.6
notebook_shim==0.2.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
packaging==23.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pgzip==0.3.5
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
pluggy==1.2.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
Pygments==2.16.1
pyparsing==3.1.1
pyrsistent==0.19.3
pytest==7.4.2
python-daemon==3.0.1
python-dateutil==2.8.2
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.4
QtPy==2.4.0
regex==2023.8.8
requests==2.31.0
requests-unixsocket==0.3.0
rsa==4.7.2
s3transfer==0.6.2
safetensors==0.3.3
scipy==1.7.3
Send2Trash==1.8.2
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
terminado==0.17.1
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
torch==1.13.1
torch-neuronx==1.13.1.1.9.1
torch-xla==1.13.1+torchneuron8
torchvision==0.14.1
tornado==6.2
tqdm==4.66.1
traitlets==5.9.0
transformers==4.30.2
transformers-neuronx==0.7.84
typing_extensions==4.7.1
uritemplate==3.0.1
urllib3==1.26.16
wcwidth==0.2.6
webencodings==0.5.1
websocket-client==1.6.1
wget==3.2
widgetsnbextension==4.0.9
zipp==3.15.0
I could successfully compile Llama 2 7B on neuronx, following this notebook:
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
But for Llama 2 70B, I get this error at the LlamaForSampling.from_pretrained step:
ValueError: Weight with shape torch.Size([8192, 1024]) cannot be sharded along dimension 1. This results in 21 weight partitions which cannot be distributed to 20 NeuronCores evenly. To fix this issue either the model parameters or the tp_degree must be changed to allow the weight to be evenly split
I downloaded the "meta-llama/Llama-2-70b-hf" model from Hugging Face and pointed the above step to the directory 'models--meta-llama--Llama-2-70b-hf/snapshots/'.
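The error above says that the weight dimension of size 1024 cannot be split evenly across 20 NeuronCores. A rough way to shortlist candidate `tp_degree` values is to keep only those that divide every sharded dimension. The sketch below is plain arithmetic, not the check that transformers-neuronx performs internally; the function name and the upper bound of 24 cores are assumptions for illustration, and the library may impose further constraints.

```python
def candidate_tp_degrees(sharded_dims, max_cores):
    """Return every degree in 1..max_cores that divides all sharded dims."""
    return [
        degree
        for degree in range(1, max_cores + 1)
        if all(dim % degree == 0 for dim in sharded_dims)
    ]


# Dimensions taken from the error message: torch.Size([8192, 1024]).
dims = [8192, 1024]
# 20 is absent from the result because 1024 % 20 == 4.
print(candidate_tp_degrees(dims, max_cores=24))  # [1, 2, 4, 8, 16]
```

Under these assumptions, a power-of-two degree such as 8 or 16 would be the first thing to try in place of 20.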
I have changed the batch sizes of the trace tensor inputs in the hf_pretrained_sd2_512_inference.ipynb notebook. Although the text encoder, unet and vae_post_quant_conv compiled, the VAE decoder did not.
batch=2
import torch_neuronx
from diffusers import StableDiffusionPipeline
import torch
import os, copy
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512_batch2'
model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe
decoder_in = torch.randn([2, 4, 64, 64])
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)
del decoder
del decoder_neuron
I get this error message:
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 234016.96it/s]
Selecting 161763 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 137856 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 272047 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 52275 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 318165 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 8981 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 323589 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: couldn't allocate every tensor in SB
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: disabling special handling of accumulation groups
Selecting 323589 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 2233 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 325190 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: couldn't allocate every tensor in SB and spilling can't help
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: 10 biggest memlocs:
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i6_ReloadStore32338_ReloadStore166495 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i0_ReloadStore32560_Remat_166496 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166430 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_259_i0 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i1 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i6 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166431 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7_ReloadStore32024_Remat_121920_Remat_166327 65536
I used an inf2.8xlarge instance and set up 100 GB of swap space. Any ideas on this batched-input compilation problem?
prompt = '''Translate English to French:
starfish => étoile de mer
campfire => feu de camp
snowflake => flocon de neige
dragonfly => libellule
maple tree => érable
thunderstorm => orage
seashell => coquillage
waterfall => cascade
hummingbird => colibri
pine cone => pomme de pin
lighthouse => phare
dandelion => pissenlit
cheese =>
'''
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, temperature=0.1, sequence_length=200, top_p=0.9)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
I expect the translation of cheese (fromage) as the output, but instead I get the entire prompt back.
What is the Neuron equivalent of parameters such as return_full_text=False? This prompt works well in the Llama playground but not on Neuron. I don't want to generate paragraphs in the output; I am looking to use this for a text extraction task.
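Since the output shown above contains the whole prompt, one option is to post-process it: slice off the prompt tokens before decoding and keep only the first line of the completion. This is a minimal sketch assuming `sample` returns the prompt tokens followed by the new tokens; `strip_prompt` and `first_line` are hypothetical helpers, not part of transformers-neuronx.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first prompt_length token ids from every sequence."""
    return [seq[prompt_length:] for seq in generated_ids]


def first_line(text):
    """Return the first non-empty line of a completion, stripped."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Toy example with plain lists standing in for token-id tensors.
prompt_ids = [101, 7, 8, 9]
generated = [prompt_ids + [42, 43]]
print(strip_prompt(generated, len(prompt_ids)))
print(first_line(" fromage\nbread => pain"))
```

With the code above, `prompt_length` would be `input_ids.shape[1]`, and `first_line` would be applied to each decoded string so that only the answer for the last few-shot entry is kept.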
It would be useful to add some information about how to obtain the models from Hugging Face and in particular:
I'm trying to run inference for TrOCR on an Inf1 instance. I am able to compile and save the model as per the notebook, but the model currently executes on the CPU and the NeuronCores are unutilized. Please provide a way to make inference use the NeuronCores.
import torch
import torch.neuron
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-handwritten").eval()
max_length = 32
input_ids=torch.zeros([1, max_length], dtype=torch.int64)
attention_mask=torch.zeros([1, max_length], dtype=torch.int64)
encoder_hidden_states=torch.rand([1, 578, 384])
pad_size = torch.as_tensor(0)
xenc = torch.rand(1,3,384,384).float()
xdec = (input_ids, attention_mask, encoder_hidden_states, pad_size)
model.encoder.forward_neuron = torch.jit.load('troc_encoder_neuron.pt')
model.decoder.forward_neuron = torch.jit.load('troc_decoder_neuron.pt')
generated_ids = model.generate(xenc, pad_token_id=model.config.decoder.eos_token_id)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
%matplotlib inline
import os
import sys
import cv2
import urllib.request
import matplotlib.pyplot as plt
import time
if not '..' in sys.path: sys.path.append('..')
def load_sample_imgE():
    # Download the sample image once and reuse the local copy afterwards
    if not os.path.exists("text.jpg"):
        urllib.request.urlretrieve("https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg", "text.jpg")
    return cv2.imread("text.jpg")
max_len = 32
img = load_sample_imgE()
# Warm-up runs
for i in range(10):
    pixel_values = processor(img, max_length=max_length, padding='max_length',
                             truncate=True, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, pad_token_id=model.config.decoder.eos_token_id, max_length=max_len)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Timed runs
t1 = time.time()
num_inf = 100
for i in range(num_inf):
    pixel_values = processor(img, max_length=max_length, padding='max_length',
                             truncate=True, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, pad_token_id=model.config.decoder.eos_token_id, max_length=max_len)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
t2 = time.time()
print(f"Inf/sec: {num_inf/(t2-t1):.2f}")
print(generated_text)
plt.figure(figsize=(10,5))
plt.imshow(img)
Hello Everyone,
I am trying to follow the directions in https://aws.amazon.com/blogs/machine-learning/maximize-stable-diffusion-performance-and-lower-inference-costs-with-aws-inferentia2/. I am not sure what I am doing wrong and would love some help! Thanks in advance!
My environment looks as follows:
instance: inf2.8xlarge
ami: aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
> source /opt/aws_neuron_venv_pytorch/bin/activate
> jupyter nbconvert --to script hf_pretrained_sd2_512_inference.ipynb
> cp hf_pretrained_sd2_512_inference.py seth_test.py
> python seth_test.py
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 210524.91it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/Developer/run/seth_test.py:189 in <module> │
│ │
│ 186 encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE) │
│ 187 example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b │
│ 188 │
│ ❱ 189 unet_neuron = torch_neuronx.trace( │
│ 190 │ unet, │
│ 191 │ example_inputs, │
│ 192 │ compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'), │
│ │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:265 in │
│ trace │
│ │
│ 262 │ │ hlo_filename = os.path.join(model_dir, 'graph.hlo') │
│ 263 │ │ │
│ 264 │ │ # Write weights to disk │
│ ❱ 265 │ │ weight_paths = write_params(model_dir, constant_parameter_tensors) │
│ 266 │ │ │
│ 267 │ │ table = { │
│ 268 │ │ │ "model_files": "graph.hlo", │
│ │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:306 in │
│ write_params │
│ │
│ 303 │ │
│ 304 │ # Write tensor data to disk │
│ 305 │ for name, weight in weights.items(): │
│ ❱ 306 │ │ np.save(f'{directory}/weights/{name}.npy', weight.numpy()) │
│ 307 │ │
│ 308 │ # Write mapping file. Paths are relative to the directory │
│ 309 │ weight_paths = {name: f'weights/{name}.npy' for name in weights} │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Got unsupported ScalarType BFloat16
Python Dependencies:
(aws_neuron_venv_pytorch) ubuntu@ip-172-31-1-65:~/Developer/run$ pip freeze
absl-py==1.4.0
accelerate==0.16.0
aiofiles==22.1.0
aiohttp==3.8.4
aiosignal==1.3.1
aiosqlite==0.19.0
amqp==5.1.1
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
astroid==2.15.4
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
Automat==22.10.0
aws-neuronx-runtime-discovery==2.9
awscli==1.27.126
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
billiard==3.6.4.0
bleach==6.0.0
boto3==1.26.126
botocore==1.29.126
build==0.10.0
cachetools==5.3.0
celery==5.2.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
cloud-tpu-client==0.10
cloudpickle==2.2.1
cmake==3.26.3
colorama==0.4.4
comm==0.1.3
constantly==15.1.0
contourpy==1.0.7
cryptography==40.0.2
cssselect==1.2.0
cycler==0.11.0
dask==2023.4.1
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
diffusers==0.14.0
dill==0.3.6
distlib==0.3.6
docutils==0.16
dparse==0.6.2
exceptiongroup==1.1.1
executing==1.2.0
fastapi==0.95.1
fastjsonschema==2.16.3
filelock==3.12.0
fonttools==4.39.3
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.4.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.17.3
google-auth-httplib2==0.1.0
googleapis-common-protos==1.59.0
httpie==3.2.1
httplib2==0.22.0
huggingface-hub==0.14.1
hyperlink==21.0.0
idna==3.4
imageio==2.28.1
importlib-metadata==6.6.0
importlib-resources==5.12.0
incremental==22.10.0
iniconfig==2.0.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.2
ipython-genutils==0.2.0
ipywidgets==8.0.6
islpy==2022.1.1
isoduration==20.11.0
isort==5.12.0
itemadapter==0.8.0
itemloaders==1.1.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.4
jupyter_client==8.2.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.9.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.1
kiwisolver==1.4.4
kombu==5.2.4
lazy-object-proxy==1.9.0
libneuronxla==0.5.205
llvmlite==0.40.0
locket==1.0.0
lockfile==0.12.2
lxml==4.9.2
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==2.0.5
multidict==6.0.4
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.3.1
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==2.6.3
neuronx-cc==2.6.0.19+3d819e565
neuronx-hwm==2.6.0.0+826e77395
notebook==6.5.4
notebook_shim==0.2.3
numba==0.57.0
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
opencv-python==4.7.0.72
packaging==21.3
pandas==2.0.1
pandocfilters==1.5.0
parsel==1.8.1
parso==0.8.3
partd==1.4.0
pexpect==4.8.0
pgzip==0.3.4
pickleshare==0.7.5
Pillow==9.5.0
pip-tools==6.13.0
pipenv==2023.2.4
pkg_resources==0.0.0
pkgutil_resolve_name==1.3.10
platformdirs==3.5.0
plotly==5.14.1
pluggy==1.0.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
Protego==0.2.1
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.7
PyDispatcher==2.0.7
Pygments==2.15.1
pylint==2.17.3
pyOpenSSL==23.1.1
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyrsistent==0.19.3
PySocks==1.7.1
pytest==7.3.1
python-daemon==3.0.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyYAML==5.4.1
pyzmq==25.0.2
queuelib==1.6.2
regex==2023.5.5
requests==2.29.0
requests-file==1.5.1
requests-toolbelt==1.0.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.3.5
rsa==4.7.2
ruamel.yaml==0.17.22
ruamel.yaml.clib==0.2.7
s3transfer==0.6.0
safetensors==0.3.1
scikit-learn==1.2.2
scipy==1.7.3
Scrapy==2.8.0
seaborn==0.12.2
Send2Trash==1.8.2
service-identity==21.1.0
shap==0.41.0
six==1.16.0
slicer==0.0.7
sniffio==1.3.0
soupsieve==2.4.1
stack-data==0.6.2
starlette==0.26.1
tenacity==8.2.2
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tldextract==3.4.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
tomlkit==0.11.8
toolz==0.12.0
torch==1.13.1
torch-neuronx==1.13.0.1.6.1
torch-xla==1.13.0+torchneuron5
torchvision==0.14.0
tornado==6.3.1
tqdm==4.65.0
traitlets==5.9.0
transformers==4.30.2
Twisted==22.10.0
typing_extensions==4.5.0
tzdata==2023.3
uri-template==1.2.0
uritemplate==3.0.1
urllib3==1.26.15
vine==5.0.0
virtualenv==20.23.0
virtualenv-clone==0.5.7
w3lib==2.1.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.7
wrapt==1.15.0
y-py==0.5.9
yarl==1.9.2
ypy-websocket==0.8.2
zipp==3.15.0
zope.interface==6.0
When running the "Compile the model into an optimized TorchScript and save the TorchScript" step of torch-neuronx/inference/hf_pretrained_sd2_512_inference.ipynb, I got the error "Got unsupported ScalarType BFloat16".
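NumPy has no native bfloat16 dtype, which is why calling `.numpy()` on a BFloat16 tensor raises this TypeError; a common workaround is to convert the tensor to float32 before the NumPy conversion. The standard-library sketch below illustrates why that upcast loses nothing: a bfloat16 value is the upper 16 bits of a float32. The function names are illustrative, and the truncating conversion is a simplification (real converters typically round to nearest).

```python
import struct


def float32_to_bfloat16_bits(value):
    """Truncate a float32 to its upper 16 bits (the bfloat16 bit pattern)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 16


def bfloat16_bits_to_float32(bits16):
    """Widen a bfloat16 bit pattern to float32 by zero-filling the low bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


bits = float32_to_bfloat16_bits(1.5)
print(hex(bits), bfloat16_bits_to_float32(bits))
```

Widening is exact in this direction, so weights stored as bfloat16 can be written out as float32 arrays without changing their values, at the cost of twice the file size.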
I am trying to compile the MarianMT language translation model for an Inf1 instance.
Kernel version: 5.4.228-131.415.amzn2.x86_64
Instance type on which the compilation was attempted: Inf1.2xlarge, Amazon Linux 2 AMI
The following is my pip freeze output:
# pip freeze
torch==1.7.1
torch-neuron==1.7.1.2.5.8.0
transformers==4.0.1
tensorflow==1.15.5
sentencepiece==0.1.97
absl-py==1.4.0
astor==0.8.1
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
decorator==5.1.1
dmlc-nnvm==1.13.0.0+0
dmlc-topi==1.13.0.0+0
dmlc-tvm==1.13.0.0+0
exceptiongroup==1.1.0
filelock==3.9.0
gast==0.2.2
google-pasta==0.2.0
grpcio==1.51.1
h5py==2.10.0
idna==3.4
importlib-metadata==6.0.0
inferentia-hwm==1.13.0.0+0
iniconfig==2.0.0
islpy==2021.1+aws2021.x.80.0.bld0
joblib==1.2.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.4.1
MarkupSafe==2.1.2
networkx==2.5
neuron-cc==1.13.5.0+7dcf000a6
numpy==1.18.5
opt-einsum==3.3.0
packaging==23.0
Pillow==9.4.0
pluggy==1.0.0
protobuf==3.20.1
pytest==7.2.1
regex==2022.10.31
requests==2.28.2
sacremoses==0.0.53
scipy==1.4.1
six==1.16.0
tensorboard==1.15.0
tensorflow-estimator==1.15.1
termcolor==2.2.0
tokenizers==0.9.4
tomli==2.0.1
tqdm==4.64.1
typing_extensions==4.5.0
urllib3==1.26.14
Werkzeug==2.2.3
wrapt==1.14.1
zipp==3.13.0
I followed this link for [PyTorch installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuron/setup/pytorch-install.html).
Then I followed the instructions in this notebook.
I get the following compilation error
I'm trying to compile Bart for text2text generation on an Inf2 server. I am aware that optimum-neuron has a Bart implementation, but I need to be able to make customizations that are incompatible with the pipeline system.
Bart is implemented so that if you pass past_key_values, you can provide only the last decoder input ID rather than the whole string. This speeds up the attention, so that it's linear per step rather than quadratic time, because it only has to run for one position rather than all positions so far. This is an important compute optimisation.
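The saving described above can be made concrete by counting attention scores in decoder self-attention. The sketch below is a back-of-the-envelope model, not a measurement of Bart: it assumes one score per query/key pair and ignores heads, layers, and cross-attention; the function names are illustrative.

```python
def scores_without_cache(step):
    """All `step` positions are re-run, each attending to `step` keys."""
    return step * step


def scores_with_cache(step):
    """Only the newest position is run, attending to `step` keys."""
    return step


def total_scores(per_step, num_steps):
    """Sum the per-step cost over a generation of `num_steps` tokens."""
    return sum(per_step(step) for step in range(1, num_steps + 1))


for n in (8, 128):
    print(n, total_scores(scores_without_cache, n), total_scores(scores_with_cache, n))
```

The gap between the two totals grows with sequence length, which is why tracing the cached path matters for longer generations.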
When I try to trace a call to Bart that uses this optimisation, I get an error:
2023-07-04T12:38:29Z ERROR 26864 [Tensorizer]: Transformation error on operator: mlir.function
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: An Internal Compiler Error has occurred
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]:
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: Error message: too many values to unpack (expected 1)
Steps to reproduce:
from transformers import BartForConditionalGeneration, BartTokenizerFast, BartConfig
import copy
import torch
import torch.nn.functional as F
import torch_neuronx
import transformers
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base', torchscript=True)
example_sentence = "Hello, my name is Billy."
tokeniser = BartTokenizerFast.from_pretrained('facebook/bart-base')
tokens = tokeniser(example_sentence, return_tensors='pt')
inputs = (None,None, None, None, None, None, None, (torch.zeros((1, 128, 768)),), None, None, torch.zeros((1, 9, 768)))
outputs = model(*inputs)
class BartForNeuronDecoder(torch.nn.Module):
    def __init__(self, bart):
        super().__init__()
        self.bart = bart
    def forward(
        self,
        decoder_input_ids, # 1 token per batch
        encoder_outputs,
        attention_mask, # for encoder outputs
        past_key_values, # max_len - 1
    ):
        outputs = self.bart.model(
            encoder_outputs=encoder_outputs,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            past_key_values=past_key_values,
            use_cache=True,
        )
        lm_logits = self.bart.lm_head(outputs[0]) + self.bart.final_logits_bias
        return (
            lm_logits,
            outputs[1]
        )
wrapped_model = BartForNeuronDecoder(model)
def pad_key_values(past_key_values, max_len):
    padded_key_values = ()
    for layer in past_key_values:
        padded_layer = ()
        for i in [0,1]:
            padded_layer = padded_layer + (F.pad(layer[i], pad=(0,0,0, max_len - 1 - layer[i].shape[2])),)
        padded_layer = padded_layer + layer[2:]
        padded_key_values = padded_key_values + (padded_layer,)
    return padded_key_values
pkv = pad_key_values(outputs[1], 128)
args = (torch.tensor([[0]]), (torch.zeros((1, 128, 768)),), torch.tensor([[1,1,1,1,1] + [0]*123]), pkv)
wrapped_model_neuron = torch_neuronx.trace(wrapped_model, args)
Let me know if you have trouble reproducing it and need additional details.
Many thanks.
Hi, I am exporting a model using torch neuron, but I can't find any reference on how to save a custom attribute in the model.
For instance, I would like to save the dimension of the input image as an integer, so that I can read this value back with something like:
model = torch.jit.load("/path/to/my/model.pt")
image_size = model.image_size
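One way to keep such a value with an exported model, independent of any framework, is a small JSON sidecar file next to the `.pt` file. This is a minimal sketch of that idea under stated assumptions: `save_metadata`, `load_metadata`, and the `.meta.json` suffix are hypothetical names. `torch.jit.save` and `torch.jit.load` also accept an `_extra_files` argument that can embed such data in the archive itself.

```python
import json
import os
import tempfile


def metadata_path(model_path):
    """Sidecar path for a model file, e.g. model.pt -> model.pt.meta.json."""
    return model_path + ".meta.json"


def save_metadata(model_path, **attributes):
    """Write JSON-serializable attributes next to the model file."""
    with open(metadata_path(model_path), "w", encoding="utf-8") as handle:
        json.dump(attributes, handle)


def load_metadata(model_path):
    """Read back the attributes stored by save_metadata."""
    with open(metadata_path(model_path), encoding="utf-8") as handle:
        return json.load(handle)


# Usage with a placeholder path; the model file itself is not needed here.
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "model.pt")
    save_metadata(model_path, image_size=384)
    print(load_metadata(model_path)["image_size"])
```

The sidecar keeps the metadata readable without loading the model, but the two files then have to be copied and versioned together.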
TypeError Traceback (most recent call last)
Cell In[5], line 35
24 prompt = ["a photo of an astronaut riding a horse on mars",
25 "sonic on the moon",
26 "elvis playing guitar while eating a hotdog",
(...)
31 "kids playing soccer at the FIFA World Cup"
32 ]
34 # First do a warmup run so all the asynchronous loads can finish
---> 35 image_warmup = pipe(prompt[0]).images[0]
37 plt.title("Image")
38 plt.xlabel("X pixel scaling")
File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py:1174, in StableDiffusionXLPipeline.__call__(self, prompt, prompt_2, height, width, num_inference_steps, timesteps, denoising_end, guidance_scale, negative_prompt, negative_prompt_2, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds, ip_adapter_image, ip_adapter_image_embeds, output_type, return_dict, cross_attention_kwargs, guidance_rescale, original_size, crops_coords_top_left, target_size, negative_original_size, negative_crops_coords_top_left, negative_target_size, clip_skip, callback_on_step_end, callback_on_step_end_tensor_inputs, **kwargs)
1172 if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
1173 added_cond_kwargs["image_embeds"] = image_embeds
-> 1174 noise_pred = self.unet(
1175 latent_model_input,
1176 t,
1177 encoder_hidden_states=prompt_embeds,
1178 timestep_cond=timestep_cond,
1179 cross_attention_kwargs=self.cross_attention_kwargs,
1180 added_cond_kwargs=added_cond_kwargs,
1181 return_dict=False,
1182 )[0]
1184 # perform guidance
1185 if self.do_classifier_free_guidance:
File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
TypeError: NeuronUNet.forward() got an unexpected keyword argument 'timestep_cond'
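The TypeError above shows the pipeline passing a keyword (`timestep_cond`) that `NeuronUNet.forward()` does not declare. Besides aligning the diffusers version with the one the notebook was written for, one possible workaround is to let the wrapper tolerate unknown keywords. The sketch below shows the general pattern with `inspect.signature` on a stand-in function; `drop_unknown_kwargs` and `fake_forward` are hypothetical, and silently dropping an argument is only safe when its value is unused (for example `None`).

```python
import functools
import inspect


def drop_unknown_kwargs(func):
    """Wrap func so that keyword arguments it does not declare are ignored."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not accepts_any:
            # Keep only the keywords that func explicitly declares.
            kwargs = {k: v for k, v in kwargs.items() if k in params}
        return func(*args, **kwargs)

    return wrapper


@drop_unknown_kwargs
def fake_forward(sample, timestep, encoder_hidden_states=None):
    """Stand-in for a forward() with a fixed set of keywords."""
    return (sample, timestep, encoder_hidden_states)


print(fake_forward(1, 2, encoder_hidden_states=3, timestep_cond=None))
```

The same effect can be had by adding `**kwargs` to the wrapper's own `forward` signature; the decorator form just avoids editing each wrapper class.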
When I execute "hf_pretrained_sdxl_base_1024_inference", the process fails at "torch.jit.save(unet_neuron, unet_filename)" and the kernel dies while saving the file.
Hi team,
I am adapting this notebook, essentially instantiating a ControlNet pipe, such as
controlnet = ControlNetModel.from_pretrained("DionTimmer/controlnet_qrcode-control_v1p_sd15",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    safety_checker=None,
    torch_dtype=torch.float16
)
and then going for torch_neuronx.trace.
I suspect the number 1 blocker is the fact that the main libs the notebooks suggest to install
!pip install diffusers==0.14.0 transformers==4.30.2 accelerate==0.16.0 safetensors==0.3.1 matplotlib
are too "old" for ControlNet. For instance, transformers and accelerate need to be upgraded.
This triggers updates of other dependencies, and I end up with an environment that is completely different from the originally recommended one.
This causes, for instance, torch to complain about CUDA (among other things), even though we are on Inf2, which is confusing.
I tried multiple times but couldn't get very far.
I also tried with the latest Neuron release and I can't get it to work.
Any help would be massively appreciated!
What are the best ways to deploy the above model for fast inference from local machine and also support parallel requests?
I got run_clm.py to compile on trn1.32xlarge and to run the actual training. However, it reports NaN for both loss and perplexity.
Has this been observed? The directions I followed are from here
/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/core/_methods.py:178: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
100%|██████████| 2/2 [00:00<00:00, 2.55it/s]
***** eval metrics *****
epoch = 3.0
eval_loss = nan
eval_runtime = 0:00:07.21
eval_samples = 240
eval_samples_per_second = 33.28
eval_steps_per_second = 0.277
perplexity = nan
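A note on why both metrics come out as `nan` together: assuming perplexity is derived as `exp(eval_loss)`, as is usual for causal language modeling evaluation, a `nan` loss propagates directly into the perplexity. A minimal sketch (the helper name `perplexity_from_loss` is hypothetical):

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    # Perplexity is the exponential of the mean cross-entropy loss.
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Very large losses overflow a float; report infinity instead.
        return float("inf")


print(perplexity_from_loss(2.0))
print(perplexity_from_loss(float("nan")))  # nan in, nan out
```

So the perplexity value is a symptom rather than a separate failure; the thing to track down is where the loss first becomes non-finite.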
The sample shell scripts' variable names confuse devices and cores. A Trn1
instance has 16 Neuron Devices (chips), each with 2 cores.
This sample script shows, on line 31:
export NEURON_NUM_DEVICES=32
I think the correct code would be:
export NEURON_NUM_CORES=32
The deprecated Neuron Megatron example script shows it correctly:
NUM_NEURONCORES=32
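The arithmetic behind the suggested rename, using only the figures stated in this issue (the Python variable names are illustrative):

```python
# Figures from the issue text: 16 Neuron devices, each with 2 cores.
NEURON_DEVICES = 16
CORES_PER_DEVICE = 2

# 32 is therefore a core count, not a device count.
total_cores = NEURON_DEVICES * CORES_PER_DEVICE
print(total_cores)
```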
I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.
Issue
However, for my use case the input contexts are 500-3000 tokens. When I provide an example 3000 token context, there is a 10-30 second overhead before the first token is generated. After the first token, the inference speed is 50 tok/sec as expected.
Attempted fixes
I have tried the following to resolve the long context overhead:
- maxWorkers, maxBatchDelay, batchSize - no improvement
- max_length parameter to support longer sequences - no improvement
- micro_batch_size and parallelism values - no improvement
model-config.yaml
minWorkers: 2
maxWorkers: 8 #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help
handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 6
    max_length: 100
#did not help either
# micro_batching:
#     micro_batch_size: 8
#     parallelism:
#         preprocess: 4
#         inference: 1
#         postprocess: 4
pip list
torch 1.13.1+cpu
torch-model-archiver 0.9.0b20231026
torch-neuronx 1.13.1.1.12.1
torch-workflow-archiver 0.2.11b20231026
torch-xla 1.13.1+torchneuronc
transformers-neuronx 0.8.268
torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$ WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
... some time later when I call the API
2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
.....
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472
Ask
Is there something I'm missing in the config or in my use of Llama on NeuronX that would remove the long-context overhead? I would like sub-second initial token latency for 500-3000 token contexts.
The alternative is to deploy with SageMaker, but I don't have that set up because we want to rewrite inference.py to extract logits and limit Llama to constrained generation.
Let me know if any other details would be helpful in troubleshooting this. Thanks!
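One way to separate the per-reply cost from the total handler time is to look at the spacing of the `sent a reply` lines in the TorchServe log above. The sketch below is a hypothetical helper (`reply_gaps_ms` is not part of TorchServe); it assumes the `YYYY-MM-DDTHH:MM:SS,mmm` timestamp format shown in the log and that each reply corresponds to one streamed chunk.

```python
from datetime import datetime, timedelta


def reply_gaps_ms(log_lines):
    """Return the gaps (in ms) between consecutive 'sent a reply' log lines."""
    stamps = []
    for line in log_lines:
        if "sent a reply" not in line:
            continue
        # The timestamp is the first whitespace-delimited field of the line.
        stamps.append(datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S,%f"))
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(stamps, stamps[1:])]


# Three lines copied from the log excerpt above.
sample = [
    "2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false",
    "2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false",
    "2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false",
]
gaps = reply_gaps_ms(sample)
print(gaps)
```

Running the same helper over the full log, and comparing the first reply's timestamp with the request's arrival time, would give the initial-token latency directly.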
SDXL-base works perfectly on Inf2 chips, and the other SDXL pipelines (inpaint, img2img) work perfectly as well. But as far as I have read and tried, there is no support for ControlNet or IPAdapter. Are these features on the development roadmap for future Neuron releases?
I tried to run hf_pretrained_sd2_512_inference.ipynb on inf2.8xlarge with NeuronX Compiler version 2.10.0.34+6c8792c6f and got a RuntimeError when loading the model, even though compilation finished successfully.
The message is "RuntimeError: Neuron runtime cannot be initialized; cannot determine the number of available NeuronCores".
It is raised when I try to load the UNet onto the NeuronCores with the following line:
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
Any idea?
thanks
Running the notebook https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on inf2.48x
Getting this error when executing the last cell of the notebook
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[3], line 1
----> 1 neuron_model.to_neuron()
File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:117, in LlamaForSampling.to_neuron(self)
115 self.decoder_lm_head_for_context = {}
116 for context_length_estimate in self.context_buckets:
--> 117 model = self.decoder_lm_head.build_weight_shared(
118 n_positions_list=[context_length_estimate],
119 n_active_tokens=context_length_estimate,
120 unroll=self.context_unroll,
121 share_caches=True,
122 )
123 # PERF: No latency improvement seen in multi-layer models from executor
124 if self.context_unroll == self.config.num_hidden_layers:
File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:157, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
155 ln_lm_head_params.append(new.lm_head_bias)
156 new.program = new._build_program()
--> 157 new.program.setup(new.layers, ln_lm_head_params)
158 return new
File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:983, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
982 def setup(self, layers, ln_lm_head_params):
--> 983 super().setup(layers, ln_lm_head_params)
984 for npos, memory in zip(self.n_positions_list, self.memories):
985 input_tensors = [*self.input_buffers]
File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:879, in DecoderProgram.setup(self, layers, ln_lm_head_params)
876 kernel.neff_bytes = future.result()
878 for kernel in self.kernels:
--> 879 kernel.load()
File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/compiler.py:375, in ParallelKernel.load(self)
374 def load(self):
--> 375 assert self.neff_bytes is not None, f"Try to load with neff bytes as None, might due to compilation failure"
376 self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
377 self.model.load()
AssertionError: Try to load with neff bytes as None, might due to compilation failure
After running the code up to the compilation step, the compiled models do not exist. The compilation logs indicate that the process completes without errors, but the expected model file model.pt is missing from the directory sd2_compile_dir_768/unet/.
source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate
python3 test3.py
The model file model.pt should be present in the directory sd2_compile_dir_768/unet/ after the compilation process completes.
The model file model.pt is missing from the directory sd2_compile_dir_768/unet/.
2024-05-30T09:32:51Z Running birverifier
2024-05-30T09:32:52Z birverifier finished after 1.166 seconds
2024-05-30T09:32:52Z Running codegen
2024-05-30T09:32:57Z isa_gen finished after 4.293 seconds
2024-05-30T09:32:58Z dma_desc_gen finished after 1.495 seconds
2024-05-30T09:33:01Z debug_info_gen finished after 2.790 seconds
2024-05-30T09:33:02Z codegen finished after 9.213 seconds
2024-05-30T09:33:02Z Running neff_packager
2024-05-30T09:33:29Z neff_packager finished after 27.627 seconds
Traceback (most recent call last):
File "/home/ubuntu/test3.py", line 124, in <module>
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/jit/_serialization.py", line 152, in load
raise ValueError(f"The provided filename {f} does not exist") # type: ignore[str-bytes-safe]
ValueError: The provided filename sd2_compile_dir_768/unet/model.pt does not exist
| Key | Value |
|---|---|
| Repository | aws-neuron-samples |
| Template Used | hf_pretrained_sd2_768_inference.ipynb |
| Script | test.py (for compilation) |
Where can I find the list of parameters supported by model.generate (the Hugging Face generate support used in the last step)? I want the output to exclude any text from the prompt.
model_cpu = LlamaForCausalLM.from_pretrained('models--meta-llama--Llama-2-13b-hf/')
model_neuron = neuron_model
# Use HuggingFaceGenerationModelAdapter to access the generate API
model = HuggingFaceGenerationModelAdapter(model_cpu.config, model_neuron)
tokenizer = AutoTokenizer.from_pretrained('models--meta-llama--Llama-2-13b-hf/')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
model.reset_generation()
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
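On the second part of the question: for decoder-only models, Hugging Face `generate` returns the prompt tokens followed by the newly generated ones, so a common way to drop the prompt is to slice each output row at the prompt length before decoding. Below is a minimal sketch that uses plain Python lists in place of tensors; `strip_prompt` is a hypothetical helper, and that the adapter above returns the same layout as stock `generate` is an assumption.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the first prompt_len token ids from every generated row."""
    return [row[prompt_len:] for row in output_ids]


# Toy ids: a 3-token prompt followed by 2 newly generated tokens.
prompt_ids = [[101, 7592, 1010]]
output_ids = [[101, 7592, 1010, 2023, 2003]]
new_tokens = strip_prompt(output_ids, len(prompt_ids[0]))
print(new_tokens)
```

With tensors, the equivalent slice would be `sample_output[:, encoded_input.input_ids.shape[1]:]`, which can then be passed to `tokenizer.batch_decode(..., skip_special_tokens=True)`.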
With the latest 0.5 version and Neuron SDK 2.12, some tutorials such as https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/gpt-j-6b-sampling.ipynb hit the error "RuntimeError: __init__() expected at most 3 argument(s) but received 5 argument(s)":
Downloading (…)lve/main/config.json: 100%|██████████| 930/930 [00:00<00:00, 264kB/s]
Downloading pytorch_model.bin: 100%|██████████| 24.2G/24.2G [04:59<00:00, 80.9MB/s]
.....
Compiler status PASS
....
Compiler status PASS
....
Compiler status PASS
....
Compiler status PASS
Traceback (most recent call last):
File "gptj.py", line 28, in <module>
neuron_model.to_neuron()
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 72, in to_neuron
self.program.setup(self.transformer.h, self.ln_lm_head)
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/program.py", line 102, in setup
kernel.load()
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 376, in load
self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
RuntimeError: __init__() expected at most 3 argument(s) but received 5 argument(s). Declaration: __init__(__torch__.torch.classes.neuron.ParallelModel _0, str _1, int _2) -> NoneType _0
My environment is an AWS inf2.8xlarge instance:
python : 3.8.10
torch-neuronx : 2.1.1.2.0.1b0
neuronx-cc : 2.12.68.0+4480452af
I'm trying to compile an ESRGAN PyTorch model for Neuron, but I have an issue.
from PIL import Image
import requests
import torch
import torch_neuronx
from torchvision import models
from torchvision.transforms import functional
from modules.esrgan_upscale import upscale_model_loader
import os
os.environ["NEURON_CC_FLAGS"] = "-O1"
# load the model
model = upscale_model_loader('modules/weight/4x-Ultrasharp.pth')
model.eval()
# Get an example input
image = Image.open('/home/ubuntu/diffusers-ultimate-upscale/testIm.png')
image = image.convert('RGB')
image = functional.to_tensor(image)
image = torch.unsqueeze(image, 0)
# Run inference on CPU
output_cpu = model(image)
# Compile the model
model_neuron = torch_neuronx.trace(model, image,compiler_args=['--optlevel','1'])
# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)
When I run this code, it first prints this log:
2024-02-20T13:36:54Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1
I want to compile with -O1 because of this error (yes, the compilation failed):
[XTP002] Too many instructions after unroll for function sg0000! - Compiling under --optlevel=1 may result in smaller graphs. If you are using a transformer model, try using a smaller context_length_estimate value.
I can't get the optlevel flag set to 1, even after changing the module code like this:
command = [
    neuron_cc,
    "compile",
    filename,
    "--framework",
    "XLA",
    "--target",
    "trn1",
    "--output",
    neff_filename,
    "--optlevel",
    "1"
]
command.extend(compiler_args)
What should I do to compile with --optlevel=1 using torch_neuronx.trace?
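When it is unclear which optimization level actually reaches the compiler, it can help to inspect the final argument list just before it is executed. The helper below is a hypothetical debugging aid, not part of torch_neuronx or neuronx-cc. It collects every optimization-level setting in a command list, accepting the `--optlevel 1`, `--optlevel=1`, and `-O1` spellings that appear in this issue. The file names in the example are placeholders, and which occurrence the compiler honors when several are present is not something this sketch can tell you.

```python
def find_optlevels(command):
    """Return every optimization level found in a compiler argument list."""
    levels = []
    args = [str(a) for a in command]
    for i, arg in enumerate(args):
        if arg == "--optlevel" and i + 1 < len(args):
            levels.append(args[i + 1])            # form: "--optlevel", "1"
        elif arg.startswith("--optlevel="):
            levels.append(arg.split("=", 1)[1])   # form: "--optlevel=1"
        elif arg.startswith("-O") and arg[2:].isdigit():
            levels.append(arg[2:])                # form: "-O1"
    return levels


# Placeholder command mirroring the modified module code above.
command = ["neuronx-cc", "compile", "model.hlo.pb", "--framework", "XLA",
           "--target", "trn1", "--output", "model.neff", "--optlevel", "1"]
command.extend(["--optlevel", "1"])  # what compiler_args would append
found = find_optlevels(command)
print(found)
```

Printing this from inside the patched module, right before the subprocess call, shows whether the flag is missing, duplicated, or given with conflicting values.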
The Llama 13B notebook runs fine on an inf2.48x instance. When running it on inf2.24x, I reduced tp_degree from 24 to 12, but the code throws an error at the following step:
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=12, amp='f16')
neuron_model.to_neuron()
Error
FileNotFoundError: [Errno 2] No such file or directory: 'neuronx-cc'
Is this notebook supported on a 24x instance? Or what else might be missing? The environment setup is the same in both cases.
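A `FileNotFoundError` for `neuronx-cc` usually means the Python process could not find that executable on its `PATH`, which can happen, for example, when a notebook kernel runs outside the virtual environment where the compiler is installed. That this is the cause here is an assumption; the standard-library check below is a quick way to test it from the same process that calls `to_neuron()`. `find_executable` is a hypothetical wrapper name.

```python
import shutil
import sys


def find_executable(name="neuronx-cc"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
print("interpreter:", sys.executable)
print("neuronx-cc:", path if path else "not found on PATH")
```

If the compiler is not found, comparing `sys.executable` with the interpreter of the environment where `neuronx-cc` was installed is a reasonable next step.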
I am trying to load the model for inference, following the steps given in the reference notebook: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
# load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()
I receive the following error:
{
"name": "RuntimeError",
"message": "Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
",
"stack": "---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
\"\"\"
Traceback (most recent call last):
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 411, in compile
self.build(tag=tag)
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 418, in build
self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 95, in compile_hlo_module
neff_bytes = neuron_xla_compile(module_bytes, flags, input_format=\"hlo\", platform_target=\"trn1\",
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/__init__.py\", line 38, in neuron_xla_compile
_neuron_cc_wrapper.neuron_xla_compile(
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 234, in neuron_xla_compile
done = check_neff(compile_cache, neff_path,
File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 77, in check_neff
raise(RuntimeError(error_log))
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
\"\"\"
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb Cell 7 line 1
<a href='vscode-notebook-cell://ssh-remote%2Binf2-instance-deployment-testing/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb#X13sdnNjb2RlLXJlbW90ZQ%3D%3D?line=8'>9</a> # load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
<a href='vscode-notebook-cell://ssh-remote%2Binf2-instance-deployment-testing/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb#X13sdnNjb2RlLXJlbW90ZQ%3D%3D?line=9'>10</a> neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
---> <a href='vscode-notebook-cell://ssh-remote%2Binf2-instance-deployment-testing/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb#X13sdnNjb2RlLXJlbW90ZQ%3D%3D?line=10'>11</a> neuron_model.to_neuron()
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
120 self.decoder_lm_head_for_context = {}
121 for context_length_estimate in self.context_buckets:
--> 122 model = self.decoder_lm_head.build_weight_shared(
123 n_positions_list=[context_length_estimate],
124 n_active_tokens=context_length_estimate,
125 unroll=self.context_unroll,
126 share_caches=True,
127 )
128 # PERF: No latency improvement seen in multi-layer models from executor
129 if self.context_unroll == self.config.num_hidden_layers:
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
161 ln_lm_head_params.append(new.lm_head_bias)
162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
164 return new
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
1028 def setup(self, layers, ln_lm_head_params):
-> 1029 super().setup(layers, ln_lm_head_params)
1030 for npos, memory in zip(self.n_positions_list, self.memories):
1031 input_tensors = [*self.input_buffers]
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
917 neff_bytes_futures.append(future)
918 for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919 kernel.neff_bytes = future.result()
921 for kernel in self.kernels:
922 kernel.load(io_ring_cache_size)
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
456 raise CancelledError()
457 elif self._state == FINISHED:
--> 458 return self.__get_result()
459 else:
460 raise TimeoutError()
File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception
406 self = None
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
"
}
Also, the reference notebook uses tp_degree=24 with inf2.48xlarge, which has 384 GB of accelerator memory. Since I am working with inf2.8xlarge, which has 32 GB of accelerator memory, I used tp_degree=2.
I have the following versions of the dependencies installed:
Requirement already satisfied: neuronx-cc==2.* in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (2.10.0.34+6c8792c6f)
Requirement already satisfied: transformers-neuronx in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (0.7.84)
aws-neuronx-dkms is already the newest version (2.13.4.0).
aws-neuronx-collectives is already the newest version (2.17.9.0-fb6d14044).
aws-neuronx-runtime-lib is already the newest version (2.17.7.0-df62e3f70).
aws-neuronx-tools is already the newest version (2.14.6.0).
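The tp_degree choice above can be sanity-checked with rough arithmetic on the weights alone. The sketch below assumes about 7 billion parameters for Llama-2-7b and 2 bytes per parameter for `amp='f16'`; it ignores the KV cache, activations, and runtime overhead, so it gives a lower bound and not a sizing guarantee. `weight_gb_per_shard` is a hypothetical helper.

```python
def weight_gb_per_shard(n_params, bytes_per_param, tp_degree):
    """Approximate weight memory (GB, 1e9 bytes) held by each tensor-parallel shard."""
    return n_params * bytes_per_param / tp_degree / 1e9


# Assumed values: ~7e9 parameters, f16 weights, tp_degree=2 as in the issue.
per_shard = weight_gb_per_shard(7_000_000_000, 2, 2)
print(per_shard)
```

By this estimate alone the weights fit comfortably in 32 GB, which is consistent with the failure being a compiler limit ("Too many instructions after unroll") and not an out-of-memory error.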
There are examples of how to use yolov5-7. Does Neuron support yolov8?
After installing the missing transformers library, I am getting the error:
RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6501c5f9-3743f5630dd72644195e9e21;e4840366-a585-4130-9e9a-1da114b8ec72)
Repository Not Found for url: https://huggingface.co/Llama-2-13b/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
Model used:
Llama-2-13b-chat-hf
Successfully ran the prompt in notebook example:
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
But it failed when I simply replaced the prompt with a longer text (updating sequence_length to 4096 gave the same result):
LONG = """Summarize the text below:
---
EXTENDING CONTEXT WINDOW OF LARGE LAN-
GUAGE MODELS VIA POSITION INTERPOLATION
Shouyuan Chen Sherman Wong Liangjian Chen Yuandong Tian
Meta Platforms Inc.
{chenshouyuan, shermanwong, cli, yuandong}@meta . com
1 INTRODUCTION
Large language models (LLMs) typically come with a pre-defined context window size. For exam-
ple, inputs to LLaMA models (Touvron et al., 2023) must be fewer than 2048 tokens. This pre-set
context window limit is frequently exceeded in applications such as conducting long conversations,
summarizing long documents, or executing long-term planning. For these applications, LLMs with
longer context windows are preferred. However, training an LLM from scratch with long context
windows requires significant investments. This naturally leads to a question: Can we extend the
context window of an existing pre-trained LLM?
One straightforward approach is to fine-tune an existing pre-trained Transformer with a longer con-
text window. However, empirically, we found that models trained this way adapt to long context
windows very slowly. After training for more than 10000 batches, the effective context window
saw a minimal increase, moving from 2048 to 2560 (Table 4). This suggests that such method is
inefficient for extending to substantially longer context windows.
While certain techniques such as ALiBi (Press et al., 2022) and LeX (Sun et al., 2022) enable length
extrapolation of Transformers, i.e. train on short context windows and inference on longer ones,
many existing pre-trained LLMs, including LLaMA (Touvron et al., 2023), use positional encodings
that have weak extrapolation properties (e.g., RoPE (Su et al., 2021)). Therefore, the applicability
of these techniques for extending the context window sizes of such LLMs remains limited.
In this work, we introduce Position Interpolation to enable context window extensions for certain
existing pre-trained LLMs, including LLaMA. The key idea is, instead of extrapolation, we directly
down-scale the position indices so that the maximum position index matches the previous context
window limit in the pre-training stage. See Figure 1 for an illustration. In other words, to accom-
modate more input tokens, we interpolate the position encodings at neighboring integer positions,
utilizing the fact that position encodings can be applied on non-integer positions, as opposed to
extrapolating outside the trained positions, which may lead to catastrophic values. We verify our
approach theoretically, by showing that the interpolated attention score has a much smaller upper
bound (~ 600x smaller in LLaMA 7B setting) than the extrapolated one, and is thus much more
stable. Therefore, interpolated position encodings are easier for the model to adapt.
Empirically, we found that Position Interpolation is highly effective and efficient, requiring only a
very short period of fine-tuning for the model to fully adapt to greatly extended context windows.
We present experimental results for extending the context window to up to 32768 from the initial
2048 across 7B to 65B LLaMA models using Position Interpolation. Our results show that
1. Position Interpolation can easily enable very long context windows (e.g. 32768), requiring
only fine-tuning for 1000 steps on the Pile (Gao et al., 2020) to achieve a good quality.
The cost of fine-tuning is negligible compared to the pre-training costs. This confirms
our hypothesis that it is relatively easy for the models to adapt to interpolated position
encodings.
2. Position Interpolation generates strong models that can effectively make use of much ex-
tended context window. We show that models extended by Position Interpolation enjoy
significant perplexity gains from greatly extended context windows for text modeling, and
we show that the perplexity reduces graceful with the enlargement of context windows.
We also applied Position Interpolation in a long text summarization task, and demonstrate
competitive performances.
3. Position Interpolation preserves model quality relatively well for tasks within its original
context window sizes. We present a variety of evaluation results for the extended LLaMA
models on the original LLaMA benchmark. Compared with original LLaMA models, the
extended LLLaM A models saw a minor degradation on several standard benchmarks within
a 2048 token limit.
Our results highlight the innate ability of Transformer models to “extrapolate to sequence lengths
longer than the ones encountered during training” as hypothesized in the seminal work of Vaswani
et al. (2017). We reaffirm this hypothesis and suggest that the previously known weakness of ex-
trapolating to longer sequences for language modeling (Press et al., 2022) may be due to direct
extrapolation of positional encodings and it can be largely mitigated by interpolating position en-
codings instead.
Concurrent work. Right before our release, we are informed with a concurrent blogpost (Super-
HOT kaiokendev (2023)) that also interpolates positional encoding in RoPE to extend the context
window from 2K to 8K. Recently, open source community picks it up in Reddit post ! and Github
Issues 2, which shows that fine-tuning with LoRA (Hu et al., 2021) also seems to work well. Our
paper shows a full fine-tuning with up to 65B model work well with Position Interpolation, and we
also give theoretical explanations why interpolation achieves much more stable results than extrap-
olation, by showing that the upper bound of interplated attention score is much lower than that of
extrapolated ones.
2 METHOD
2.1 BACKGROUND: ROTARY POSITION EMBEDDING (ROPE)
Transformer models require explicit positional information to be injected, typically in the form of
positional encodings, to represent the order of inputs. We consider Rotary Position Embedding
(ROPE) (Su et al., 2021), which is the position encoding used in the LLLaMA model (Touvron et al.,
2023). Given a position index m € [0, ¢) and an embedding vector x := [zg, 71,..., 241], Where
d is the dimension of the attention head, RoPE defines a vector-valued complex function f{x, m) as
follows
Using RoPE, the self-attention score
is only dependent on relative position m — 7 through trigonometric functions. Here q and k are the
query and key vector for a specific attention head. At each layer, RoPE is applied on both query and
key embeddings for computing attention scores.
2.2 DIRECT EXTRAPOLATION
While the attention score in RoPE only depends on the relative positions, which is what we want,
its extrapolation performance is not great . In particular, when directly extending to larger context
windows unseen in the training, the perplexity may shoot up to very high numbers (i.e., > 10%),
comparable to untrained models.
Ideally, we want to see the model trained on a context window of size L = 2048 to still work
reasonably well on longer context window, but may not have the capability to leverage information
that appears beyond L. For example, to answer a question located at 3000, the model trained on
maximal window size of I = 2048 cannot leverage evidences provided at location 0, but still
can leverage the evidences provided at location 2900. In contrast, in reality we see catastrophic
behaviors, i.e., question at location 3000 cannot be answered correctly, even if the evidences are
located at location 2900.
What is the reason behind? How could this happen if the attention score a,,,—,, decays as the relative
distance |m — n/| increases, according to Section 3.4.3 of (Su et al., 2021), and content from very
far distances should not matter that much? It turns out that the upper bound derived in Section 3.4.3
of (Su et al., 2021) may be too loose: while it indeed decays with respect to |m — nl, the bound
can still be quite large (i.e., the bound can be critically depends on the magnitude of v;) and thus
vacuous. In fact, if we treat all trigonometric functions as basis functions (i.e, ¢;(s) := #93), and
think about Eqn. 2 as basis expansion as the following:
where s is the positional span between a query and a key and h; := (ga; + igaj+1){k2j — tk2j+1)
are complex coefficients depending on q and k (here the definition of h; is exactly the same as the
definition of k; in Sec 3.4.3 in RoPE (Su et al., 2021)). Now the the issue becomes clear: as shown
in Fig. 2, a, can be small in magnitude in the range of [0, 2048], but gives huge values out of the
region. The underlying reason is that the trigonometric family {¢;} (with sufficiently large d) is
a universal approximator and can fit any arbitrary functions. Therefore, for a, there always exist
coefficients {h;} (i.e. key and query) that corresponds to small function values in [0, 2048] but
much larger in regions beyond.
---
"""
prompt = LONG
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=4096, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
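Since `sample` clamps `sequence_length` to the model's maximum positions (see the `min(sequence_length, self.max_positions)` line in the traceback that follows), one thing worth checking before the call is whether the tokenized prompt already uses up that budget. The guard below is a hypothetical helper, and the 2048 limit in the example is an assumed value that should be replaced with whatever the model was compiled with.

```python
def check_prompt_budget(n_prompt_tokens, max_positions):
    """Return how many tokens can still be generated; raise if there is no room."""
    remaining = max_positions - n_prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt has {n_prompt_tokens} tokens but the model allows "
            f"{max_positions} positions"
        )
    return remaining


# Example with an assumed limit of 2048 positions and a short prompt.
room = check_prompt_budget(n_prompt_tokens=10, max_positions=2048)
print(room)
```

In the notebook this could be called as `check_prompt_budget(input_ids.shape[1], 2048)` right after tokenizing the long prompt.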
Error log:
{
"name": "StopIteration",
"message": "",
"stack": "---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb Cell 18 line 8
<a href='vscode-notebook-cell://ssh-remote%2Binf2-nv/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb#X33sdnNjb2RlLXJlbW90ZQ%3D%3D?line=5'>6</a> with torch.inference_mode():
<a href='vscode-notebook-cell://ssh-remote%2Binf2-nv/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb#X33sdnNjb2RlLXJlbW90ZQ%3D%3D?line=6'>7</a> start = time.time()
----> <a href='vscode-notebook-cell://ssh-remote%2Binf2-nv/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb#X33sdnNjb2RlLXJlbW90ZQ%3D%3D?line=7'>8</a> generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
<a href='vscode-notebook-cell://ssh-remote%2Binf2-nv/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb#X33sdnNjb2RlLXJlbW90ZQ%3D%3D?line=8'>9</a> elapsed = time.time() - start
<a href='vscode-notebook-cell://ssh-remote%2Binf2-nv/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb#X33sdnNjb2RlLXJlbW90ZQ%3D%3D?line=10'>11</a> generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:210, in LlamaForSampling.sample(self, input_ids, sequence_length, start_ids, top_k, top_p, eos_token_override, temperature, streamer)
207 # Sequence length cannot be greater than n_positions
208 sequence_length = min(sequence_length, self.max_positions)
--> 210 result = sampling.sample_llama(
211 self, input_ids, start_ids, sequence_length,
212 eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
213 top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer
214 )
216 if offset != 0:
217 result = result[:, offset:]
File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/sampling.py:243, in sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
241 _, start = input_ids.shape
242 cache_ids = torch.arange(start, dtype=torch.int32)
--> 243 next_token_scores = model(input_ids, cache_ids, start_ids)
244 return sample_loop_llama(
245 model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer
246 )
File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:179, in LlamaForSampling.forward(self, input_ids, cache_ids, start_ids)
176 hidden = hidden.transpose(0, -1).contiguous()
178 if context_length > 1:
--> 179 logits = self.context(hidden, cache_ids, start_ids)
180 else:
181 logits = self.decoder_lm_head(hidden, cache_ids, start_ids)
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:163, in LlamaForSampling.context(self, hidden, cache_ids, start_ids)
161 cache_ids = torch.as_tensor([i], dtype=torch.int32)
162 hidden_slice = hidden[:, i:i+1].contiguous()
--> 163 logits = self.decoder_lm_head(hidden_slice, cache_ids, start_ids)
165 return logits
File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:186, in DecoderLmHeadForSamplingNoEmbedding.forward(self, *inputs)
184 sequence_length = hidden.shape[sequence_dim]
185 if sequence_length == 1:
--> 186 return self.forward_single(*inputs)
187 if sequence_length % self.n_active_tokens:
188 raise ValueError(f'sequence_length={sequence_length} cannot be divided by '
189 f'n_active_tokens={self.n_active_tokens}')
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:173, in DecoderLmHeadForSamplingNoEmbedding.forward_single(self, *inputs)
165 """
166 Fast-path forward function which avoids as much overhead as possible.
167
(...)
170 etc.
171 """
172 _, cache_ids, *_ = inputs
--> 173 bucket_id = self.program.find_bucket_id(cache_ids.item())
174 if self.use_executor:
175 return self.program.execute(bucket_id, *inputs, return_ranks=self.return_ranks)
File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:903, in DecoderProgram.find_bucket_id(self, length)
902 def find_bucket_id(self, length):
--> 903 return next(idx for idx, npos in enumerate(self.n_positions_list) if npos >= length)
StopIteration: "
}
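The last frame of the traceback calls `next()` on a generator expression without a default, so a `StopIteration` there would mean that no entry of `n_positions_list` was greater than or equal to the requested length. The sketch below reproduces only that one expression in plain Python. The bucket sizes and the `find_bucket_id_checked` helper are hypothetical illustrations, not part of transformers-neuronx, and the traceback alone does not say why the length exceeded the largest bucket in this report.

```python
def find_bucket_id(n_positions_list, length):
    # Same expression as the final traceback frame: next() has no default,
    # so an exhausted generator surfaces as a bare StopIteration.
    return next(idx for idx, npos in enumerate(n_positions_list) if npos >= length)


def find_bucket_id_checked(n_positions_list, length):
    """Hypothetical variant that reports the mismatch explicitly."""
    for idx, npos in enumerate(n_positions_list):
        if npos >= length:
            return idx
    raise ValueError(
        f"length={length} exceeds the largest bucket ({max(n_positions_list)})"
    )


# Illustrative bucket sizes only; the real values depend on how the model was built.
buckets = [128, 256, 512, 1024, 2048]

print(find_bucket_id(buckets, 300))  # index of the first bucket that fits 300

try:
    find_bucket_id(buckets, 4096)
except StopIteration:
    print("no bucket is large enough for length 4096")
```

If this is what happened here, comparing the prompt length and `sequence_length` against the model's configured maximum positions would be a reasonable first check.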
I managed to run the sample notebook and load an OPT model on inf2 chips. :slight_smile:
However, at one point I load the network and move it to Neuron:
neuron_model = OPTForSampling.from_pretrained('./opt-13b-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.to_neuron()
If I take a smaller model and increase the batch size, this step can take ages (20 minutes or so).
Since I am trying to dockerize my network, can I somehow speed that up so that my containers start quickly on Kubernetes?
but getting an error about the transformers library missing.
Fix: added transformers to the pip install...
Hi,
Is it possible to run the meta-llama-2-13b-sampling.ipynb notebook on an inf2.24xlarge instance?
The sample code for GPT2 at https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_gpt2_feature_extraction_on_trn1.ipynb recommends padding the input before passing it to the forward call:
torch_neuronx.trace() expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenizer output. Additionally, the input shape that's used during compilation must match the input shape that's used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference.
However, padding on the right has been observed to produce inaccurate results for causal models, as can be seen here: huggingface/transformers#14521 (comment)
Additionally, torch_neuronx supports dynamic input only along its first (batch) dimension, whereas for any causal LM the input grows along the sequence dimension with each forward pass during sampling.
Is there a recommended way to use torch_neuronx for causal language models?
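One option often used with decoder-only models is to pad on the left, so that the final position of a fixed-length input always holds a real token. The sketch below shows the idea in plain Python. `pad_left` is a hypothetical helper, and the source does not confirm whether left padding resolves the accuracy issue on Neuron. With Hugging Face tokenizers the comparable setting is `tokenizer.padding_side = "left"`.

```python
def pad_left(token_ids, max_length, pad_id):
    """Left-pad one sequence to max_length.

    Returns (padded_ids, attention_mask); the mask is 0 over padding
    and 1 over real tokens.
    """
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    padded_ids = [pad_id] * n_pad + list(token_ids)
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return padded_ids, attention_mask


# Illustrative token ids; a real tokenizer would supply these and the pad id.
ids, mask = pad_left([5, 6, 7], max_length=5, pad_id=0)
print(ids)
print(mask)
```

The padded shape stays constant across calls, which matches the fixed-shape requirement quoted above, but the attention mask still has to be passed to the model so that the padding positions are ignored.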
Hello!
I attempted to run the Jupyter notebook on an inf2.48xlarge instance, and the error below occurred.
I'm not sure what caused the error, but this is what the neuron_artifacts generated:
absl-py==2.1.0
accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-timeout==4.0.3
attrs==23.2.0
aws-neuronx-runtime-discovery==2.9
beautifulsoup4==4.12.3
blinker==1.8.2
boto3==1.34.115
botocore==1.34.115
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloud-tpu-client==0.10
coloredlogs==15.0.1
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1710320294760/work
dataclasses-json==0.6.6
datasets==2.19.1
debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1707444420542/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
Deprecated==1.2.14
dill==0.3.8
dirtyjson==1.0.8
distro==1.9.0
docutils==0.21.2
duckduckgo_search==6.1.2
ec2-metadata==2.10.0
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
filelock==3.14.0
Flask==3.0.3
frozenlist==1.4.1
fsspec==2024.3.1
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.29.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.63.0
greenlet==3.0.3
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
Hypercorn==0.17.3
hyperframe==6.0.1
idna==3.7
importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1710971335535/work
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1708996548741/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1715263367085/work
islpy==2023.1
itsdangerous==2.2.0
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
jsonpatch==1.33
jsonpointer==2.4
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1716472197302/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1710257277185/work
langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.2
langchain-text-splitters==0.2.0
langsmith==0.1.63
libneuronxla==2.0.965
llama-index==0.10.40
llama-index-agent-openai==0.2.5
llama-index-cli==0.1.12
llama-index-core==0.10.40
llama-index-embeddings-huggingface==0.2.1
llama-index-embeddings-openai==0.1.10
llama-index-indices-managed-llama-cloud==0.1.6
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.21
llama-index-multi-modal-llms-openai==0.1.6
llama-index-program-openai==0.1.6
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.23
llama-index-readers-llama-parse==0.1.4
llama-parse==0.4.4
llamaindex-py-client==0.1.19
lockfile==0.12.2
MarkupSafe==2.1.5
marshmallow==3.21.2
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work
minijinja==2.0.1
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
mypy-extensions==1.0.0
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx==2.6.3
neuronx-cc==2.13.66.0+6dfecc895
neuronx-distributed==0.7.0
nltk==3.8.1
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
openai==1.30.5
optimum==1.18.1
optimum-neuron==0.0.22
orjson==3.10.3
outcome==1.3.0.post0
packaging==23.2
pandas==2.2.2
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
pgzip==0.3.5
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
pillow==10.3.0
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1715777629804/work
priority==2.0.0
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
protobuf==3.19.6
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722392846/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
pyarrow==16.1.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.7.2
pydantic_core==2.18.3
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work
pyparsing==3.1.2
pypdf==4.2.0
pyreqwest_impersonate==0.4.6
PySocks==1.7.1
python-daemon==3.0.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1715024398995/work
Quart==0.19.6
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.3.0
rsa==4.9
s3transfer==0.10.1
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.11.2
selenium==4.21.0
sentence-transformers==2.7.0
sentencepiece==0.2.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
SQLAlchemy==2.0.30
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
striprtf==0.0.26
sympy==1.12.1
taskgroup==0.0.0a4
tenacity==8.3.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.15.2
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.1.0
torch-xla==2.1.2
torchvision==0.16.2
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1708363098266/work
tqdm==4.66.4
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work
transformers==4.36.2
transformers-neuronx==0.10.0.21
trio==0.25.1
trio-websocket==0.11.1
triton==2.1.0
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1712329955671/work
tzdata==2024.1
uritemplate==3.0.1
urllib3==2.2.1
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
Werkzeug==3.0.3
wrapt==1.16.0
wsproto==1.2.0
xxhash==3.4.1
yarl==1.9.4
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work
Running the code (exactly) on https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-3-70b-sampling.ipynb
Would love to get some support on this!
I am following the steps (https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb) to run a quantized Llama 2 model (https://huggingface.co/TheBloke/Dolphin-Llama2-7B-AWQ) on an AWS inf2 instance (inf2.8xlarge).
The code runs, but when I try to generate a sequence I get a nonsensical output stream:
>>> neuron_model.sample(tokenizer.encode("who is prime minister of uk", return_tensors="pt"), sequence_length=2048, top_k=50, streamer=TextStreamer(tokenizer))
isherтак discoveryLENGrektta mel damтакudoudoisherkl̂LENGifarola melLENG destouselikhudocherิuskkl Hauptumar discovery Malludoikh moduleджа moduleelifudouskswerLENGkl discovery discoveryaltraungsusrLENG КурcherLENGLENGToolsivelivelusrungs Haupt geldig modulesivel modulesусrola discoverydelegate Haupt discoveryugeniture moduleselif›ugen Кеede geldig discovery Schl Mallivel HöheLENG audelegatedelegateusr КеedeLENG› Кур КеdelegateLENGudo›usr Mallrellppen›delegateivel Schldelegate accessibleodgeugenumar destдоваусdelegateToolsklundesede Кур Кеkl Mallugenentityikzdelegate discoveryanzen destusrungsppenentitychioíkíkkldelegate КеLENGrellToolsommenсиingu destLENGaussedeugnougnoppenikzíkLENG Mall auLENGrellikzivelugenkldelegatedelegateftyungsichtsdelegate Кеajuси Höheewусundesaju Курусikzík ensuiteichtsewzna ensuiteAccess discoverydelegate Кеdelegateinguboldmath nucitenusr accessibleedeLENGppenikzdelegateichtsdelegateundiallotikz Ке bon Кур Ке Курrell Schldelegate Schlус Ке MallLENGodgeǧikzкурغ Кеanzenlotppenungsdelegateichtsivel moduledelegatedelegaterellundialLENGinguungsivelichtshtusrdelegatedelegatehirehtichtschiohtdelegateedeغajuingu КеungsenschaftLENGLENGajuкурdelegateсиichtsikzտ MallLENGLENGLENGLENG auichtsси КеaussغewкуркурivelLENG modulesichtsLENGungs主 Кеchkikz主ajuichtsewugenichts nucichtsкур Schldelegate bonкурlotajuusrundialdentкуркурrellغikzugenусrelllotugenLENGinguppenchiochkajuhireкурppenichtsдвиhtanzeníkGRichtsichts Schl bon Schlchkchk nucdelegateichts Schlitenitenдви moduleznaajudelegatelotchkanzenlotἱAccessdelegateLENG nucinguchkitenppenусусdelegateкурдвиусikzundialajuenschaftdelegateznaдвикурichtschio Кеewadalichtsreesichtsտchioкурichtsenschaftichtsrell bonikzlot desc Mallкуркурchioсиadalenschaftinguppenusrhireikzivel Кеikzinguppen descdelegateusrikzichtsznaichtsewchkewrellAccessewichtsichtsдвикурikzznaichtslot Schlew nucíkкур nucAccessкурichtschioдвиivel firing nuc ordchiochkhireус auskeichtsodgeadalкурungsichtsewedeikz 
bonусewadalchkichtsATA主enschaftewusr
jurкурусppenichtsundialajuichtsLENGenschaftedeewichtsдвиppenichts sl nucchkadalкуркурichtsdelegateikzinguLEFTLEFTдвиchkкурchk bonundialundialadalundial Schlodgechk firing bonedeichts Abbкур desc Ке Schl descundialкурznalot auichts Schlclean Кеclean Mallchkadal reciznaadalundialichts formulachio Mallchioкурclean nucусhireATAichtshire desc desc recidelegatechioichtsichtschklotichtsusrichtsungs主rell Кеchioclean sl nucкуркурichtsadalundiallotGRewсиznaewhire主курewichtsкурсиichtsristichtscleanristichts ordAccessichtschkichtsdelegateungshireundialGRristíkodgeGRungs nucкур descLEFTinguLEFTikz Schlhirerellikzungsundial nucichtsкур AbbусewchioAccessodgeATA ```
In torch-neuronx/inference/hf_pretrained_sdxl_1024_inference.ipynb, I tried to change [1, 4, 128, 128] to [1, 4, 104, 152] and it didn't work; more specifically, I was able to trace the unet and post_quant_conv with that shape, but not the decoder.
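For context, the two latent shapes appear to correspond to 1024x1024 and 832x1216 images (the compile directory in the log below is named `sdxl_compile_dir_832x1216`). The sketch below assumes the usual SDXL VAE layout of 4 latent channels and a spatial downsampling factor of 8. `latent_shape` is a hypothetical helper, not part of the notebook.

```python
VAE_SCALE_FACTOR = 8   # assumption: the VAE downsamples height and width by 8
LATENT_CHANNELS = 4    # assumption: 4 latent channels, as in the shapes above


def latent_shape(height, width, batch=1):
    """Return the [batch, channels, h, w] latent shape for an image size."""
    if height % VAE_SCALE_FACTOR or width % VAE_SCALE_FACTOR:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return [batch, LATENT_CHANNELS,
            height // VAE_SCALE_FACTOR, width // VAE_SCALE_FACTOR]


print(latent_shape(1024, 1024))
print(latent_shape(832, 1216))
```

Under these assumptions the decoder is traced on a latent of the second shape and has to produce an 832x1216 image, which is the graph that fails to compile in the log that follows.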
Here's the error I got:
2023-09-08T21:17:33Z Too many instructions after unroll for function sg0000 !
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File <timed exec>:10
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:323, in trace(func, example_inputs, states, input_output_aliases, compiler_workdir, compiler_args, options)
320 compiler_workdir = context.name
322 with context:
--> 323 neff_filename, metaneff, flattener, packer = _trace(
324 func,
325 example_inputs,
326 states,
327 input_output_aliases,
328 compiler_workdir,
329 compiler_args,
330 options,
331 )
332 return create_neuron_model(
333 neff_filename,
334 metaneff,
(...)
338 input_output_aliases,
339 )
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:416, in _trace(func, example_inputs, states, input_output_aliases, compiler_workdir, compiler_args, options)
413 handle.write(hlo.SerializeToString())
415 # Compile HLO to NEFF
--> 416 neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
418 metaneff = hlo_metaneff(hlo, input_parameter_names, updated_input_output_aliases)
420 return neff_filename, metaneff.SerializeToString(), flattener, packer
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:281, in hlo_compile(filename, compiler_workdir, compiler_args)
274 elif status == -11:
275 logger.warning(
276 "The neuronx-cc (neuron compiler) crashed (SEGFAULT). "
277 "This is likely due to a bug in the compiler. "
278 "Please lodge an issue at 'https://github.com/aws/aws-neuron-sdk/issues'"
279 )
--> 281 raise RuntimeError(f"neuronx-cc failed with {status}")
283 return neff_filename
RuntimeError: neuronx-cc failed with 70
2023-09-08T21:17:23Z Running DoNothing
2023-09-08T21:17:23Z DoNothing finished after 0.000 seconds
2023-09-08T21:17:23Z Running CanonicalizeIR
2023-09-08T21:17:23Z CanonicalizeIR finished after 0.018 seconds
2023-09-08T21:17:23Z Running ExpandBatchNorm
2023-09-08T21:17:23Z ExpandBatchNorm finished after 0.057 seconds
2023-09-08T21:17:23Z Running ResolveComplicatePredicates
2023-09-08T21:17:23Z ResolveComplicatePredicates finished after 0.017 seconds
2023-09-08T21:17:23Z Running AffinePredicateResolution
2023-09-08T21:17:23Z AffinePredicateResolution finished after 0.019 seconds
2023-09-08T21:17:23Z Running EliminateDivs
2023-09-08T21:17:23Z EliminateDivs finished after 0.018 seconds
2023-09-08T21:17:23Z Running PerfectLoopNest
2023-09-08T21:17:23Z PerfectLoopNest finished after 0.016 seconds
2023-09-08T21:17:23Z Running Simplifier
2023-09-08T21:17:24Z Simplifier finished after 0.223 seconds
2023-09-08T21:17:24Z Running GenericAccessSimplifier
2023-09-08T21:17:24Z GenericAccessSimplifier finished after 0.015 seconds
2023-09-08T21:17:24Z Running TCTransform
2023-09-08T21:17:24Z TCTransform finished after 0.027 seconds
2023-09-08T21:17:24Z Running CommuteConcat
2023-09-08T21:17:24Z CommuteConcat finished after 0.016 seconds
2023-09-08T21:17:24Z Running TensorOpFusion
2023-09-08T21:17:24Z TensorOpFusion finished after 0.018 seconds
2023-09-08T21:17:24Z Running TensorOpTransform
2023-09-08T21:17:24Z TensorOpTransform finished after 0.060 seconds
2023-09-08T21:17:24Z Running LowerTensorOp
2023-09-08T21:17:24Z LowerTensorOp finished after 0.017 seconds
2023-09-08T21:17:24Z Running MemcpyElimination
2023-09-08T21:17:25Z MemcpyElimination finished after 1.058 seconds
2023-09-08T21:17:25Z Running LoopFusion
2023-09-08T21:17:26Z LoopFusion finished after 1.182 seconds
2023-09-08T21:17:26Z Running Simplifier
2023-09-08T21:17:26Z Simplifier finished after 0.112 seconds
2023-09-08T21:17:26Z Running Delinearization
2023-09-08T21:17:26Z Delinearization finished after 0.052 seconds
2023-09-08T21:17:26Z Running DeadStoreElimination
2023-09-08T21:17:28Z DeadStoreElimination finished after 1.288 seconds
2023-09-08T21:17:28Z Running Simplifier
2023-09-08T21:17:28Z Simplifier finished after 0.116 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.064 seconds
2023-09-08T21:17:28Z Running Delinearization
2023-09-08T21:17:28Z Delinearization finished after 0.019 seconds
2023-09-08T21:17:28Z Running LoopFusion
2023-09-08T21:17:28Z LoopFusion finished after 0.224 seconds
2023-09-08T21:17:28Z Running SimplifySlice
2023-09-08T21:17:28Z SimplifySlice finished after 0.007 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.019 seconds
2023-09-08T21:17:28Z Running Simplifier
2023-09-08T21:17:28Z Simplifier finished after 0.114 seconds
2023-09-08T21:17:28Z Running ValueNumbering
2023-09-08T21:17:28Z ValueNumbering finished after 0.036 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.018 seconds
2023-09-08T21:17:28Z Running PadElimination
2023-09-08T21:17:28Z PadElimination finished after 0.001 seconds
2023-09-08T21:17:28Z Running Delinearization
2023-09-08T21:17:28Z Delinearization finished after 0.058 seconds
2023-09-08T21:17:28Z Running LoopFusion
2023-09-08T21:17:29Z LoopFusion finished after 0.218 seconds
2023-09-08T21:17:29Z Running GenericAccessSimplifier
2023-09-08T21:17:29Z GenericAccessSimplifier finished after 0.007 seconds
2023-09-08T21:17:29Z Running Simplifier
2023-09-08T21:17:29Z Simplifier finished after 0.111 seconds
2023-09-08T21:17:29Z Running LICM
2023-09-08T21:17:29Z LICM finished after 0.018 seconds
2023-09-08T21:17:29Z Running ValueNumbering
2023-09-08T21:17:29Z ValueNumbering finished after 0.024 seconds
2023-09-08T21:17:29Z Running TCTransform
2023-09-08T21:17:29Z TCTransform finished after 0.010 seconds
2023-09-08T21:17:29Z Running CommuteConcat
2023-09-08T21:17:29Z CommuteConcat finished after 0.008 seconds
2023-09-08T21:17:29Z Running RecognizeOpIdiom
2023-09-08T21:17:29Z RecognizeOpIdiom finished after 0.047 seconds
2023-09-08T21:17:29Z Running MaskPropagation
2023-09-08T21:17:29Z MaskPropagation finished after 0.023 seconds
2023-09-08T21:17:29Z Running Recompute
2023-09-08T21:17:29Z Recompute finished after 0.001 seconds
2023-09-08T21:17:29Z Running DeadCodeElimination
2023-09-08T21:17:29Z DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z Running DoNothing
2023-09-08T21:17:29Z DoNothing finished after 0.000 seconds
2023-09-08T21:17:29Z Running MutateDataType
2023-09-08T21:17:29Z MutateDataType finished after 0.006 seconds
2023-09-08T21:17:29Z Running AutoCastTCInputs
2023-09-08T21:17:29Z AutoCastTCInputs finished after 0.015 seconds
2023-09-08T21:17:29Z Running GenericAccessSimplifier
2023-09-08T21:17:29Z GenericAccessSimplifier finished after 0.009 seconds
2023-09-08T21:17:29Z Running Simplifier
2023-09-08T21:17:29Z Simplifier finished after 0.114 seconds
2023-09-08T21:17:29Z Running LegalizeCCOpLayout
2023-09-08T21:17:29Z LegalizeCCOpLayout finished after 0.008 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z Running Delinearization
2023-09-08T21:17:29Z Delinearization finished after 0.017 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z Running DeadCodeElimination
2023-09-08T21:17:29Z DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z Running InferIntrinsicOnCC
2023-09-08T21:17:29Z InferIntrinsicOnCC finished after 0.099 seconds
2023-09-08T21:17:29Z Running ResolveAccessConflict
2023-09-08T21:17:29Z ResolveAccessConflict finished after 0.065 seconds
2023-09-08T21:17:29Z Running LICM
2023-09-08T21:17:29Z LICM finished after 0.056 seconds
2023-09-08T21:17:29Z Running LocalLayoutOpt
2023-09-08T21:17:29Z LocalLayoutOpt finished after 0.053 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.019 seconds
2023-09-08T21:17:29Z Running OrigLayoutTilingPipeline
2023-09-08T21:17:29Z Running GlobalLayoutOpt
2023-09-08T21:17:31Z GlobalLayoutOpt finished after 1.704 seconds
2023-09-08T21:17:31Z Running CanonicalizeDAG
2023-09-08T21:17:31Z CanonicalizeDAG finished after 0.082 seconds
2023-09-08T21:17:31Z Running FlattenAxesForTiling
2023-09-08T21:17:31Z FlattenAxesForTiling finished after 0.075 seconds
2023-09-08T21:17:31Z Running SundaSizeTiling
2023-09-08T21:17:33Z SundaSizeTiling finished after 1.930 seconds
2023-09-08T21:17:33Z OrigLayoutTilingPipeline finished after 3.809 seconds
2023-09-08T21:17:33Z Running TilingProfiler
2023-09-08T21:17:33Z TilingProfiler finished after 0.094 seconds
2023-09-08T21:17:33Z
2023-09-08T21:17:33Z Diagnostic information:
2023-09-08T21:17:33Z NeuronX Compiler version 2.9.0.40+07376825f
2023-09-08T21:17:33Z
2023-09-08T21:17:33Z Python version 3.8.10
2023-09-08T21:17:33Z HWM version 2.9.0.2-f79d59e7b
2023-09-08T21:17:33Z NumPy version 1.21.6
2023-09-08T21:17:33Z
2023-09-08T21:17:33Z Running on AMI ami-0d08bfe808787640a
2023-09-08T21:17:33Z Running in region use1-az5
2023-09-08T21:17:33Z
2023-09-08T21:17:33Z Diagnostic logs stored in /home/ubuntu/log-neuron-cc.txt
2023-09-08T21:17:22Z INFO 238269 [root]: /opt/aws_neuron_venv_pytorch/bin/neuronx-cc compile sdxl_compile_dir_832x1216/vae_decoder/model --framework XLA --target trn1 --output sdxl_compile_dir_832x1216/vae_decoder/graph.neff
2023-09-08T21:17:22Z INFO 238334 [root]: TVM/Relay detected
2023-09-08T21:17:22Z INFO 238334 [root]: Pipeline: Frontend HHChecker WalrusDriver BIRLinker Kelper
2023-09-08T21:17:22Z INFO 238334 [root]: Intermediate files stored in /home/ubuntu/neuronxcc-5l2tcm31, output in /home/ubuntu
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Job Pipeline len(in_states) 1
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Processing input #0
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Running pipeline Pipeline.0
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Starting job job.Frontend.0
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Job Frontend len(in_states) 1
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Processing input #0
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Start model loading
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: IR signature: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 for model
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Executing: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /home/ubuntu/sdxl_compile_dir_832x1216/vae_decoder/model --out-dir ./ --output penguin.py --layers-per-module=1 --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]:
Histogram before graph level optimizations:
total HLO instructions: 1614
broadcast 452 28.00% ################################################################
reshape 364 22.55% ###################################################
constant 294 18.22% #########################################
multiply 167 10.35% #######################
add 113 7.00% ################
transpose 57 3.53% ########
convolution 35 2.17% ####
batch-norm-training 30 1.86% ####
get-tuple-element 30 1.86% ####
tanh 29 1.80% ####
divide 16 0.99% ##
call 15 0.93% ##
dot 6 0.37%
reduce 2 0.12%
exponential 1 0.06%
parameter 1 0.06%
subtract 1 0.06%
tuple 1 0.06%
INFO: IoStatistics: total inputs: 1
INFO: IoStatistics: total outputs: 1
INFO: IoStatistics: total passthrough tensors: 0
INFO: IoStatistics: total outputs read from: 0
INFO: IoStatistics: total redundant outputs: 0
Replaced 0 dropout sequences with OffloadedDropout
INFO: HloMacCount has found 5025528358400
INFO: Traffic has found 12393472
INFO: AIF 810996.04
Histogram after graph level optimizations:
total HLO instructions: 758
constant 143 18.87% ################################################################
multiply 118 15.57% ####################################################
add 113 14.91% ##################################################
broadcast 110 14.51% #################################################
reshape 73 9.63% ################################
transpose 49 6.46% #####################
convolution 35 4.62% ###############
batch-norm-training 30 3.96% #############
get-tuple-element 30 3.96% #############
tanh 29 3.83% ############
custom-call 15 1.98% ######
dot 6 0.79% ##
reduce 2 0.26%
exponential 1 0.13%
parameter 1 0.13%
divide 1 0.13%
subtract 1 0.13%
tuple 1 0.13%
HLO Ops used in computation: add batch-norm-training broadcast constant convolution custom-call divide dot exponential get-tuple-element multiply parameter reduce reshape subtract tanh transpose tuple
Invoking RemoveOptimizationBarriers pass
Invoking NeuronInstCombine pass.
Total SqrtMul sequences deleted = 0
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Start tensorization
2023-09-08T21:17:22Z WARNING 238334 [job.Frontend.0]: TVM not detected.
2023-09-08T21:17:23Z INFO 238334 [job.Frontend.0]: Num parallel jobs: 1
2023-09-08T21:17:23Z INFO 238334 [root/Tensorizer/All]: Enter time region
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Frontend found a single CU. Switching to flat flow.
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Building model from Penguin script "penguin.py"...
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Tensorizer options: --disable-bitcasted-transpose --dont-verify-after-all --fp32-cast=matmult-bf16 --mm-transpose-type=fp32 --disable-expensive-checks --disable-max-stride-tiling --enable-replication --max-local-tensor-tile-size-in-bytes=32768 --tensor-layout-p-order=0 --tensor-layout-b-order=1 --enable-advanced-delinearization --weight-coalescing-threshold=512 --enable-bir-converter=enable --sunda-batchnorm --enable-tritium-loopfusion --keep-remat-dma-transpose --enable-softmax-kernel
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Building model from Penguin script "penguin.py"...
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Successfully built model.
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/DoNothing]: Running DoNothing
2023-09-08T21:17:23Z INFO 238334 [DoNothing]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/DoNothing]: DoNothing finished after 0.000 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/CanonicalizeIR]: Running CanonicalizeIR
2023-09-08T21:17:23Z INFO 238334 [CanonicalizeIR]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/CanonicalizeIR]: CanonicalizeIR finished after 0.018 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ExpandBatchNorm]: Running ExpandBatchNorm
2023-09-08T21:17:23Z INFO 238334 [ExpandBatchNorm]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ExpandBatchNorm]: ExpandBatchNorm finished after 0.057 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ResolveComplicatePredicates]: Running ResolveComplicatePredicates
2023-09-08T21:17:23Z INFO 238334 [ResolveComplicatePredicates]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ResolveComplicatePredicates]: ResolveComplicatePredicates finished after 0.017 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/AffinePredicateResolution]: Running AffinePredicateResolution
2023-09-08T21:17:23Z INFO 238334 [AffinePredicateResolution]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/AffinePredicateResolution]: AffinePredicateResolution finished after 0.019 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/EliminateDivs]: Running EliminateDivs
2023-09-08T21:17:23Z INFO 238334 [EliminateDivs]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/EliminateDivs]: EliminateDivs finished after 0.018 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/PerfectLoopNest]: Running PerfectLoopNest
2023-09-08T21:17:23Z INFO 238334 [PerfectLoopNest]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/PerfectLoopNest]: PerfectLoopNest finished after 0.016 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:24Z INFO 238334 [Simplifier]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.223 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:24Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.015 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TCTransform]: Running TCTransform
2023-09-08T21:17:24Z INFO 238334 [TCTransform]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TCTransform]: TCTransform finished after 0.027 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: Running CommuteConcat
2023-09-08T21:17:24Z INFO 238334 [CommuteConcat]: Finished (changed=False)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: CommuteConcat finished after 0.016 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpFusion]: Running TensorOpFusion
2023-09-08T21:17:24Z INFO 238334 [TensorOpFusion]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpFusion]: TensorOpFusion finished after 0.018 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpTransform]: Running TensorOpTransform
2023-09-08T21:17:24Z INFO 238334 [TensorOpTransform]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpTransform]: TensorOpTransform finished after 0.060 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/LowerTensorOp]: Running LowerTensorOp
2023-09-08T21:17:24Z INFO 238334 [LowerTensorOp]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/LowerTensorOp]: LowerTensorOp finished after 0.017 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/MemcpyElimination]: Running MemcpyElimination
2023-09-08T21:17:25Z INFO 238334 [MemcpyElimination]: Finished (changed=True)
2023-09-08T21:17:25Z USER 238334 [sg0000/Tensorizer/MemcpyElimination]: MemcpyElimination finished after 1.058 seconds
2023-09-08T21:17:25Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:26Z INFO 238334 [LoopFusion]: Finished (changed=True)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 1.182 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:26Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.112 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:26Z INFO 238334 [Delinearization]: Finished (changed=True)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.052 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/DeadStoreElimination]: Running DeadStoreElimination
2023-09-08T21:17:28Z INFO 238334 [DeadStoreElimination]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/DeadStoreElimination]: DeadStoreElimination finished after 1.288 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:28Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.116 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=True)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.064 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:28Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.019 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:28Z INFO 238334 [LoopFusion]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 0.224 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/SimplifySlice]: Running SimplifySlice
2023-09-08T21:17:28Z INFO 238334 [SimplifySlice]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/SimplifySlice]: SimplifySlice finished after 0.007 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.019 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:28Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.114 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: Running ValueNumbering
2023-09-08T21:17:28Z INFO 238334 [ValueNumbering]: Finished (changed=True)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: ValueNumbering finished after 0.036 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.018 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/PadElimination]: Running PadElimination
2023-09-08T21:17:28Z INFO 238334 [PadElimination]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/PadElimination]: PadElimination finished after 0.001 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:28Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.058 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:29Z INFO 238334 [LoopFusion]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 0.218 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:29Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.007 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:29Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.111 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:29Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: Running ValueNumbering
2023-09-08T21:17:29Z INFO 238334 [ValueNumbering]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: ValueNumbering finished after 0.024 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/TCTransform]: Running TCTransform
2023-09-08T21:17:29Z INFO 238334 [TCTransform]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/TCTransform]: TCTransform finished after 0.010 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: Running CommuteConcat
2023-09-08T21:17:29Z INFO 238334 [CommuteConcat]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: CommuteConcat finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/RecognizeOpIdiom]: Running RecognizeOpIdiom
2023-09-08T21:17:29Z INFO 238334 [RecognizeOpIdiom]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/RecognizeOpIdiom]: RecognizeOpIdiom finished after 0.047 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MaskPropagation]: Running MaskPropagation
2023-09-08T21:17:29Z INFO 238334 [MaskPropagation]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MaskPropagation]: MaskPropagation finished after 0.023 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Recompute]: Running Recompute
2023-09-08T21:17:29Z INFO 238334 [Recompute]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Recompute]: Recompute finished after 0.001 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: Running DeadCodeElimination
2023-09-08T21:17:29Z INFO 238334 [DeadCodeElimination]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z INFO 238334 [Tensorizer]: After optimization: 138 statements
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DoNothing]: Running DoNothing
2023-09-08T21:17:29Z INFO 238334 [DoNothing]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DoNothing]: DoNothing finished after 0.000 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MutateDataType]: Running MutateDataType
2023-09-08T21:17:29Z INFO 238334 [MutateDataType]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MutateDataType]: MutateDataType finished after 0.006 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/AutoCastTCInputs]: Running AutoCastTCInputs
2023-09-08T21:17:29Z INFO 238334 [AutoCastTCInputs]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/AutoCastTCInputs]: AutoCastTCInputs finished after 0.015 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:29Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.009 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:29Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.114 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LegalizeCCOpLayout]: Running LegalizeCCOpLayout
2023-09-08T21:17:29Z INFO 238334 [LegalizeCCOpLayout]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LegalizeCCOpLayout]: LegalizeCCOpLayout finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:29Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.017 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: Running DeadCodeElimination
2023-09-08T21:17:29Z INFO 238334 [DeadCodeElimination]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/InferIntrinsicOnCC]: Running InferIntrinsicOnCC
2023-09-08T21:17:29Z INFO 238334 [InferIntrinsicOnCC]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/InferIntrinsicOnCC]: InferIntrinsicOnCC finished after 0.099 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ResolveAccessConflict]: Running ResolveAccessConflict
2023-09-08T21:17:29Z INFO 238334 [ResolveAccessConflict]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ResolveAccessConflict]: ResolveAccessConflict finished after 0.065 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:29Z INFO 238334 [LICM]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.056 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LocalLayoutOpt]: Running LocalLayoutOpt
2023-09-08T21:17:29Z INFO 238334 [LocalLayoutOpt]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LocalLayoutOpt]: LocalLayoutOpt finished after 0.053 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.019 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/OrigLayoutTilingPipeline]: Running OrigLayoutTilingPipeline
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GlobalLayoutOpt]: Running GlobalLayoutOpt
2023-09-08T21:17:31Z INFO 238334 [GlobalLayoutOpt]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/GlobalLayoutOpt]: GlobalLayoutOpt finished after 1.704 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/CanonicalizeDAG]: Running CanonicalizeDAG
2023-09-08T21:17:31Z INFO 238334 [CanonicalizeDAG]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/CanonicalizeDAG]: CanonicalizeDAG finished after 0.082 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/FlattenAxesForTiling]: Running FlattenAxesForTiling
2023-09-08T21:17:31Z INFO 238334 [FlattenAxesForTiling]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/FlattenAxesForTiling]: FlattenAxesForTiling finished after 0.075 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/SundaSizeTiling]: Running SundaSizeTiling
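The log above ends while SundaSizeTiling is still running, so it can help to see which earlier passes took the most time. The following is a minimal sketch, not part of the Neuron tooling: it assumes the "<Pass> finished after N seconds" line format shown in the log, and the function name `summarize_pass_times` is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "... [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 1.182 seconds"
PASS_LINE = re.compile(
    r"\]: (?P<name>\w+) finished after (?P<secs>\d+(?:\.\d+)?) seconds"
)

def summarize_pass_times(log_text):
    """Return (pass_name, total_seconds, runs) tuples, slowest first."""
    totals = defaultdict(float)
    runs = defaultdict(int)
    for line in log_text.splitlines():
        match = PASS_LINE.search(line)
        if match:
            totals[match.group("name")] += float(match.group("secs"))
            runs[match.group("name")] += 1
    rows = [(name, round(total, 3), runs[name]) for name, total in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Three lines copied from the log above; a pass that runs twice is summed.
sample = """\
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 1.182 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/DeadStoreElimination]: DeadStoreElimination finished after 1.288 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 0.224 seconds
"""

summary = summarize_pass_times(sample)
for name, secs, count in summary:
    print(f"{name}: {secs:.3f}s over {count} run(s)")
```

A pass that logs "Running" but never logs "finished" does not appear in the summary, which is one way to spot where the compile stopped.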
Trying to execute: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
After cloning the Llama-13b repo from Hugging Face, I find that config.json is missing from the downloaded content, and the code complains about it.
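A quick check of the cloned directory before loading the model can confirm which files are absent. This is a minimal sketch: the helper name `missing_model_files` is hypothetical, and the default list contains only config.json because that is the only file this report names. Other required files would be an assumption.

```python
import tempfile
from pathlib import Path

def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not regular files in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]

# Demonstration on a temporary directory standing in for the cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_model_files(tmp)            # nothing has been written yet
    (Path(tmp) / "config.json").write_text("{}")
    after = missing_model_files(tmp)             # config.json now exists

print("missing before:", before)
print("missing after:", after)
```

Pointing the helper at the real clone path instead of the temporary directory shows whether the download itself is incomplete.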
Hello everybody. I got this error when trying to compile the UNet for SD 1.5. Even after reducing the image dimension to 256, the issue persists. Do you guys have any suggestions?
2023-12-26T08:37:54Z ERROR 26199 [job.WalrusDriver.0]: Backend exited with code -9 and stderr:
2023-12-26T08:37:54Z INFO 26191 [root]: Subcommand returned with exitcode=-9
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: ***************************************************************
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: An Internal Compiler Error has occurred
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: ***************************************************************
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: [F137] neuronx-cc was forcibly killed - This most commonly occurs due to insufficient system memory. Using a smaller data type, dimensions, batch size, or a larger instance type may help.
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: Internal details:
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: Type: <class 'RuntimeError'>
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: File "neuronxcc/driver/CommandDriver.py", line 329, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Diagnostic information:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: NeuronX Compiler version 2.12.54.0+f631c2365
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Python version 3.8.10
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: HWM version 2.12.0.0-422c9037c
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: NumPy version 1.24.4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Running on AMI ami-0fdb13d8e11515ea4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Running in region use1-az4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Diagnostic logs stored in /home/ubuntu/dungtt/AI-Art/log-neuron-cc.txt
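The exit code -9 in the log above is consistent with the POSIX convention used by Python's subprocess module, where a negative return code -N means the child process was terminated by signal N. Signal 9 is SIGKILL, which matches the "forcibly killed" message. The sketch below assumes the compiler driver follows that convention; `describe_exit_status` is a hypothetical helper, not a Neuron API.

```python
import signal

def describe_exit_status(returncode):
    """Translate a subprocess-style return code into a short description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode" on POSIX.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"

print(describe_exit_status(-9))
```

On Linux, a SIGKILL that arrives during a memory-heavy compile is often sent by the kernel's out-of-memory killer, and the kernel log (for example via dmesg) usually records that event.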
Hi!
I am trying to convert an SD1.5 based model with neuronx following this example
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_sd2_512_inference.ipynb
What I did was:
1) Launched an AWS EC2 inf2.8xlarge instance
2) Ran
sudo apt-get install linux-headers-$(uname -r) -y
sudo apt-get install aws-neuronx-dkms --allow-change-held-packages -y
source /opt/aws_neuron_venv_pytorch/bin/activate
3) Followed the guide for SD2, but commented out the cross-attention modification and changed the shape of encoder_hidden_states to match the shape used by SD1.5
All parts except unet compile fine, but unet fails with an error.
Here is the code that fails and attached are the error log and traceback log:
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe
sample_1b = torch.randn([1, 4, 64, 64]).bfloat16()
timestep_1b = torch.tensor(999).bfloat16().expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 768]).bfloat16()
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b
unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference"],
)
Is this a bug in neuronx, or am I doing something wrong?
Thanks
I am following the example notebook Stable Diffusion 2.1 512x512 but can't compile the model using an inf2.xlarge instance.
After a number of correctly compiled steps that look like the following:
Compiler status PASS
I get an error message:
2023-05-04 19:31:29.000758: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
Traceback (most recent call last):
File "/pkg/modal/_container_entrypoint.py", line 329, in handle_input_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 402, in call_function_sync
res = fun(*args, **kwargs)
File "/root/sd_2_1_inf.py", line 152, in compile_model
decoder_neuron = torch_neuronx.trace(
File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 309, in trace
neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 232, in hlo_compile
raise RuntimeError(f'neuronx-cc failed with {status}')
RuntimeError: neuronx-cc failed with -9
Is this a known issue? What's the recommended setup in terms of library versions and instance types to be able to compile Stable Diffusion 2.1?
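Since -9 usually indicates the compiler process was killed, one thing worth checking before a compile is how much host memory is available. The sketch below parses the standard Linux /proc/meminfo format. The helper name `mem_available_gib` is hypothetical, and no particular threshold is implied, because the memory the compiler needs for this model is not stated in the report.

```python
def mem_available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text in GiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])   # the value is reported in kB (KiB)
            return kib / (1024 ** 2)
    return None

# Illustrative sample text; on a real Linux host use:
#   with open("/proc/meminfo") as f: text = f.read()
sample = "MemTotal:       16000000 kB\nMemAvailable:    8388608 kB\n"
available = mem_available_gib(sample)
print(f"available host memory: {available:.1f} GiB")
```

Logging this value just before each torch_neuronx.trace call makes it easier to tell whether a failing step coincides with low available memory.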
Please provide support for converting TrOCR Base Printed with Neuron and running its inference with JIT trace.
I am getting this error when trying through the uploaded notebook. The process flow is as below:
after setting the tensor shape to 768, the encoder compiles, but the decoder fails to compile, with the Neuron compile command returning -9.
@hyandell @mattmcclean @aws-maens @brunopistone