aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

License: Other

Jupyter Notebook 93.51% Python 5.84% Shell 0.55% C++ 0.10% Dockerfile 0.01%

deep-learning machine-learning pytorch tensorflow

aws-neuron-samples's Issues

I can't set params `optlevel` to `1` with torch_neuronx.trace

My environment is an AWS inf2.8xlarge instance.

python : 3.8.10
torch-neuronx : 2.1.1.2.0.1b0
neuronx-cc : 2.12.68.0+4480452af

I'm trying to compile an ESRGAN PyTorch model for Neuron, but I've run into an issue.

from PIL import Image
import requests

import torch
import torch_neuronx
from torchvision import models
from torchvision.transforms import functional

from modules.esrgan_upscale import upscale_model_loader
import os
os.environ["NEURON_CC_FLAGS"] = "-O1"
# load the model
model = upscale_model_loader('modules/weight/4x-Ultrasharp.pth')
model.eval()

# Get an example input
image = Image.open('/home/ubuntu/diffusers-ultimate-upscale/testIm.png')
image = image.convert('RGB')
image = functional.to_tensor(image)
image = torch.unsqueeze(image, 0)

# Run inference on CPU
output_cpu = model(image)

# Compile the model
model_neuron = torch_neuronx.trace(model, image, compiler_args=['--optlevel', '1'])

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

When I run this code, it first prints this log:

2024-02-20T13:36:54Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1

I want to compile with -O1 because of this error (yes, the compilation failed):

[XTP002] Too many instructions after unroll for function sg0000! - Compiling under --optlevel=1 may result in smaller graphs. If you are using a transformer model, try using a smaller context_length_estimate value.

I can't set the optlevel flag to 1, even after changing the module code like this:

    command = [
        neuron_cc,
        "compile",
        filename,
        "--framework",
        "XLA",
        "--target",
        "trn1",
        "--output",
        neff_filename,
        "--optlevel",
        "1"
    ]
    command.extend(compiler_args)

What should I do to compile with --optlevel=1 using torch_neuronx.trace?
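For anyone reading along: with the edit shown above, `command.extend(compiler_args)` appends without checking, so the final command can contain `--optlevel` twice when the same flag is also passed through `compiler_args`. Below is a minimal sketch of merging two option lists so each flag appears once. `split_pairs` and `merge_flags` are hypothetical helpers written for illustration, not part of torch_neuronx, and the base value `"2"` is an assumed placeholder. The sketch says nothing about which occurrence neuronx-cc actually honours.

```python
def split_pairs(args):
    """Turn ['--a', '1', '--b', '2'] into [('--a', '1'), ('--b', '2')]."""
    if len(args) % 2:
        raise ValueError("expected a flat list of '--flag value' pairs")
    return list(zip(args[0::2], args[1::2]))


def merge_flags(base, extra):
    """Merge two option lists; a flag given in `extra` overrides the one in `base`."""
    merged = dict(split_pairs(base))   # dicts keep insertion order
    merged.update(split_pairs(extra))  # overriding keeps the original position
    flat = []
    for flag, value in merged.items():
        flat += [flag, value]
    return flat


# Illustrative values only: the base optlevel "2" is an assumption.
base_options = ["--framework", "XLA", "--target", "trn1", "--optlevel", "2"]
compiler_args = ["--optlevel", "1"]

options = merge_flags(base_options, compiler_args)
print(options)
```

Merging by flag name rather than concatenating makes it easy to check which optimisation level ends up in the assembled command.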

After compiling successfully, I ran the pipeline and it reported the following error:

TypeError Traceback (most recent call last)
Cell In[5], line 35
24 prompt = ["a photo of an astronaut riding a horse on mars",
25 "sonic on the moon",
26 "elvis playing guitar while eating a hotdog",
(...)
31 "kids playing soccer at the FIFA World Cup"
32 ]
34 # First do a warmup run so all the asynchronous loads can finish
---> 35 image_warmup = pipe(prompt[0]).images[0]
37 plt.title("Image")
38 plt.xlabel("X pixel scaling")

File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py:1174, in StableDiffusionXLPipeline.__call__(self, prompt, prompt_2, height, width, num_inference_steps, timesteps, denoising_end, guidance_scale, negative_prompt, negative_prompt_2, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds, ip_adapter_image, ip_adapter_image_embeds, output_type, return_dict, cross_attention_kwargs, guidance_rescale, original_size, crops_coords_top_left, target_size, negative_original_size, negative_crops_coords_top_left, negative_target_size, clip_skip, callback_on_step_end, callback_on_step_end_tensor_inputs, **kwargs)
1172 if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
1173 added_cond_kwargs["image_embeds"] = image_embeds
-> 1174 noise_pred = self.unet(
1175 latent_model_input,
1176 t,
1177 encoder_hidden_states=prompt_embeds,
1178 timestep_cond=timestep_cond,
1179 cross_attention_kwargs=self.cross_attention_kwargs,
1180 added_cond_kwargs=added_cond_kwargs,
1181 return_dict=False,
1182 )[0]
1184 # perform guidance
1185 if self.do_classifier_free_guidance:

File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)

File /opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None

TypeError: NeuronUNet.forward() got an unexpected keyword argument 'timestep_cond'
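The traceback shows the pipeline passing `timestep_cond=` to a wrapper whose `forward()` does not declare that keyword. Here is a minimal pure-Python sketch of that failure mode and of one possible way around it. Both classes and their signatures are hypothetical stand-ins made up for illustration; they are not the notebook's actual `NeuronUNet`, and silently ignoring `timestep_cond` is only reasonable if the pipeline passes `None` for it.

```python
class StrictUNetSketch:
    """Stand-in for a wrapper written before the pipeline added a new keyword."""

    def forward(self, sample, timestep, encoder_hidden_states,
                added_cond_kwargs=None, return_dict=False,
                cross_attention_kwargs=None):
        return ("noise_pred",)


class TolerantUNetSketch(StrictUNetSketch):
    """Same wrapper, but accepts (and ignores) the newer keyword."""

    def forward(self, sample, timestep, encoder_hidden_states,
                timestep_cond=None, **kwargs):
        # timestep_cond is accepted so the call succeeds; it is not used here.
        return super().forward(sample, timestep, encoder_hidden_states, **kwargs)


def call_like_pipeline(unet):
    """Mimic the keyword arguments visible in the traceback above."""
    return unet.forward(
        "latents", 0,
        encoder_hidden_states="embeds",
        timestep_cond=None,
        cross_attention_kwargs=None,
        added_cond_kwargs={},
        return_dict=False,
    )[0]


try:
    call_like_pipeline(StrictUNetSketch())
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)  # mentions the unexpected keyword

tolerant_result = call_like_pipeline(TolerantUNetSketch())
print(strict_error)
print(tolerant_result)
```

The other route is to pin a diffusers version whose pipeline call matches the wrapper, which avoids touching the wrapper at all.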

Llama2 quantized model on Inf2 generating nonsense

I am following the steps (https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb) to run a Llama2 quantized model (https://huggingface.co/TheBloke/Dolphin-Llama2-7B-AWQ) on an AWS inf2 instance (inf2.8xlarge).

I can run the code; however, when I try to generate a sequence, I get a nonsensical output stream:

>>> neuron_model.sample(tokenizer.encode("who is prime minister of uk", return_tensors="pt"), sequence_length=2048, top_k=50, streamer=TextStreamer(tokenizer))

isherтак discoveryLENGrektta mel damтакudoudoisherkl̂LENGifarola melLENG destouselikhudocherิuskkl Hauptumar discovery Malludoikh moduleджа moduleelifudouskswerLENGkl discovery discoveryaltraungsusrLENG КурcherLENGLENGToolsivelivelusrungs Haupt geldig modulesivel modulesусrola discoverydelegate Haupt discoveryugeniture moduleselif›ugen Кеede geldig discovery Schl Mallivel HöheLENG audelegatedelegateusr КеedeLENG› Кур КеdelegateLENGudo›usr Mallrellppen›delegateivel Schldelegate accessibleodgeugenumar destдоваусdelegateToolsklundesede Кур Кеkl Mallugenentityikzdelegate discoveryanzen destusrungsppenentitychioíkíkkldelegate КеLENGrellToolsommenсиingu destLENGaussedeugnougnoppenikzíkLENG Mall auLENGrellikzivelugenkldelegatedelegateftyungsichtsdelegate Кеajuси Höheewусundesaju Курусikzík ensuiteichtsewzna ensuiteAccess discoverydelegate Кеdelegateinguboldmath nucitenusr accessibleedeLENGppenikzdelegateichtsdelegateundiallotikz Ке bon Кур Ке Курrell Schldelegate Schlус Ке MallLENGodgeǧikzкурغ Кеanzenlotppenungsdelegateichtsivel moduledelegatedelegaterellundialLENGinguungsivelichtshtusrdelegatedelegatehirehtichtschiohtdelegateedeغajuingu КеungsenschaftLENGLENGajuкурdelegateсиichtsikzտ MallLENGLENGLENGLENG auichtsси КеaussغewкуркурivelLENG modulesichtsLENGungs主 Кеchkikz主ajuichtsewugenichts nucichtsкур Schldelegate bonкурlotajuusrundialdentкуркурrellغikzugenусrelllotugenLENGinguppenchiochkajuhireкурppenichtsдвиhtanzeníkGRichtsichts Schl bon Schlchkchk nucdelegateichts Schlitenitenдви moduleznaajudelegatelotchkanzenlotἱAccessdelegateLENG nucinguchkitenppenусусdelegateкурдвиусikzundialajuenschaftdelegateznaдвикурichtschio Кеewadalichtsreesichtsտchioкурichtsenschaftichtsrell bonikzlot desc Mallкуркурchioсиadalenschaftinguppenusrhireikzivel Кеikzinguppen descdelegateusrikzichtsznaichtsewchkewrellAccessewichtsichtsдвикурikzznaichtslot Schlew nucíkкур nucAccessкурichtschioдвиivel firing nuc ordchiochkhireус auskeichtsodgeadalкурungsichtsewedeikz 
bonусewadalchkichtsATA主enschaftewusr
jurкурусppenichtsundialajuichtsLENGenschaftedeewichtsдвиppenichts sl nucchkadalкуркурichtsdelegateikzinguLEFTLEFTдвиchkкурchk bonundialundialadalundial Schlodgechk firing bonedeichts Abbкур desc Ке Schl descundialкурznalot auichts Schlclean Кеclean Mallchkadal reciznaadalundialichts formulachio Mallchioкурclean nucусhireATAichtshire desc desc recidelegatechioichtsichtschklotichtsusrichtsungs主rell Кеchioclean sl nucкуркурichtsadalundiallotGRewсиznaewhire主курewichtsкурсиichtsristichtscleanristichts ordAccessichtschkichtsdelegateungshireundialGRristíkodgeGRungs nucкур descLEFTinguLEFTikz Schlhirerellikzungsundial nucichtsкур AbbусewchioAccessodgeATA

File not found error

Hello!

I attempted to run the Jupyter notebook on an inf2.48xlarge instance, and the following error occurred:

I'm not sure what caused the error, but this is what was generated under neuron_artifacts:

Installed Packages

absl-py==2.1.0
accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-timeout==4.0.3
attrs==23.2.0
aws-neuronx-runtime-discovery==2.9
beautifulsoup4==4.12.3
blinker==1.8.2
boto3==1.34.115
botocore==1.34.115
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloud-tpu-client==0.10
coloredlogs==15.0.1
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1710320294760/work
dataclasses-json==0.6.6
datasets==2.19.1
debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1707444420542/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
Deprecated==1.2.14
dill==0.3.8
dirtyjson==1.0.8
distro==1.9.0
docutils==0.21.2
duckduckgo_search==6.1.2
ec2-metadata==2.10.0
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
filelock==3.14.0
Flask==3.0.3
frozenlist==1.4.1
fsspec==2024.3.1
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.29.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.63.0
greenlet==3.0.3
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
Hypercorn==0.17.3
hyperframe==6.0.1
idna==3.7
importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1710971335535/work
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1708996548741/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1715263367085/work
islpy==2023.1
itsdangerous==2.2.0
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
jsonpatch==1.33
jsonpointer==2.4
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1716472197302/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1710257277185/work
langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.2
langchain-text-splitters==0.2.0
langsmith==0.1.63
libneuronxla==2.0.965
llama-index==0.10.40
llama-index-agent-openai==0.2.5
llama-index-cli==0.1.12
llama-index-core==0.10.40
llama-index-embeddings-huggingface==0.2.1
llama-index-embeddings-openai==0.1.10
llama-index-indices-managed-llama-cloud==0.1.6
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.21
llama-index-multi-modal-llms-openai==0.1.6
llama-index-program-openai==0.1.6
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.23
llama-index-readers-llama-parse==0.1.4
llama-parse==0.4.4
llamaindex-py-client==0.1.19
lockfile==0.12.2
MarkupSafe==2.1.5
marshmallow==3.21.2
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work
minijinja==2.0.1
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
mypy-extensions==1.0.0
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx==2.6.3
neuronx-cc==2.13.66.0+6dfecc895
neuronx-distributed==0.7.0
nltk==3.8.1
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
openai==1.30.5
optimum==1.18.1
optimum-neuron==0.0.22
orjson==3.10.3
outcome==1.3.0.post0
packaging==23.2
pandas==2.2.2
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
pgzip==0.3.5
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
pillow==10.3.0
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1715777629804/work
priority==2.0.0
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
protobuf==3.19.6
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722392846/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
pyarrow==16.1.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.7.2
pydantic_core==2.18.3
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work
pyparsing==3.1.2
pypdf==4.2.0
pyreqwest_impersonate==0.4.6
PySocks==1.7.1
python-daemon==3.0.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1715024398995/work
Quart==0.19.6
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.3.0
rsa==4.9
s3transfer==0.10.1
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.11.2
selenium==4.21.0
sentence-transformers==2.7.0
sentencepiece==0.2.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
SQLAlchemy==2.0.30
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
striprtf==0.0.26
sympy==1.12.1
taskgroup==0.0.0a4
tenacity==8.3.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.15.2
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.1.0
torch-xla==2.1.2
torchvision==0.16.2
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1708363098266/work
tqdm==4.66.4
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work
transformers==4.36.2
transformers-neuronx==0.10.0.21
trio==0.25.1
trio-websocket==0.11.1
triton==2.1.0
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1712329955671/work
tzdata==2024.1
uritemplate==3.0.1
urllib3==2.2.1
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
Werkzeug==3.0.3
wrapt==1.16.0
wsproto==1.2.0
xxhash==3.4.1
yarl==1.9.4
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work

Steps to reproduce:

Run the code exactly as given in https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-3-70b-sampling.ipynb

Would love to get some support on this!

TypeError: Got unsupported ScalarType BFloat16

Hello Everyone,

I am trying to follow the directions in https://aws.amazon.com/blogs/machine-learning/maximize-stable-diffusion-performance-and-lower-inference-costs-with-aws-inferentia2/. I am not sure what I am doing wrong and would love some help! Thanks in advance!

Simple Env

My environment looks as follows:
instance: inf2.8xlarge
ami: aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

Error

> source /opt/aws_neuron_venv_pytorch/bin/activate
> jupyter nbconvert --to script hf_pretrained_sd2_512_inference.ipynb 
> cp hf_pretrained_sd2_512_inference.py seth_test.py
> python seth_test.py 
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 210524.91it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/Developer/run/seth_test.py:189 in <module>                                          │
│                                                                                                  │
│   186 encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)                         │
│   187 example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b                          │
│   188                                                                                            │
│ ❱ 189 unet_neuron = torch_neuronx.trace(                                                         │
│   190 │   unet,                                                                                  │
│   191 │   example_inputs,                                                                        │
│   192 │   compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),                          │
│                                                                                                  │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:265 in  │
│ trace                                                                                            │
│                                                                                                  │
│   262 │   │   hlo_filename = os.path.join(model_dir, 'graph.hlo')                                │
│   263 │   │                                                                                      │
│   264 │   │   # Write weights to disk                                                            │
│ ❱ 265 │   │   weight_paths = write_params(model_dir, constant_parameter_tensors)                 │
│   266 │   │                                                                                      │
│   267 │   │   table = {                                                                          │
│   268 │   │   │   "model_files": "graph.hlo",                                                    │
│                                                                                                  │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:306 in  │
│ write_params                                                                                     │
│                                                                                                  │
│   303 │                                                                                          │
│   304 │   # Write tensor data to disk                                                            │
│   305 │   for name, weight in weights.items():                                                   │
│ ❱ 306 │   │   np.save(f'{directory}/weights/{name}.npy', weight.numpy())                         │
│   307 │                                                                                          │
│   308 │   # Write mapping file. Paths are relative to the directory                              │
│   309 │   weight_paths = {name: f'weights/{name}.npy' for name in weights}                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Got unsupported ScalarType BFloat16
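Some background on the error: the failing line calls `weight.numpy()`, and stock NumPy has no bfloat16 dtype, so a bfloat16 tensor has nothing to convert to. bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of a float32. The stdlib-only sketch below illustrates that layout by truncating a float32 bit pattern (real converters usually round rather than truncate). The function names are made up for illustration. In general PyTorch code a common workaround is to upcast with `.float()` before calling `.numpy()`, but whether that is the right fix inside torch_neuronx is not established here.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern obtained by truncating x's float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bfloat16_bits_to_float32(bits16):
    """Widen a bfloat16 bit pattern back to a Python float via float32."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


exact = 3.140625            # needs only 7 mantissa bits, so it survives the trip
bits = float32_to_bfloat16_bits(exact)
round_trip = bfloat16_bits_to_float32(bits)

lossy = bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159))

print(hex(bits), round_trip, lossy)
```

Because every bfloat16 value is exactly representable in float32, upcasting before saving loses no information; it only doubles the size on disk.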

PIP versions

Python Dependencies:

(aws_neuron_venv_pytorch) ubuntu@ip-172-31-1-65:~/Developer/run$ pip freeze
absl-py==1.4.0
accelerate==0.16.0
aiofiles==22.1.0
aiohttp==3.8.4
aiosignal==1.3.1
aiosqlite==0.19.0
amqp==5.1.1
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
astroid==2.15.4
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
Automat==22.10.0
aws-neuronx-runtime-discovery==2.9
awscli==1.27.126
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
billiard==3.6.4.0
bleach==6.0.0
boto3==1.26.126
botocore==1.29.126
build==0.10.0
cachetools==5.3.0
celery==5.2.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
cloud-tpu-client==0.10
cloudpickle==2.2.1
cmake==3.26.3
colorama==0.4.4
comm==0.1.3
constantly==15.1.0
contourpy==1.0.7
cryptography==40.0.2
cssselect==1.2.0
cycler==0.11.0
dask==2023.4.1
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
diffusers==0.14.0
dill==0.3.6
distlib==0.3.6
docutils==0.16
dparse==0.6.2
exceptiongroup==1.1.1
executing==1.2.0
fastapi==0.95.1
fastjsonschema==2.16.3
filelock==3.12.0
fonttools==4.39.3
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.4.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.17.3
google-auth-httplib2==0.1.0
googleapis-common-protos==1.59.0
httpie==3.2.1
httplib2==0.22.0
huggingface-hub==0.14.1
hyperlink==21.0.0
idna==3.4
imageio==2.28.1
importlib-metadata==6.6.0
importlib-resources==5.12.0
incremental==22.10.0
iniconfig==2.0.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.2
ipython-genutils==0.2.0
ipywidgets==8.0.6
islpy==2022.1.1
isoduration==20.11.0
isort==5.12.0
itemadapter==0.8.0
itemloaders==1.1.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.4
jupyter_client==8.2.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.9.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.1
kiwisolver==1.4.4
kombu==5.2.4
lazy-object-proxy==1.9.0
libneuronxla==0.5.205
llvmlite==0.40.0
locket==1.0.0
lockfile==0.12.2
lxml==4.9.2
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==2.0.5
multidict==6.0.4
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.3.1
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==2.6.3
neuronx-cc==2.6.0.19+3d819e565
neuronx-hwm==2.6.0.0+826e77395
notebook==6.5.4
notebook_shim==0.2.3
numba==0.57.0
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
opencv-python==4.7.0.72
packaging==21.3
pandas==2.0.1
pandocfilters==1.5.0
parsel==1.8.1
parso==0.8.3
partd==1.4.0
pexpect==4.8.0
pgzip==0.3.4
pickleshare==0.7.5
Pillow==9.5.0
pip-tools==6.13.0
pipenv==2023.2.4
pkg_resources==0.0.0
pkgutil_resolve_name==1.3.10
platformdirs==3.5.0
plotly==5.14.1
pluggy==1.0.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
Protego==0.2.1
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.7
PyDispatcher==2.0.7
Pygments==2.15.1
pylint==2.17.3
pyOpenSSL==23.1.1
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyrsistent==0.19.3
PySocks==1.7.1
pytest==7.3.1
python-daemon==3.0.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyYAML==5.4.1
pyzmq==25.0.2
queuelib==1.6.2
regex==2023.5.5
requests==2.29.0
requests-file==1.5.1
requests-toolbelt==1.0.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.3.5
rsa==4.7.2
ruamel.yaml==0.17.22
ruamel.yaml.clib==0.2.7
s3transfer==0.6.0
safetensors==0.3.1
scikit-learn==1.2.2
scipy==1.7.3
Scrapy==2.8.0
seaborn==0.12.2
Send2Trash==1.8.2
service-identity==21.1.0
shap==0.41.0
six==1.16.0
slicer==0.0.7
sniffio==1.3.0
soupsieve==2.4.1
stack-data==0.6.2
starlette==0.26.1
tenacity==8.2.2
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tldextract==3.4.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
tomlkit==0.11.8
toolz==0.12.0
torch==1.13.1
torch-neuronx==1.13.0.1.6.1
torch-xla==1.13.0+torchneuron5
torchvision==0.14.0
tornado==6.3.1
tqdm==4.65.0
traitlets==5.9.0
transformers==4.30.2
Twisted==22.10.0
typing_extensions==4.5.0
tzdata==2023.3
uri-template==1.2.0
uritemplate==3.0.1
urllib3==1.26.15
vine==5.0.0
virtualenv==20.23.0
virtualenv-clone==0.5.7
w3lib==2.1.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.7
wrapt==1.15.0
y-py==0.5.9
yarl==1.9.2
ypy-websocket==0.8.2
zipp==3.15.0
zope.interface==6.0

unet compile failed at hf_pretrained_sdxl_base_1024_inference

When I run "hf_pretrained_sdxl_base_1024_inference", the process fails at "torch.jit.save(unet_neuron, unet_filename)" and the kernel dies while saving the file.

Can't get ControlNet on Inf2 to work

Hi team,
I am adapting this notebook, essentially instantiating a ControlNet pipe, such as

controlnet = ControlNetModel.from_pretrained("DionTimmer/controlnet_qrcode-control_v1p_sd15",
                                             torch_dtype=torch.float16)

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    safety_checker=None,
    torch_dtype=torch.float16
)

and then calling torch_neuronx.trace.
I suspect the main blocker is that the libraries the notebooks suggest installing

!pip install diffusers==0.14.0 transformers==4.30.2 accelerate==0.16.0 safetensors==0.3.1 matplotlib

are too "old" for ControlNet. For instance, transformers and accelerate need to be upgraded.
Upgrading them pulls in updates to other dependencies, and I end up with an environment that is completely different from the originally recommended one.
This causes torch, for instance, to complain about CUDA (among other things), even though we are on Inf2, which is confusing.
I tried multiple times but couldn't get very far.
I also tried with the latest Neuron release and still can't get it to work.

Any help would be massively appreciated!

Unable to trace SDXL VAE decoder with a different dimension

In torch-neuronx/inference/hf_pretrained_sdxl_1024_inference.ipynb, I tried to change [1, 4, 128, 128] to [1, 4, 104, 152] and it didn't work. More specifically, I was able to trace the unet and post_quant_conv with that shape, but not the decoder.
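For context on the two shapes: the latent tensor is laid out as [batch, channels, height, width], and the Stable Diffusion VAE decoder upsamples each spatial side by a factor of 8. A small sketch of that arithmetic follows; the helper name and the constant are assumptions for illustration and are not taken from the notebook.

```python
VAE_SCALE_FACTOR = 8  # assumption: each latent pixel decodes to an 8x8 image patch


def latent_to_image_size(latent_shape, scale=VAE_SCALE_FACTOR):
    """Map a [batch, channels, height, width] latent shape to (image_height, image_width)."""
    _batch, _channels, height, width = latent_shape
    return height * scale, width * scale


default_latent = [1, 4, 128, 128]
custom_latent = [1, 4, 104, 152]

default_size = latent_to_image_size(default_latent)  # (1024, 1024)
custom_size = latent_to_image_size(custom_latent)    # (832, 1216)

# Spatial element counts of the two latents, for comparison.
default_elems = default_latent[2] * default_latent[3]
custom_elems = custom_latent[2] * custom_latent[3]

print(default_size, custom_size, custom_elems / default_elems)
```

The custom latent is slightly smaller in total area than the default one, so the failure is not simply a matter of feeding the decoder more pixels.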

Here's the error I got:

2023-09-08T21:17:33Z Too many instructions after unroll for function sg0000 !
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <timed exec>:10

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:323, in trace(func, example_inputs, states, input_output_aliases, compiler_workdir, compiler_args, options)
    320     compiler_workdir = context.name
    322 with context:
--> 323     neff_filename, metaneff, flattener, packer = _trace(
    324         func,
    325         example_inputs,
    326         states,
    327         input_output_aliases,
    328         compiler_workdir,
    329         compiler_args,
    330         options,
    331     )
    332     return create_neuron_model(
    333         neff_filename,
    334         metaneff,
   (...)
    338         input_output_aliases,
    339     )

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:416, in _trace(func, example_inputs, states, input_output_aliases, compiler_workdir, compiler_args, options)
    413     handle.write(hlo.SerializeToString())
    415 # Compile HLO to NEFF
--> 416 neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
    418 metaneff = hlo_metaneff(hlo, input_parameter_names, updated_input_output_aliases)
    420 return neff_filename, metaneff.SerializeToString(), flattener, packer

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:281, in hlo_compile(filename, compiler_workdir, compiler_args)
    274     elif status == -11:
    275         logger.warning(
    276             "The neuronx-cc (neuron compiler) crashed (SEGFAULT). "
    277             "This is likely due to a bug in the compiler.  "
    278             "Please lodge an issue at 'https://github.com/aws/aws-neuron-sdk/issues'"
    279         )
--> 281     raise RuntimeError(f"neuronx-cc failed with {status}")
    283 return neff_filename

RuntimeError: neuronx-cc failed with 70

And the text printed before the error:

2023-09-08T21:17:23Z Running DoNothing
2023-09-08T21:17:23Z DoNothing finished after 0.000 seconds
2023-09-08T21:17:23Z Running CanonicalizeIR
2023-09-08T21:17:23Z CanonicalizeIR finished after 0.018 seconds
2023-09-08T21:17:23Z Running ExpandBatchNorm
2023-09-08T21:17:23Z ExpandBatchNorm finished after 0.057 seconds
2023-09-08T21:17:23Z Running ResolveComplicatePredicates
2023-09-08T21:17:23Z ResolveComplicatePredicates finished after 0.017 seconds
2023-09-08T21:17:23Z Running AffinePredicateResolution
2023-09-08T21:17:23Z AffinePredicateResolution finished after 0.019 seconds
2023-09-08T21:17:23Z Running EliminateDivs
2023-09-08T21:17:23Z EliminateDivs finished after 0.018 seconds
2023-09-08T21:17:23Z Running PerfectLoopNest
2023-09-08T21:17:23Z PerfectLoopNest finished after 0.016 seconds
2023-09-08T21:17:23Z Running Simplifier
2023-09-08T21:17:24Z Simplifier finished after 0.223 seconds
2023-09-08T21:17:24Z Running GenericAccessSimplifier
2023-09-08T21:17:24Z GenericAccessSimplifier finished after 0.015 seconds
2023-09-08T21:17:24Z Running TCTransform
2023-09-08T21:17:24Z TCTransform finished after 0.027 seconds
2023-09-08T21:17:24Z Running CommuteConcat
2023-09-08T21:17:24Z CommuteConcat finished after 0.016 seconds
2023-09-08T21:17:24Z Running TensorOpFusion
2023-09-08T21:17:24Z TensorOpFusion finished after 0.018 seconds
2023-09-08T21:17:24Z Running TensorOpTransform
2023-09-08T21:17:24Z TensorOpTransform finished after 0.060 seconds
2023-09-08T21:17:24Z Running LowerTensorOp
2023-09-08T21:17:24Z LowerTensorOp finished after 0.017 seconds
2023-09-08T21:17:24Z Running MemcpyElimination
2023-09-08T21:17:25Z MemcpyElimination finished after 1.058 seconds
2023-09-08T21:17:25Z Running LoopFusion
2023-09-08T21:17:26Z LoopFusion finished after 1.182 seconds
2023-09-08T21:17:26Z Running Simplifier
2023-09-08T21:17:26Z Simplifier finished after 0.112 seconds
2023-09-08T21:17:26Z Running Delinearization
2023-09-08T21:17:26Z Delinearization finished after 0.052 seconds
2023-09-08T21:17:26Z Running DeadStoreElimination
2023-09-08T21:17:28Z DeadStoreElimination finished after 1.288 seconds
2023-09-08T21:17:28Z Running Simplifier
2023-09-08T21:17:28Z Simplifier finished after 0.116 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.064 seconds
2023-09-08T21:17:28Z Running Delinearization
2023-09-08T21:17:28Z Delinearization finished after 0.019 seconds
2023-09-08T21:17:28Z Running LoopFusion
2023-09-08T21:17:28Z LoopFusion finished after 0.224 seconds
2023-09-08T21:17:28Z Running SimplifySlice
2023-09-08T21:17:28Z SimplifySlice finished after 0.007 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.019 seconds
2023-09-08T21:17:28Z Running Simplifier
2023-09-08T21:17:28Z Simplifier finished after 0.114 seconds
2023-09-08T21:17:28Z Running ValueNumbering
2023-09-08T21:17:28Z ValueNumbering finished after 0.036 seconds
2023-09-08T21:17:28Z Running LICM
2023-09-08T21:17:28Z LICM finished after 0.018 seconds
2023-09-08T21:17:28Z Running PadElimination
2023-09-08T21:17:28Z PadElimination finished after 0.001 seconds
2023-09-08T21:17:28Z Running Delinearization
2023-09-08T21:17:28Z Delinearization finished after 0.058 seconds
2023-09-08T21:17:28Z Running LoopFusion
2023-09-08T21:17:29Z LoopFusion finished after 0.218 seconds
2023-09-08T21:17:29Z Running GenericAccessSimplifier
2023-09-08T21:17:29Z GenericAccessSimplifier finished after 0.007 seconds
2023-09-08T21:17:29Z Running Simplifier
2023-09-08T21:17:29Z Simplifier finished after 0.111 seconds
2023-09-08T21:17:29Z Running LICM
2023-09-08T21:17:29Z LICM finished after 0.018 seconds
2023-09-08T21:17:29Z Running ValueNumbering
2023-09-08T21:17:29Z ValueNumbering finished after 0.024 seconds
2023-09-08T21:17:29Z Running TCTransform
2023-09-08T21:17:29Z TCTransform finished after 0.010 seconds
2023-09-08T21:17:29Z Running CommuteConcat
2023-09-08T21:17:29Z CommuteConcat finished after 0.008 seconds
2023-09-08T21:17:29Z Running RecognizeOpIdiom
2023-09-08T21:17:29Z RecognizeOpIdiom finished after 0.047 seconds
2023-09-08T21:17:29Z Running MaskPropagation
2023-09-08T21:17:29Z MaskPropagation finished after 0.023 seconds
2023-09-08T21:17:29Z Running Recompute
2023-09-08T21:17:29Z Recompute finished after 0.001 seconds
2023-09-08T21:17:29Z Running DeadCodeElimination
2023-09-08T21:17:29Z DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z Running DoNothing
2023-09-08T21:17:29Z DoNothing finished after 0.000 seconds
2023-09-08T21:17:29Z Running MutateDataType
2023-09-08T21:17:29Z MutateDataType finished after 0.006 seconds
2023-09-08T21:17:29Z Running AutoCastTCInputs
2023-09-08T21:17:29Z AutoCastTCInputs finished after 0.015 seconds
2023-09-08T21:17:29Z Running GenericAccessSimplifier
2023-09-08T21:17:29Z GenericAccessSimplifier finished after 0.009 seconds
2023-09-08T21:17:29Z Running Simplifier
2023-09-08T21:17:29Z Simplifier finished after 0.114 seconds
2023-09-08T21:17:29Z Running LegalizeCCOpLayout
2023-09-08T21:17:29Z LegalizeCCOpLayout finished after 0.008 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z Running Delinearization
2023-09-08T21:17:29Z Delinearization finished after 0.017 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z Running DeadCodeElimination
2023-09-08T21:17:29Z DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z Running InferIntrinsicOnCC
2023-09-08T21:17:29Z InferIntrinsicOnCC finished after 0.099 seconds
2023-09-08T21:17:29Z Running ResolveAccessConflict
2023-09-08T21:17:29Z ResolveAccessConflict finished after 0.065 seconds
2023-09-08T21:17:29Z Running LICM
2023-09-08T21:17:29Z LICM finished after 0.056 seconds
2023-09-08T21:17:29Z Running LocalLayoutOpt
2023-09-08T21:17:29Z LocalLayoutOpt finished after 0.053 seconds
2023-09-08T21:17:29Z Running DelinearIndices
2023-09-08T21:17:29Z DelinearIndices finished after 0.019 seconds
2023-09-08T21:17:29Z Running OrigLayoutTilingPipeline
2023-09-08T21:17:29Z Running GlobalLayoutOpt
2023-09-08T21:17:31Z GlobalLayoutOpt finished after 1.704 seconds
2023-09-08T21:17:31Z Running CanonicalizeDAG
2023-09-08T21:17:31Z CanonicalizeDAG finished after 0.082 seconds
2023-09-08T21:17:31Z Running FlattenAxesForTiling
2023-09-08T21:17:31Z FlattenAxesForTiling finished after 0.075 seconds
2023-09-08T21:17:31Z Running SundaSizeTiling
2023-09-08T21:17:33Z SundaSizeTiling finished after 1.930 seconds
2023-09-08T21:17:33Z OrigLayoutTilingPipeline finished after 3.809 seconds
2023-09-08T21:17:33Z Running TilingProfiler
2023-09-08T21:17:33Z TilingProfiler finished after 0.094 seconds
2023-09-08T21:17:33Z 
2023-09-08T21:17:33Z Diagnostic information:
2023-09-08T21:17:33Z   NeuronX Compiler version 2.9.0.40+07376825f
2023-09-08T21:17:33Z   
2023-09-08T21:17:33Z   Python version 3.8.10
2023-09-08T21:17:33Z   HWM version 2.9.0.2-f79d59e7b
2023-09-08T21:17:33Z   NumPy version 1.21.6
2023-09-08T21:17:33Z   
2023-09-08T21:17:33Z   Running on AMI ami-0d08bfe808787640a
2023-09-08T21:17:33Z   Running in region use1-az5
2023-09-08T21:17:33Z 
2023-09-08T21:17:33Z Diagnostic logs stored in /home/ubuntu/log-neuron-cc.txt

Lastly the log-neuron-cc.txt:

2023-09-08T21:17:22Z INFO 238269 [root]: /opt/aws_neuron_venv_pytorch/bin/neuronx-cc compile sdxl_compile_dir_832x1216/vae_decoder/model --framework XLA --target trn1 --output sdxl_compile_dir_832x1216/vae_decoder/graph.neff
2023-09-08T21:17:22Z INFO 238334 [root]: TVM/Relay detected
2023-09-08T21:17:22Z INFO 238334 [root]: Pipeline: Frontend HHChecker WalrusDriver BIRLinker Kelper
2023-09-08T21:17:22Z INFO 238334 [root]: Intermediate files stored in /home/ubuntu/neuronxcc-5l2tcm31, output in /home/ubuntu
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Job Pipeline len(in_states) 1
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Processing input #0
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Running pipeline Pipeline.0
2023-09-08T21:17:22Z INFO 238334 [pipeline.Pipeline.0]: Starting job job.Frontend.0
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Job Frontend len(in_states) 1
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Processing input #0
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Start model loading
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: IR signature: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 for model
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Executing: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /home/ubuntu/sdxl_compile_dir_832x1216/vae_decoder/model --out-dir ./ --output penguin.py --layers-per-module=1 --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops
2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: 
Histogram before graph level optimizations:
total HLO instructions: 1614
           broadcast       452  28.00% ################################################################
             reshape       364  22.55% ###################################################
            constant       294  18.22% #########################################
            multiply       167  10.35% #######################
                 add       113   7.00% ################
           transpose        57   3.53% ########
         convolution        35   2.17% ####
 batch-norm-training        30   1.86% ####
   get-tuple-element        30   1.86% ####
                tanh        29   1.80% ####
              divide        16   0.99% ##
                call        15   0.93% ##
                 dot         6   0.37% 
              reduce         2   0.12% 
         exponential         1   0.06% 
           parameter         1   0.06% 
            subtract         1   0.06% 
               tuple         1   0.06% 


Histogram before graph level optimizations:
total HLO instructions: 1614
           broadcast       452  28.00% ################################################################
             reshape       364  22.55% ###################################################
            constant       294  18.22% #########################################
            multiply       167  10.35% #######################
                 add       113   7.00% ################
           transpose        57   3.53% ########
         convolution        35   2.17% ####
 batch-norm-training        30   1.86% ####
   get-tuple-element        30   1.86% ####
                tanh        29   1.80% ####
              divide        16   0.99% ##
                call        15   0.93% ##
                 dot         6   0.37% 
              reduce         2   0.12% 
         exponential         1   0.06% 
           parameter         1   0.06% 
            subtract         1   0.06% 
               tuple         1   0.06% 

INFO: IoStatistics: total inputs: 1
INFO: IoStatistics: total outputs: 1
INFO: IoStatistics: total passthrough tensors: 0
INFO: IoStatistics: total outputs read from: 0
INFO: IoStatistics: total redundant outputs: 0
Replaced 0 dropout sequences with OffloadedDropout
INFO: HloMacCount has found 5025528358400
INFO: Traffic has found 12393472
INFO: AIF 810996.04

Histogram after graph level optimizations:
total HLO instructions: 758
            constant       143  18.87% ################################################################
            multiply       118  15.57% ####################################################
                 add       113  14.91% ##################################################
           broadcast       110  14.51% #################################################
             reshape        73   9.63% ################################
           transpose        49   6.46% #####################
         convolution        35   4.62% ###############
 batch-norm-training        30   3.96% #############
   get-tuple-element        30   3.96% #############
                tanh        29   3.83% ############
         custom-call        15   1.98% ######
                 dot         6   0.79% ##
              reduce         2   0.26% 
         exponential         1   0.13% 
           parameter         1   0.13% 
              divide         1   0.13% 
            subtract         1   0.13% 
               tuple         1   0.13% 

HLO Ops used in computation: add batch-norm-training broadcast constant convolution custom-call divide dot exponential get-tuple-element multiply parameter reduce reshape subtract tanh transpose tuple 
Invoking RemoveOptimizationBarriers pass
Invoking NeuronInstCombine pass.
Total SqrtMul sequences deleted = 0

2023-09-08T21:17:22Z INFO 238334 [job.Frontend.0]: Start tensorization
2023-09-08T21:17:22Z WARNING 238334 [job.Frontend.0]: TVM not detected.
2023-09-08T21:17:23Z INFO 238334 [job.Frontend.0]: Num parallel jobs: 1
2023-09-08T21:17:23Z INFO 238334 [root/Tensorizer/All]: Enter time region
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Frontend found a single CU. Switching to flat flow.
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Building model from Penguin script "penguin.py"...
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Tensorizer options: --disable-bitcasted-transpose --dont-verify-after-all --fp32-cast=matmult-bf16 --mm-transpose-type=fp32 --disable-expensive-checks --disable-max-stride-tiling --enable-replication --max-local-tensor-tile-size-in-bytes=32768 --tensor-layout-p-order=0 --tensor-layout-b-order=1 --enable-advanced-delinearization --weight-coalescing-threshold=512 --enable-bir-converter=enable --sunda-batchnorm --enable-tritium-loopfusion --keep-remat-dma-transpose --enable-softmax-kernel
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Building model from Penguin script "penguin.py"...
2023-09-08T21:17:23Z INFO 238334 [Tensorizer]: Successfully built model.
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/DoNothing]: Running DoNothing
2023-09-08T21:17:23Z INFO 238334 [DoNothing]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/DoNothing]: DoNothing finished after 0.000 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/CanonicalizeIR]: Running CanonicalizeIR
2023-09-08T21:17:23Z INFO 238334 [CanonicalizeIR]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/CanonicalizeIR]: CanonicalizeIR finished after 0.018 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ExpandBatchNorm]: Running ExpandBatchNorm
2023-09-08T21:17:23Z INFO 238334 [ExpandBatchNorm]: Finished (changed=True)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ExpandBatchNorm]: ExpandBatchNorm finished after 0.057 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ResolveComplicatePredicates]: Running ResolveComplicatePredicates
2023-09-08T21:17:23Z INFO 238334 [ResolveComplicatePredicates]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/ResolveComplicatePredicates]: ResolveComplicatePredicates finished after 0.017 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/AffinePredicateResolution]: Running AffinePredicateResolution
2023-09-08T21:17:23Z INFO 238334 [AffinePredicateResolution]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/AffinePredicateResolution]: AffinePredicateResolution finished after 0.019 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/EliminateDivs]: Running EliminateDivs
2023-09-08T21:17:23Z INFO 238334 [EliminateDivs]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/EliminateDivs]: EliminateDivs finished after 0.018 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/PerfectLoopNest]: Running PerfectLoopNest
2023-09-08T21:17:23Z INFO 238334 [PerfectLoopNest]: Finished (changed=False)
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/PerfectLoopNest]: PerfectLoopNest finished after 0.016 seconds
2023-09-08T21:17:23Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:24Z INFO 238334 [Simplifier]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.223 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:24Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.015 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TCTransform]: Running TCTransform
2023-09-08T21:17:24Z INFO 238334 [TCTransform]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TCTransform]: TCTransform finished after 0.027 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: Running CommuteConcat
2023-09-08T21:17:24Z INFO 238334 [CommuteConcat]: Finished (changed=False)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: CommuteConcat finished after 0.016 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpFusion]: Running TensorOpFusion
2023-09-08T21:17:24Z INFO 238334 [TensorOpFusion]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpFusion]: TensorOpFusion finished after 0.018 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpTransform]: Running TensorOpTransform
2023-09-08T21:17:24Z INFO 238334 [TensorOpTransform]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/TensorOpTransform]: TensorOpTransform finished after 0.060 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/LowerTensorOp]: Running LowerTensorOp
2023-09-08T21:17:24Z INFO 238334 [LowerTensorOp]: Finished (changed=True)
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/LowerTensorOp]: LowerTensorOp finished after 0.017 seconds
2023-09-08T21:17:24Z USER 238334 [sg0000/Tensorizer/MemcpyElimination]: Running MemcpyElimination
2023-09-08T21:17:25Z INFO 238334 [MemcpyElimination]: Finished (changed=True)
2023-09-08T21:17:25Z USER 238334 [sg0000/Tensorizer/MemcpyElimination]: MemcpyElimination finished after 1.058 seconds
2023-09-08T21:17:25Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:26Z INFO 238334 [LoopFusion]: Finished (changed=True)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 1.182 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:26Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.112 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:26Z INFO 238334 [Delinearization]: Finished (changed=True)
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.052 seconds
2023-09-08T21:17:26Z USER 238334 [sg0000/Tensorizer/DeadStoreElimination]: Running DeadStoreElimination
2023-09-08T21:17:28Z INFO 238334 [DeadStoreElimination]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/DeadStoreElimination]: DeadStoreElimination finished after 1.288 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:28Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.116 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=True)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.064 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:28Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.019 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:28Z INFO 238334 [LoopFusion]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 0.224 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/SimplifySlice]: Running SimplifySlice
2023-09-08T21:17:28Z INFO 238334 [SimplifySlice]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/SimplifySlice]: SimplifySlice finished after 0.007 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.019 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:28Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.114 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: Running ValueNumbering
2023-09-08T21:17:28Z INFO 238334 [ValueNumbering]: Finished (changed=True)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: ValueNumbering finished after 0.036 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:28Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.018 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/PadElimination]: Running PadElimination
2023-09-08T21:17:28Z INFO 238334 [PadElimination]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/PadElimination]: PadElimination finished after 0.001 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:28Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.058 seconds
2023-09-08T21:17:28Z USER 238334 [sg0000/Tensorizer/LoopFusion]: Running LoopFusion
2023-09-08T21:17:29Z INFO 238334 [LoopFusion]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LoopFusion]: LoopFusion finished after 0.218 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:29Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.007 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:29Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.111 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:29Z INFO 238334 [LICM]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: Running ValueNumbering
2023-09-08T21:17:29Z INFO 238334 [ValueNumbering]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ValueNumbering]: ValueNumbering finished after 0.024 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/TCTransform]: Running TCTransform
2023-09-08T21:17:29Z INFO 238334 [TCTransform]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/TCTransform]: TCTransform finished after 0.010 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: Running CommuteConcat
2023-09-08T21:17:29Z INFO 238334 [CommuteConcat]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/CommuteConcat]: CommuteConcat finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/RecognizeOpIdiom]: Running RecognizeOpIdiom
2023-09-08T21:17:29Z INFO 238334 [RecognizeOpIdiom]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/RecognizeOpIdiom]: RecognizeOpIdiom finished after 0.047 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MaskPropagation]: Running MaskPropagation
2023-09-08T21:17:29Z INFO 238334 [MaskPropagation]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MaskPropagation]: MaskPropagation finished after 0.023 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Recompute]: Running Recompute
2023-09-08T21:17:29Z INFO 238334 [Recompute]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Recompute]: Recompute finished after 0.001 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: Running DeadCodeElimination
2023-09-08T21:17:29Z INFO 238334 [DeadCodeElimination]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z INFO 238334 [Tensorizer]: After optimization: 138 statements
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DoNothing]: Running DoNothing
2023-09-08T21:17:29Z INFO 238334 [DoNothing]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DoNothing]: DoNothing finished after 0.000 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MutateDataType]: Running MutateDataType
2023-09-08T21:17:29Z INFO 238334 [MutateDataType]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/MutateDataType]: MutateDataType finished after 0.006 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/AutoCastTCInputs]: Running AutoCastTCInputs
2023-09-08T21:17:29Z INFO 238334 [AutoCastTCInputs]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/AutoCastTCInputs]: AutoCastTCInputs finished after 0.015 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: Running GenericAccessSimplifier
2023-09-08T21:17:29Z INFO 238334 [GenericAccessSimplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GenericAccessSimplifier]: GenericAccessSimplifier finished after 0.009 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Running Simplifier
2023-09-08T21:17:29Z INFO 238334 [Simplifier]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Simplifier]: Simplifier finished after 0.114 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LegalizeCCOpLayout]: Running LegalizeCCOpLayout
2023-09-08T21:17:29Z INFO 238334 [LegalizeCCOpLayout]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LegalizeCCOpLayout]: LegalizeCCOpLayout finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Delinearization]: Running Delinearization
2023-09-08T21:17:29Z INFO 238334 [Delinearization]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/Delinearization]: Delinearization finished after 0.017 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.018 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: Running DeadCodeElimination
2023-09-08T21:17:29Z INFO 238334 [DeadCodeElimination]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DeadCodeElimination]: DeadCodeElimination finished after 0.008 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/InferIntrinsicOnCC]: Running InferIntrinsicOnCC
2023-09-08T21:17:29Z INFO 238334 [InferIntrinsicOnCC]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/InferIntrinsicOnCC]: InferIntrinsicOnCC finished after 0.099 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ResolveAccessConflict]: Running ResolveAccessConflict
2023-09-08T21:17:29Z INFO 238334 [ResolveAccessConflict]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/ResolveAccessConflict]: ResolveAccessConflict finished after 0.065 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: Running LICM
2023-09-08T21:17:29Z INFO 238334 [LICM]: Finished (changed=True)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LICM]: LICM finished after 0.056 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LocalLayoutOpt]: Running LocalLayoutOpt
2023-09-08T21:17:29Z INFO 238334 [LocalLayoutOpt]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/LocalLayoutOpt]: LocalLayoutOpt finished after 0.053 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: Running DelinearIndices
2023-09-08T21:17:29Z INFO 238334 [DelinearIndices]: Finished (changed=False)
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/DelinearIndices]: DelinearIndices finished after 0.019 seconds
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/OrigLayoutTilingPipeline]: Running OrigLayoutTilingPipeline
2023-09-08T21:17:29Z USER 238334 [sg0000/Tensorizer/GlobalLayoutOpt]: Running GlobalLayoutOpt
2023-09-08T21:17:31Z INFO 238334 [GlobalLayoutOpt]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/GlobalLayoutOpt]: GlobalLayoutOpt finished after 1.704 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/CanonicalizeDAG]: Running CanonicalizeDAG
2023-09-08T21:17:31Z INFO 238334 [CanonicalizeDAG]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/CanonicalizeDAG]: CanonicalizeDAG finished after 0.082 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/FlattenAxesForTiling]: Running FlattenAxesForTiling
2023-09-08T21:17:31Z INFO 238334 [FlattenAxesForTiling]: Finished (changed=True)
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/FlattenAxesForTiling]: FlattenAxesForTiling finished after 0.075 seconds
2023-09-08T21:17:31Z USER 238334 [sg0000/Tensorizer/SundaSizeTiling]: Running SundaSizeTiling

Can't compile Stable Diffusion 2.1 512x512 for inference

I am following the example notebook Stable Diffusion 2.1 512x512 but can't compile the model on an inf2.xlarge instance.

After a number of correctly compiled steps that look like the following:

Compiler status PASS

I get an error message:

2023-05-04 19:31:29.000758: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 329, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 402, in call_function_sync
    res = fun(*args, **kwargs)
  File "/root/sd_2_1_inf.py", line 152, in compile_model
    decoder_neuron = torch_neuronx.trace(
  File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 309, in trace
    neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
  File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 232, in hlo_compile
    raise RuntimeError(f'neuronx-cc failed with {status}')
RuntimeError: neuronx-cc failed with -9

Is this a known issue? What's the recommended setup in terms of library versions and instance types to be able to compile Stable Diffusion 2.1?
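
A note on reading the status code: assuming torch_neuronx launches neuronx-cc as a subprocess and reports its return code (the traceback suggests this, but it is an assumption), a negative status on Linux conventionally means the process was terminated by the signal with that number. Status -9 would then be SIGKILL, which is what the kernel's out-of-memory killer sends. That would fit a small instance such as inf2.xlarge, but nothing in the log above confirms it. The helper name below is hypothetical, a minimal sketch for decoding such codes:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if status < 0:
        # Negative codes follow the Python subprocess convention: -N means signal N.
        try:
            name = signal.Signals(-status).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-status} ({name})"
    if status == 0:
        return "exited successfully"
    return f"exited with error code {status}"


print(describe_exit_status(-9))
```

If the cause is memory, checking `dmesg` for out-of-memory messages after the failure, or compiling on an instance with more host RAM, would be reasonable next steps.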

Error about missing model when executing LLama2 notebook

After installing the missing transformers library, I get this error:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6501c5f9-3743f5630dd72644195e9e21;e4840366-a585-4130-9e9a-1da114b8ec72)

Repository Not Found for url: https://huggingface.co/Llama-2-13b/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
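
One way to read this error: Hugging Face `from_pretrained` treats its argument as a local directory if one exists at that path, and otherwise as a Hub repo id. A bare name like `Llama-2-13b` therefore falls through to a Hub lookup when no folder of that name is present in the working directory. Whether the notebook expects a local folder with converted weights is an assumption here. A minimal sketch of a pre-flight check, where `classify_model_source` is a hypothetical helper and not part of transformers:

```python
import os
import tempfile


def classify_model_source(name_or_path: str) -> str:
    """Return "local" if the argument is an existing directory, else "hub"."""
    if os.path.isdir(name_or_path):
        return "local"
    return "hub"


# An existing directory is classified as a local model folder.
with tempfile.TemporaryDirectory() as tmp:
    print(classify_model_source(tmp))

# A bare name resolves to the Hub unless such a folder exists in the cwd.
print(classify_model_source("Llama-2-13b"))
```

If the result is "hub" for a name that was meant to be a local folder, the weights have probably not been downloaded or converted into that path yet.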

Inference on llama 13B

prompt = '''Translate English to French:

starfish => étoile de mer
campfire => feu de camp
snowflake => flocon de neige
dragonfly => libellule
maple tree => érable
thunderstorm => orage
seashell => coquillage
waterfall => cascade
hummingbird => colibri
pine cone => pomme de pin
lighthouse => phare
dandelion => pissenlit
cheese =>
'''
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference

with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, temperature=0.1, sequence_length=200, top_p=0.9)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

I expect the output to be the translation of cheese (fromage), but instead I get the entire prompt back.

Is there a Neuron equivalent of parameters such as return_full_text=False? This prompt works well in the Llama playground but not on Neuron. I don't want to generate paragraphs of output; I want to use this for a text extraction task.
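
One possible approach, assuming `neuron_model.sample` returns each sequence as the prompt token ids followed by the newly generated ids (this matches the output described above, but is an assumption about the API): drop the first `input_ids.shape[1]` tokens before decoding. The sketch below shows the slicing with plain Python lists so it runs without Neuron hardware; `strip_prompt` is a hypothetical helper.

```python
from typing import List, Sequence


def strip_prompt(sequences: Sequence[Sequence[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first prompt_len token ids from every generated sequence."""
    return [list(seq[prompt_len:]) for seq in sequences]


# Toy token ids: the first four stand in for the prompt, the rest are new tokens.
prompt_ids = [101, 7, 8, 9]
generated = [[101, 7, 8, 9, 42, 43]]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[42, 43]]
```

With tensors the same idea would be `generated_sequences[:, input_ids.shape[1]:]` before calling `tokenizer.decode`. To keep only the translated word, cutting the decoded text at the first newline is one option worth trying.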

Can't compile SD2.1 VAE with Batch Input

I changed the batch size of the trace tensor inputs in the hf_pretrained_sd2_512_inference.ipynb notebook. The text encoder, UNet, and vae_post_quant_conv compiled, but the VAE decoder did not.

batch=2

import torch_neuronx
from diffusers import StableDiffusionPipeline
import torch
import os, copy

COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512_batch2'
model_id = "stabilityai/stable-diffusion-2-1-base"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

decoder_in = torch.randn([2, 4, 64, 64])
decoder_neuron = torch_neuronx.trace(
    decoder, 
    decoder_in, 
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)

decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

del decoder
del decoder_neuron

I get this error message:

Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 234016.96it/s]
Selecting 161763 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 137856 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 272047 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 52275 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 318165 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 8981 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 323589 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: couldn't allocate every tensor in SB
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: disabling special handling of accumulation groups
Selecting 323589 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 2233 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
Selecting 325190 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: couldn't allocate every tensor in SB and spilling can't help
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: 10 biggest memlocs:
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i6_ReloadStore32338_ReloadStore166495 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i0_ReloadStore32560_Remat_166496 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166430 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_259_i0 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i1 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i6 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166431 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7_ReloadStore32024_Remat_121920_Remat_166327 65536

I used an inf2.8xlarge instance with 100 GB of swap space. Any ideas on this batch input compilation problem?

Cannot compile Bart generate-optimised decode

I'm trying to compile Bart for text2text generation on an Inf2 server. I am aware that optimum-neuron has a Bart implementation, but I need to be able to make customizations that are incompatible with the pipeline system.

Bart is implemented so that if you pass past_key_values, you can provide only the last decoder input ID rather than the whole sequence. This speeds up attention: each step takes linear rather than quadratic time, because attention only has to run for one position rather than for all positions so far. This is an important compute optimisation.
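To make the linear-versus-quadratic claim concrete, here is a rough count of query-key dot products per decoding step. It is a simplified sketch (one head, one self-attention layer, no savings from the causal mask), and `score_count` is a hypothetical helper, not anything from transformers.

```python
def score_count(step, use_cache):
    """Query-key dot products in one self-attention layer at 1-indexed `step`.

    Simplification: single head, single layer, and no credit for the
    causal mask when the cache is not used.
    """
    if use_cache:
        # Only the newest token issues a query, against `step` keys.
        return step
    # Every position issues a query against every position so far.
    return step * step


for step in (1, 8, 128):
    print(step, score_count(step, True), score_count(step, False))
```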

When I try to trace a call to Bart that uses this optimisation, I get an error:

2023-07-04T12:38:29Z ERROR 26864 [Tensorizer]: Transformation error on operator: mlir.function
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]:  An Internal Compiler Error has occurred
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: 
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: Error message:  too many values to unpack (expected 1)

Steps to reproduce:

from transformers import BartForConditionalGeneration, BartTokenizerFast, BartConfig
import copy
import torch
import torch.nn.functional as F
import torch_neuronx
import transformers

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base', torchscript=True)

example_sentence = "Hello, my name is Billy."
tokeniser = BartTokenizerFast.from_pretrained('facebook/bart-base')
tokens = tokeniser(example_sentence, return_tensors='pt')

inputs = (None,None, None, None, None, None, None, (torch.zeros((1, 128, 768)),), None, None, torch.zeros((1, 9, 768)))
outputs = model(*inputs)

class BartForNeuronDecoder(torch.nn.Module):
    
    def __init__(self, bart):
        super().__init__()
        
        self.bart = bart
        
    def forward(
        self,
        decoder_input_ids, # 1 token per batch
        encoder_outputs,
        attention_mask, # for encoder outputs
        past_key_values, # max_len - 1
    ):
        

        outputs = self.bart.model(
            encoder_outputs=encoder_outputs, 
            attention_mask=attention_mask, 
            decoder_input_ids=decoder_input_ids, 
            past_key_values=past_key_values,
            use_cache=True,
        )
        lm_logits = self.bart.lm_head(outputs[0]) + self.bart.final_logits_bias
        
        return (
            lm_logits,
            outputs[1]
        )

wrapped_model = BartForNeuronDecoder(model)

def pad_key_values(past_key_values, max_len):
    padded_key_values = ()
    for layer in past_key_values:
        padded_layer = ()
        for i in [0,1]:
            padded_layer = padded_layer + (F.pad(layer[i], pad=(0,0,0, max_len - 1 - layer[i].shape[2])),)
        padded_layer = padded_layer + layer[2:]
        padded_key_values = padded_key_values + (padded_layer,)
    
    return padded_key_values

pkv = pad_key_values(outputs[1], 128)

args = (torch.tensor([[0]]), (torch.zeros((1, 128, 768)),), torch.tensor([[1,1,1,1,1] + [0]*123]), pkv)
wrapped_model_neuron = torch_neuronx.trace(wrapped_model, args)

Let me know if you have trouble reproducing it and need additional details.

Many thanks.

Clearer instructions on how to get models from Hugging Face

It would be useful to add some information about how to obtain the models from Hugging Face and in particular:

use git clone
install LFS extension for git
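As a sketch of what those two steps could look like, the helper below only builds the command lines and does not run them. `clone_commands` is a hypothetical name, and `facebook/bart-base` is used purely as an example repository id.

```python
def clone_commands(repo_id):
    """Build the git commands for fetching a model repository.

    `repo_id` is the "<organisation>/<model>" name shown on the model page.
    """
    return [
        ["git", "lfs", "install"],  # one-time setup of the LFS hooks
        ["git", "clone", f"https://huggingface.co/{repo_id}"],
    ]


for command in clone_commands("facebook/bart-base"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` to execute it.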

Deploy meta-llama-2-13b-sampling.ipynb on inf2.24xlarge

Hi,

Is it possible to deploy meta-llama-2-13b-sampling.ipynb on an inf2.24xlarge machine?

Error when executing `neuron_model.to_neuron()`

Running the notebook https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on inf2.48x

Getting this error when executing the last cell of the notebook

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 neuron_model.to_neuron()

File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:117, in LlamaForSampling.to_neuron(self)
    115 self.decoder_lm_head_for_context = {}
    116 for context_length_estimate in self.context_buckets:
--> 117     model = self.decoder_lm_head.build_weight_shared(
    118         n_positions_list=[context_length_estimate],
    119         n_active_tokens=context_length_estimate,
    120         unroll=self.context_unroll,
    121         share_caches=True,
    122     )
    123     # PERF: No latency improvement seen in multi-layer models from executor
    124     if self.context_unroll == self.config.num_hidden_layers:

File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:157, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
    155     ln_lm_head_params.append(new.lm_head_bias)
    156 new.program = new._build_program()
--> 157 new.program.setup(new.layers, ln_lm_head_params)
    158 return new

File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:983, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
    982 def setup(self, layers, ln_lm_head_params):
--> 983     super().setup(layers, ln_lm_head_params)
    984     for npos, memory in zip(self.n_positions_list, self.memories):
    985         input_tensors = [*self.input_buffers]

File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/decoder.py:879, in DecoderProgram.setup(self, layers, ln_lm_head_params)
    876         kernel.neff_bytes = future.result()
    878 for kernel in self.kernels:
--> 879     kernel.load()

File ~/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/compiler.py:375, in ParallelKernel.load(self)
    374 def load(self):
--> 375     assert self.neff_bytes is not None, f"Try to load with neff bytes as None, might due to compilation failure"
    376     self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
    377     self.model.load()

AssertionError: Try to load with neff bytes as None, might due to compilation failure

Neuron Core Inference support for TrOCR

I'm trying to run inference for TrOCR on an Inf1 instance. I was able to compile and save the model as per the notebook, but model execution currently happens on the CPU and the Neuron cores are unutilized. Please provide a way for inference to make use of the Neuron cores.

import torch
import torch.neuron
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-handwritten") 
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-handwritten").eval()

max_length = 32
input_ids=torch.zeros([1, max_length], dtype=torch.int64)
attention_mask=torch.zeros([1, max_length], dtype=torch.int64)
encoder_hidden_states=torch.rand([1, 578, 384])
pad_size = torch.as_tensor(0)

xenc = torch.rand(1,3,384,384).float()
xdec = (input_ids, attention_mask, encoder_hidden_states, pad_size)

model.encoder.forward_neuron = torch.jit.load('troc_encoder_neuron.pt')
model.decoder.forward_neuron = torch.jit.load('troc_decoder_neuron.pt')

generated_ids = model.generate(xenc, pad_token_id=model.config.decoder.eos_token_id)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

%matplotlib inline
import os
import sys
import cv2
import urllib
import matplotlib.pyplot as plt
import time
if not '..' in sys.path: sys.path.append('..')

def load_sample_imgE():
    if not os.path.exists("text.jpg"):
        urllib.request.urlretrieve("https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg", "text.jpg")
    return cv2.imread("text.jpg")

max_len = 32
img = load_sample_imgE()

for i in range(10):
    pixel_values = processor(img, max_length=max_length, padding='max_length', 
                            truncate=True, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, pad_token_id=model.config.decoder.eos_token_id, max_length=max_len)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

t1 = time.time()
num_inf = 100
for i in range(num_inf):
    pixel_values = processor(img, max_length=max_length, padding='max_length', 
                            truncate=True, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, pad_token_id=model.config.decoder.eos_token_id, max_length=max_len)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
t2 = time.time()
print(f"Inf/sec: {num_inf/(t2-t1):.2f}")

print(generated_text)
plt.figure(figsize=(10,5))
plt.imshow(img)

[Question] Double-wrapper for the UNet in `hf_pretrained_sd2_512_inference.ipynb`

Hi guys, I know the reason is partly explained in the screenshot. Could you give a more detailed explanation or the specific reason?

meta llama2 13b sampling notebook example error with longer prompt

Model used:
Llama-2-13b-chat-hf

Successfully ran the prompt in notebook example:

prompt = "Hello, I'm a language model,"

input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

But it failed when I simply replaced the prompt with a longer text (updating sequence_length to 4096 gave the same result):

LONG = """Summarize the text below:
---
EXTENDING CONTEXT WINDOW OF LARGE LAN-
GUAGE MODELS VIA POSITION INTERPOLATION

Shouyuan Chen Sherman Wong Liangjian Chen  Yuandong Tian
Meta Platforms Inc.
{chenshouyuan, shermanwong, cli, yuandong}@meta . com

1 INTRODUCTION

Large language models (LLMs) typically come with a pre-defined context window size. For exam-
ple, inputs to LLaMA models (Touvron et al., 2023) must be fewer than 2048 tokens. This pre-set
context window limit is frequently exceeded in applications such as conducting long conversations,
summarizing long documents, or executing long-term planning. For these applications, LLMs with
longer context windows are preferred. However, training an LLM from scratch with long context
windows requires significant investments. This naturally leads to a question: Can we extend the
context window of an existing pre-trained LLM?

One straightforward approach is to fine-tune an existing pre-trained Transformer with a longer con-
text window. However, empirically, we found that models trained this way adapt to long context
windows very slowly. After training for more than 10000 batches, the effective context window
saw a minimal increase, moving from 2048 to 2560 (Table 4). This suggests that such method is
inefficient for extending to substantially longer context windows.

While certain techniques such as ALiBi (Press et al., 2022) and LeX (Sun et al., 2022) enable length
extrapolation of Transformers, i.e. train on short context windows and inference on longer ones,
many existing pre-trained LLMs, including LLaMA (Touvron et al., 2023), use positional encodings
that have weak extrapolation properties (e.g., RoPE (Su et al., 2021)). Therefore, the applicability
of these techniques for extending the context window sizes of such LLMs remains limited.

In this work, we introduce Position Interpolation to enable context window extensions for certain
existing pre-trained LLMs, including LLaMA. The key idea is, instead of extrapolation, we directly
down-scale the position indices so that the maximum position index matches the previous context
window limit in the pre-training stage. See Figure 1 for an illustration. In other words, to accom-
modate more input tokens, we interpolate the position encodings at neighboring integer positions,
utilizing the fact that position encodings can be applied on non-integer positions, as opposed to
extrapolating outside the trained positions, which may lead to catastrophic values. We verify our
approach theoretically, by showing that the interpolated attention score has a much smaller upper

bound (~ 600x smaller in LLaMA 7B setting) than the extrapolated one, and is thus much more
stable. Therefore, interpolated position encodings are easier for the model to adapt.

Empirically, we found that Position Interpolation is highly effective and efficient, requiring only a
very short period of fine-tuning for the model to fully adapt to greatly extended context windows.
We present experimental results for extending the context window to up to 32768 from the initial
2048 across 7B to 65B LLaMA models using Position Interpolation. Our results show that

1. Position Interpolation can easily enable very long context windows (e.g. 32768), requiring
only fine-tuning for 1000 steps on the Pile (Gao et al., 2020) to achieve a good quality.
The cost of fine-tuning is negligible compared to the pre-training costs. This confirms
our hypothesis that it is relatively easy for the models to adapt to interpolated position
encodings.

2. Position Interpolation generates strong models that can effectively make use of much ex-
tended context window. We show that models extended by Position Interpolation enjoy
significant perplexity gains from greatly extended context windows for text modeling, and
we show that the perplexity reduces graceful with the enlargement of context windows.
We also applied Position Interpolation in a long text summarization task, and demonstrate
competitive performances.

3. Position Interpolation preserves model quality relatively well for tasks within its original
context window sizes. We present a variety of evaluation results for the extended LLaMA
models on the original LLaMA benchmark. Compared with original LLaMA models, the
extended LLLaM A models saw a minor degradation on several standard benchmarks within
a 2048 token limit.

Our results highlight the innate ability of Transformer models to “extrapolate to sequence lengths
longer than the ones encountered during training” as hypothesized in the seminal work of Vaswani
et al. (2017). We reaffirm this hypothesis and suggest that the previously known weakness of ex-
trapolating to longer sequences for language modeling (Press et al., 2022) may be due to direct

extrapolation of positional encodings and it can be largely mitigated by interpolating position en-
codings instead.

Concurrent work. Right before our release, we are informed with a concurrent blogpost (Super-
HOT kaiokendev (2023)) that also interpolates positional encoding in RoPE to extend the context
window from 2K to 8K. Recently, open source community picks it up in Reddit post ! and Github
Issues 2, which shows that fine-tuning with LoRA (Hu et al., 2021) also seems to work well. Our
paper shows a full fine-tuning with up to 65B model work well with Position Interpolation, and we
also give theoretical explanations why interpolation achieves much more stable results than extrap-
olation, by showing that the upper bound of interplated attention score is much lower than that of
extrapolated ones.

2 METHOD

2.1 BACKGROUND: ROTARY POSITION EMBEDDING (ROPE)

Transformer models require explicit positional information to be injected, typically in the form of
positional encodings, to represent the order of inputs. We consider Rotary Position Embedding
(ROPE) (Su et al., 2021), which is the position encoding used in the LLLaMA model (Touvron et al.,
2023). Given a position index m € [0, ¢) and an embedding vector x := [zg, 71,..., 241], Where
d is the dimension of the attention head, RoPE defines a vector-valued complex function f{x, m) as
follows

Using RoPE, the self-attention score
is only dependent on relative position m — 7 through trigonometric functions. Here q and k are the
query and key vector for a specific attention head. At each layer, RoPE is applied on both query and
key embeddings for computing attention scores.

2.2 DIRECT EXTRAPOLATION

While the attention score in RoPE only depends on the relative positions, which is what we want,
its extrapolation performance is not great . In particular, when directly extending to larger context
windows unseen in the training, the perplexity may shoot up to very high numbers (i.e., > 10%),
comparable to untrained models.

Ideally, we want to see the model trained on a context window of size L = 2048 to still work
reasonably well on longer context window, but may not have the capability to leverage information
that appears beyond L. For example, to answer a question located at 3000, the model trained on
maximal window size of I = 2048 cannot leverage evidences provided at location 0, but still
can leverage the evidences provided at location 2900. In contrast, in reality we see catastrophic
behaviors, i.e., question at location 3000 cannot be answered correctly, even if the evidences are
located at location 2900.

What is the reason behind? How could this happen if the attention score a,,,—,, decays as the relative
distance |m — n/| increases, according to Section 3.4.3 of (Su et al., 2021), and content from very
far distances should not matter that much? It turns out that the upper bound derived in Section 3.4.3
of (Su et al., 2021) may be too loose: while it indeed decays with respect to |m — nl, the bound
can still be quite large (i.e., the bound can be critically depends on the magnitude of v;) and thus
vacuous. In fact, if we treat all trigonometric functions as basis functions (i.e, ¢;(s) := #93), and
think about Eqn. 2 as basis expansion as the following:

where s is the positional span between a query and a key and h; := (ga; + igaj+1){k2j — tk2j+1)
are complex coefficients depending on q and k (here the definition of h; is exactly the same as the
definition of k; in Sec 3.4.3 in RoPE (Su et al., 2021)). Now the the issue becomes clear: as shown
in Fig. 2, a, can be small in magnitude in the range of [0, 2048], but gives huge values out of the
region. The underlying reason is that the trigonometric family {¢;} (with sufficiently large d) is
a universal approximator and can fit any arbitrary functions. Therefore, for a, there always exist
coefficients {h;} (i.e. key and query) that corresponds to small function values in [0, 2048] but

much larger in regions beyond.

---
"""

prompt = LONG
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=4096, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

Error log:

{
	"name": "StopIteration",
	"message": "",
	"stack": "---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/home/ubuntu/efs_project/inf2/meta-llama-2-13b-sampling.ipynb Cell 18 line 8
      6 with torch.inference_mode():
      7     start = time.time()
----> 8     generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
      9     elapsed = time.time() - start
     11 generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:210, in LlamaForSampling.sample(self, input_ids, sequence_length, start_ids, top_k, top_p, eos_token_override, temperature, streamer)
    207         # Sequence length cannot be greater than n_positions
    208         sequence_length = min(sequence_length, self.max_positions)
--> 210 result = sampling.sample_llama(
    211     self, input_ids, start_ids, sequence_length,
    212     eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
    213     top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer
    214 )
    216 if offset != 0:
    217     result = result[:, offset:]

File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/sampling.py:243, in sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
    241 _, start = input_ids.shape
    242 cache_ids = torch.arange(start, dtype=torch.int32)
--> 243 next_token_scores = model(input_ids, cache_ids, start_ids)
    244 return sample_loop_llama(
    245     model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer
    246 )

File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:179, in LlamaForSampling.forward(self, input_ids, cache_ids, start_ids)
    176 hidden = hidden.transpose(0, -1).contiguous()
    178 if context_length > 1:
--> 179     logits = self.context(hidden, cache_ids, start_ids)
    180 else:
    181     logits = self.decoder_lm_head(hidden, cache_ids, start_ids)

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:163, in LlamaForSampling.context(self, hidden, cache_ids, start_ids)
    161     cache_ids = torch.as_tensor([i], dtype=torch.int32)
    162     hidden_slice = hidden[:, i:i+1].contiguous()
--> 163     logits = self.decoder_lm_head(hidden_slice, cache_ids, start_ids)
    165 return logits

File /opt/conda/envs/inf2/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:186, in DecoderLmHeadForSamplingNoEmbedding.forward(self, *inputs)
    184 sequence_length = hidden.shape[sequence_dim]
    185 if sequence_length == 1:
--> 186     return self.forward_single(*inputs)
    187 if sequence_length % self.n_active_tokens:
    188     raise ValueError(f'sequence_length={sequence_length} cannot be divided by '
    189                      f'n_active_tokens={self.n_active_tokens}')

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:173, in DecoderLmHeadForSamplingNoEmbedding.forward_single(self, *inputs)
    165 \"\"\"
    166 Fast-path forward function which avoids as much overhead as possible.
    167 
   (...)
    170 etc.
    171 \"\"\"
    172 _, cache_ids, *_ = inputs
--> 173 bucket_id = self.program.find_bucket_id(cache_ids.item())
    174 if self.use_executor:
    175     return self.program.execute(bucket_id, *inputs, return_ranks=self.return_ranks)

File /opt/conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:903, in DecoderProgram.find_bucket_id(self, length)
    902 def find_bucket_id(self, length):
--> 903     return next(idx for idx, npos in enumerate(self.n_positions_list) if npos >= length)

StopIteration: "
}
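The trace ends in `find_bucket_id`, whose `next(...)` call has no default, so it raises `StopIteration` when no entry of `n_positions_list` is at least `length`. That would be consistent with a prompt that tokenizes to more positions than the model was compiled for, although the issue does not state the token count. The standalone sketch below reproduces only that expression; the bucket sizes are made up, and `find_bucket_id_checked` is a hypothetical variant, not library code.

```python
def find_bucket_id(n_positions_list, length):
    # Same expression as in the traceback: no default is given to next().
    return next(idx for idx, npos in enumerate(n_positions_list) if npos >= length)


def find_bucket_id_checked(n_positions_list, length):
    """Variant that reports the limit instead of raising StopIteration."""
    bucket = next(
        (idx for idx, npos in enumerate(n_positions_list) if npos >= length),
        None,
    )
    if bucket is None:
        raise ValueError(
            f"length {length} exceeds the largest bucket {max(n_positions_list)}"
        )
    return bucket


buckets = [128, 512, 2048]  # made-up bucket sizes for illustration
print(find_bucket_id(buckets, 300))
try:
    find_bucket_id(buckets, 2049)
except StopIteration:
    print("no bucket is large enough")
```

Comparing `input_ids.shape[1]` with the sequence length the model was compiled for, before calling `sample`, would show whether this is what happens here.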

Shell script variables confuse devices and cores

The sample shell scripts' variables confuse devices and cores. A Trn1 instance has 16 Neuron devices (chips), each with 2 cores, so 32 is a core count, not a device count.

This sample script shows, on line 31:

export NEURON_NUM_DEVICES=32

I think the correct code would be:

export NEURON_NUM_CORES=32

The deprecated Neuron Megatron example script shows it correctly:

NUM_NEURONCORES=32
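The arithmetic behind the report, as a tiny sketch using only the figures quoted above (the constant names are illustrative, not environment variables read by Neuron):

```python
NEURON_DEVICES = 16      # chips per instance, as stated above
CORES_PER_DEVICE = 2     # cores per chip, as stated above

total_cores = NEURON_DEVICES * CORES_PER_DEVICE
print(f"devices={NEURON_DEVICES} cores={total_cores}")
```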

Request for backwards compatibility with older tensorflow (2.4.0 - 2.7.0) versions. I am unable to trace a calamari OCR model.

The calamari model is trained on tensorflow 2.7.0, which uses the tfaip library, while the lowest tensorflow_neuron version (2.7.4) uses a higher keras version. Because of this version mismatch, I am unable to load my (calamari-ocr) keras model into the python environment with tensorflow-neuron, and so I cannot trace the model. If I save the model as a tensorflow saved model, the traced model's computation graph only runs on the CPU and not on the Inferentia chips. This is the repository: "https://github.com/Calamari-OCR/calamari".

Dependencies conflict in running Llama-2-13b autoregressive sampling on Inf2

Running notebook - https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on inf2.48xlarge

The error occurs while running the last block, at line 4:
from transformers_neuronx.llama.model import LlamaForSampling

results in:

>>> from transformers_neuronx.llama.model import LlamaForSampling
2023-Sep-27 06:59:32.0474 22340:22340 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/llama/model.py", line 17, in <module>
    from transformers_neuronx import decoder
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/decoder.py", line 18, in <module>
    from transformers_neuronx import compiler
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 33, in <module>
    from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)

huggingface/optimum-neuron#213 suggests updating to the latest version of torch-neuronx, and aws-neuron/transformers-neuronx#33 suggests a specific version, torch-neuronx-1.13.1.1.10.0.

When I tried installing that specific version, it failed with the following error.

python -m pip install torch-neuronx==1.13.1.1.10.0 -U
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting torch-neuronx==1.13.1.1.10.0
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.10.0-py3-none-any.whl (2.4 MB)
Requirement already satisfied: torch==1.13.* in ./aws_neuron_venv_pytorch/lib/python3.7/site-packages (from torch-neuronx==1.13.1.1.10.0) (1.13.1)
INFO: pip is looking at multiple versions of torch-neuronx to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch-xla==1.13.1+torchneurona (from torch-neuronx) (from versions: 1.0, 1.11.0+torchneuron2, 1.11.0+torchneuron3, 1.12.0+torchneuron3, 1.13.0+torchneuron3, 1.13.0+torchneuron4, 1.13.0+torchneuron5, 1.13.1+torchneuron6, 1.13.1+torchneuron7, 1.13.1+torchneuron8)
ERROR: No matching distribution found for torch-xla==1.13.1+torchneurona

Additional info on the versions currently available:

pip index versions torch-neuronx
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-neuronx (1.13.1.1.11.0)
Available versions: 1.13.1.1.11.0, 1.13.1.1.10.1, 1.13.1.1.10.0, 1.13.1.1.9.1, 1.13.1.1.9.0, 1.13.1.1.8.0, 1.13.1.1.7.0, 1.13.0.1.6.1, 1.13.0.1.6.0, 1.13.0.1.5.0, 1.13.0.1.4.0, 1.12.0.1.4.0, 1.11.0.1.2.0, 1.11.0.1.1.1, 1.0
  INSTALLED: 1.13.1.1.9.1
  LATEST:    1.13.1.1.11.0


pip index versions torch-xla
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-xla (1.13.1+torchneuron8)
Available versions: 1.13.1+torchneuron8, 1.13.1+torchneuron7, 1.13.1+torchneuron6, 1.13.0+torchneuron5, 1.13.0+torchneuron4, 1.13.0+torchneuron3, 1.12.0+torchneuron3, 1.11.0+torchneuron3, 1.11.0+torchneuron2, 1.0
  INSTALLED: 1.13.1+torchneuron8
  LATEST:    1.13.1+torchneuron8

The following packages are installed:

anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
attrs==23.1.0
aws-neuronx-runtime-discovery==2.9
awscli==1.29.54
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
boto3==1.28.54
botocore==1.31.54
cached-property==1.5.2
cachetools==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cloud-tpu-client==0.10
colorama==0.4.4
comm==0.1.4
debugpy==1.7.0
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.10.0
entrypoints==0.4
environment-kernels==1.2.0
exceptiongroup==1.1.3
fastjsonschema==2.18.0
filelock==3.12.2
fsspec==2023.1.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.23.0
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.16.4
idna==3.4
importlib-metadata==6.7.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipykernel==6.16.2
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
islpy==2022.2.1
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.24.0
jupyter_client==7.4.9
jupyter_core==4.12.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.9
libneuronxla==0.5.413
lockfile==0.12.2
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mistune==3.0.1
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.6.0
nbformat==5.8.0
nest-asyncio==1.5.8
networkx==2.6.3
neuronx-cc==2.9.0.16+fa12ba55a
neuronx-hwm==2.9.0.1+f79d59e7b
notebook==6.5.6
notebook_shim==0.2.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
packaging==23.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pgzip==0.3.5
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
pluggy==1.2.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
Pygments==2.16.1
pyparsing==3.1.1
pyrsistent==0.19.3
pytest==7.4.2
python-daemon==3.0.1
python-dateutil==2.8.2
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.4
QtPy==2.4.0
regex==2023.8.8
requests==2.31.0
requests-unixsocket==0.3.0
rsa==4.7.2
s3transfer==0.6.2
safetensors==0.3.3
scipy==1.7.3
Send2Trash==1.8.2
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
terminado==0.17.1
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
torch==1.13.1
torch-neuronx==1.13.1.1.9.1
torch-xla==1.13.1+torchneuron8
torchvision==0.14.1
tornado==6.2
tqdm==4.66.1
traitlets==5.9.0
transformers==4.30.2
transformers-neuronx==0.7.84
typing_extensions==4.7.1
uritemplate==3.0.1
urllib3==1.26.16
wcwidth==0.2.6
webencodings==0.5.1
websocket-client==1.6.1
wget==3.2
widgetsnbextension==4.0.9
zipp==3.15.0

Llama3 8B 32K sample generates garbage

Model generates only garbage.

Sample: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb

NeuronSDK2.19 PyTorch 1.13.1

aws-neuronx-runtime-discovery 2.9
libneuronxla 0.5.1795
neuronx-cc 2.14.213.0+013d129b
neuronx-distributed 0.8.0
torch-neuronx 1.13.1.1.15.0
torch-xla 1.13.1+torchneuronf
transformers-neuronx 0.11.351

ii aws-neuronx-collectives 2.21.46.0-69b77134b amd64 neuron_ccom built using CMake
ii aws-neuronx-gpsimd-customop-lib 0.11.4.0 amd64 custom_op_trn1_install built using CMake
ii aws-neuronx-gpsimd-tools 0.11.3.0-36dcb86d4 amd64 gpsimd_tools built using CMake
ii aws-neuronx-runtime-lib 2.21.41.0-fb1705f5f amd64 neuron_runtime built using CMake
ii aws-neuronx-tools 2.18.3.0 amd64 Neuron profile and debug tools

Example from the notebook generates:
num_input_tokens: 26828
generated sequence 1. We propose a new gated linear recurrent unit (RG-LRU) that is efficient to compute on TPU-v3. 2. We propose Griffin, a hybrid model that mixes the RG-LRU with local attention. 3. Griffin and Hawk achieve comparable performance to Transformers on downstream tasks. 4. Griffin and Hawk extrapolate to longer sequences than Transformers. 5. Griffin and Hawk are more efficient than Transformers at inference. 6. Griffin and Hawk are efficient at copying and retrieval tasks. 7. Griffin and Hawk are efficient at training. 8. Griffin and Hawk are efficient at inference. 9. Griffin and Hawk are efficient at training. 10. Griffin and Hawk are efficient at inference. 11. Griffin and Hawk are efficient at training. 12. Griffin and Hawk are efficient at inference. 13. Griffin and Hawk are efficient at training. 14. Griffin and Hawk are efficient at inference. 15. Griffin and Hawk are efficient at training. 16. Griffin and Hawk are efficient at inference. 17. Griffin and Hawk are efficient at training. 18. Griffin and ..... and repeats the same thing for the rest of the 32K

<JSON_DOCUMENT>
{"a": invalid text, "b": how are you?}
</JSON_DOCUMENT>
Can you fix the given json document for me, please?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

Output
generated sequence користувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористувачassistantкористув and repeats.

TypeError: Got unsupported ScalarType BFloat16

When running the "Compile the model into an optimized TorchScript and save the TorchScript" step in torch-neuronx/inference/hf_pretrained_sd2_512_inference.ipynb, I got the error "Got unsupported ScalarType BFloat16".

list of parameters available for generate method in HuggingFaceGenerationModelAdapter class

Where can I find the list of parameters accepted by model.generate (Hugging Face generate support), used in the last step below? I want the output to exclude the prompt text.

model_cpu = LlamaForCausalLM.from_pretrained('models--meta-llama--Llama-2-13b-hf/')
model_neuron = neuron_model

Use the `HuggingFaceGenerationModelAdapter` to access the generate API

model = HuggingFaceGenerationModelAdapter(model_cpu.config, model_neuron)

Get a tokenizer and example input

tokenizer = AutoTokenizer.from_pretrained('models--meta-llama--Llama-2-13b-hf/')

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)

Run inference using temperature

model.reset_generation()

sample_output = model.generate(
input_ids=encoded_input.input_ids,
attention_mask=encoded_input.attention_mask,
do_sample=True,
max_length=256,
temperature=0.7,
)
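
On getting output without the prompt text: a minimal sketch, assuming the adapter behaves like the standard Hugging Face `generate` for decoder-only models and returns the prompt tokens followed by the new tokens. Plain lists stand in for tensors here; with tensors, the equivalent slice would be `sample_output[:, encoded_input.input_ids.shape[1]:]` before calling `tokenizer.decode`. The helper name and token ids are illustrative.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [seq[prompt_length:] for seq in sequences]

# Illustrative token ids: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [[1, 15043, 29892, 306]]
generated = [[1, 15043, 29892, 306, 626, 263, 4086]]

new_tokens = strip_prompt(generated, len(prompt_ids[0]))
print(new_tokens)
```

With left padding, as configured above, every row in a batch has the same padded prompt width, so one slice length works for the whole batch.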

Saving an attribute

Hi, I am exporting a model using torch neuron, but I can't find any reference on how to save a custom attribute in the model.

For instance, I would like to save the dimension of the input image as an integer, so that I can read the value back with something like:

model = torch.jit.load("/path/to/my/model.pt")
image_size = model.image_size
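
One way to keep such a value with the exported model, sketched under assumptions: store it in a small JSON file next to `model.pt` and read it back at load time. The file naming and helper functions below are hypothetical. `torch.jit.save` and `torch.jit.load` also accept an `_extra_files` argument that embeds extra files inside the archive itself, which may suit better if a single file is required.

```python
import json
import tempfile
from pathlib import Path

def save_metadata(model_path, **attributes):
    """Write attributes to '<model_path>.meta.json' next to the model file."""
    meta_path = Path(str(model_path) + ".meta.json")
    meta_path.write_text(json.dumps(attributes))
    return meta_path

def load_metadata(model_path):
    """Read back the attributes saved by save_metadata."""
    return json.loads(Path(str(model_path) + ".meta.json").read_text())

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pt"   # stand-in path; no model is written here
    save_metadata(model_path, image_size=512)
    image_size = load_metadata(model_path)["image_size"]

print(image_size)
```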

Running llama-2 7b Chat on inf2.8xlarge machine

When I try to load the model for inference following the steps given in the reference file: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

# load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()

I receive the following error:

{
	"name": "RuntimeError",
	"message": "Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
",
	"stack": "---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
\"\"\"
Traceback (most recent call last):
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 411, in compile
    self.build(tag=tag)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 418, in build
    self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 95, in compile_hlo_module
    neff_bytes = neuron_xla_compile(module_bytes, flags, input_format=\"hlo\", platform_target=\"trn1\",
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/__init__.py\", line 38, in neuron_xla_compile
    _neuron_cc_wrapper.neuron_xla_compile(
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 234, in neuron_xla_compile
    done = check_neff(compile_cache, neff_path,
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 77, in check_neff
    raise(RuntimeError(error_log))
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !

\"\"\"

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb Cell 7 line 1
      9 # load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
     10 neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
---> 11 neuron_model.to_neuron()

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
    120 self.decoder_lm_head_for_context = {}
    121 for context_length_estimate in self.context_buckets:
--> 122     model = self.decoder_lm_head.build_weight_shared(
    123         n_positions_list=[context_length_estimate],
    124         n_active_tokens=context_length_estimate,
    125         unroll=self.context_unroll,
    126         share_caches=True,
    127     )
    128     # PERF: No latency improvement seen in multi-layer models from executor
    129     if self.context_unroll == self.config.num_hidden_layers:

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
    161     ln_lm_head_params.append(new.lm_head_bias)
    162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
    164 return new

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
   1028 def setup(self, layers, ln_lm_head_params):
-> 1029     super().setup(layers, ln_lm_head_params)
   1030     for npos, memory in zip(self.n_positions_list, self.memories):
   1031         input_tensors = [*self.input_buffers]

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
    917         neff_bytes_futures.append(future)
    918     for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919         kernel.neff_bytes = future.result()
    921 for kernel in self.kernels:
    922     kernel.load(io_ring_cache_size)

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
"
}

Also, the reference file uses tp_degree=24 on inf2.48xlarge, which has 384 GB of accelerator memory. Since I am working with inf2.8xlarge, which has 32 GB of accelerator memory, I used tp_degree=2.
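
The memory reasoning above can be checked with rough arithmetic. This sketch counts only weight storage at 2 bytes per parameter (`amp='f16'`) and ignores the KV cache, activations, and runtime overhead, so it is a lower bound. The 7-billion parameter count is approximate; the 32 GB figure is the one quoted above. It says nothing about the `Too many instructions after unroll` error, which is raised by the compiler rather than at load time.

```python
def weight_gib(num_params, bytes_per_param=2):
    """Approximate weight footprint in GiB."""
    return num_params * bytes_per_param / 2**30

params = 7e9                  # approximate parameter count for a 7B model
tp_degree = 2
total = weight_gib(params)
per_core = total / tp_degree  # weights are sharded across the tensor-parallel ranks

print(f"total weights: {total:.1f} GiB")
print(f"per core at tp_degree={tp_degree}: {per_core:.1f} GiB")
print("weights alone fit in 32 GB:", total < 32)
```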

I have the following versions of the dependencies installed,

Requirement already satisfied: neuronx-cc==2.* in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (2.10.0.34+6c8792c6f)
Requirement already satisfied: transformers-neuronx in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (0.7.84)
aws-neuronx-dkms is already the newest version (2.13.4.0).
aws-neuronx-collectives is already the newest version (2.17.9.0-fb6d14044).
aws-neuronx-runtime-lib is already the newest version (2.17.7.0-df62e3f70).
aws-neuronx-tools is already the newest version (2.14.6.0).

Loss NaN results for run_clm

I got run_clm.py to compile on trn1.32xlarge, and the actual training runs as well. However, it reports NaN for both loss and perplexity.
Has this been observed? The directions I followed are from here

/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/core/_methods.py:178: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
100%|██████████| 2/2 [00:00<00:00,  2.55it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =        nan
  eval_runtime            = 0:00:07.21
  eval_samples            =        240
  eval_samples_per_second =      33.28
  eval_steps_per_second   =      0.277
  perplexity              =        nan
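
The two `nan` values above are likely one problem rather than two: in the Hugging Face `run_clm.py` example, perplexity is computed as the exponential of `eval_loss`, so a NaN loss propagates directly. A minimal sketch of that relationship (the finite loss value is made up):

```python
import math

def perplexity(eval_loss):
    """Perplexity as exp(loss), with overflow mapped to infinity."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")

print(perplexity(3.0))            # a finite loss gives a finite perplexity
print(perplexity(float("nan")))   # a NaN loss gives a NaN perplexity
```

So the thing to chase is where the loss first becomes NaN during training or evaluation.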

Model Compilation Issue in AWS Neuron Environment

Description

After running the code until the compilation part, the models do not exist. The compilation logs indicate that the process completes without errors, but the expected model file model.pt is missing from the directory sd2_compile_dir_768/unet/.

Steps to Reproduce

Activate the pre-built PyTorch-2.1 environment for Inf2, Trn*:
```
source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate
```
Run the provided template code from the repository:
```
python3 test3.py
```
Observe the logs and check for the existence of the model file in the specified directory.

Expected Behavior

The model file model.pt should be present in the directory sd2_compile_dir_768/unet/ after the compilation process completes.

Actual Behavior

The model file model.pt is missing from the directory sd2_compile_dir_768/unet/.

Compilation Logs

2024-05-30T09:32:51Z Running birverifier
2024-05-30T09:32:52Z birverifier finished after 1.166 seconds
2024-05-30T09:32:52Z Running codegen
2024-05-30T09:32:57Z isa_gen finished after 4.293 seconds
2024-05-30T09:32:58Z dma_desc_gen finished after 1.495 seconds
2024-05-30T09:33:01Z debug_info_gen finished after 2.790 seconds
2024-05-30T09:33:02Z codegen finished after 9.213 seconds
2024-05-30T09:33:02Z Running neff_packager
2024-05-30T09:33:29Z neff_packager finished after 27.627 seconds

Error Message

Traceback (most recent call last):
  File "/home/ubuntu/test3.py", line 124, in <module>
    pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/jit/_serialization.py", line 152, in load
    raise ValueError(f"The provided filename {f} does not exist")  # type: ignore[str-bytes-safe]
ValueError: The provided filename sd2_compile_dir_768/unet/model.pt does not exist
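
Since the compiler log ends without an error but `torch.jit.load` cannot find the file, it may help to inspect the output directory explicitly before loading. A small sketch using only the standard library; the helper name is hypothetical and the path is the one from the traceback:

```python
from pathlib import Path

def describe_artifact(path):
    """Report whether a compiled model file exists, and what is in its directory."""
    path = Path(path)
    if path.is_file():
        return f"{path} exists ({path.stat().st_size} bytes)"
    if path.parent.is_dir():
        siblings = sorted(p.name for p in path.parent.iterdir())
        return f"{path} is missing; directory contains: {siblings}"
    return f"{path} is missing; directory {path.parent} does not exist either"

print(describe_artifact("sd2_compile_dir_768/unet/model.pt"))
```

If the directory exists but is empty, the save step may not have run after tracing; if the directory itself is missing, the load step may be running from a different working directory than the compile step.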

Environment Details

OS: Ubuntu 22.04
AWS Neuron Environment: PyTorch-2.1
Instance Type: Inf2.xlarge

Additional Information

Key	Value
Repository	aws-neuron-samples
Template Used	hf_pretrained_sd2_768_inference.ipynb
Script	`test.py (for compilation)`

Error when executing the LLama2 inference notebook

Trying to run https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

but I get an error about the transformers library being missing.

Fix: added transformers to the pip install...

Using torch_neuronx models for Causal Language Models

The sample code for GPT2 at https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_gpt2_feature_extraction_on_trn1.ipynb recommends that we pad the input before passing to the forward.

torch_neuronx.trace() expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenizer output. Additionally, the input shape that's used during compilation must match the input shape that's used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference.

But it has been observed that padding to the right for Causal Models leads to inaccurate results as can be seen here: huggingface/transformers#14521 (comment)

Additionally, torch_neuronx supports dynamic input only along its first (batch) dimension, whereas for any causal LM the input grows along the sequence dimension after sampling in each subsequent forward pass.

Is there any recommended way/suggestions on how torch_neuronx can be used for Causal Language Models?
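
One common workaround for fixed input shapes, sketched here under assumptions rather than as a recommendation from the Neuron team: pad on the left to the compiled sequence length and pass an attention mask, so the last position always holds the most recent real token. Plain lists stand in for tensors, and the pad id and lengths are illustrative.

```python
def left_pad(token_ids, max_length, pad_id=0):
    """Left-pad (or left-truncate) to exactly max_length; return ids and attention mask."""
    token_ids = token_ids[-max_length:]          # keep the most recent tokens
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask

ids, mask = left_pad([11, 12, 13], max_length=6)
print(ids)
print(mask)
```

After each sampling step the new token is appended and the window re-padded, so the traced shape never changes. This recomputes the full sequence on every step, which is the cost of not having a KV cache.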

Deploy llama2 13b on inf2.24x

https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

What are the best ways to deploy the above model for fast inference from local machine and also support parallel requests?

RuntimeError: Neuron runtime cannot be initialized; cannot determine the number of available NeuronCores

I tried to run hf_pretrained_sd2_512_inference.ipynb on inf2.8xlarge with NeuronX Compiler version 2.10.0.34+6c8792c6f and got a RuntimeError when loading the model, even though compilation finished successfully.
The message "RuntimeError: Neuron runtime cannot be initialized; cannot determine the number of available NeuronCores" appears when I try to load the unet onto the NeuronCores with the following script:
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
Any idea?
thanks

Fast loading of neural network to Inf2

I managed to compile the notebook in the samples to load an OPT model in inf2 chips. :slight_smile:

However, at one point I load the network and put it to neuron.

neuron_model = OPTForSampling.from_pretrained('./opt-13b-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.to_neuron()

and if I take a smaller model and increase the batch size, it can take ages (20 minutes or so).

Since I try to dockerize my network, can I somehow speed that up, such that my containers start up fast on Kubernetes?

SD_1_5 Unet Compile

Hello everybody. I got this error when trying to compile unet for sd 1.5. Even after reducing the image dimension to 256, the issue persists. Do you guys have any suggestions?

2023-12-26T08:37:54Z ERROR 26199 [job.WalrusDriver.0]: Backend exited with code -9 and stderr:
2023-12-26T08:37:54Z INFO 26191 [root]: Subcommand returned with exitcode=-9
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: ***************************************************************
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: An Internal Compiler Error has occurred
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: ***************************************************************
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: [F137] neuronx-cc was forcibly killed - This most commonly occurs due to insufficient system memory. Using a smaller data type, dimensions, batch size, or a larger instance type may help.
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: Internal details:
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: Type: <class 'RuntimeError'>
2023-12-26T08:37:54Z ERROR 26191 [neuronxcc.driver.CommandDriver]: File "neuronxcc/driver/CommandDriver.py", line 329, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Diagnostic information:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: NeuronX Compiler version 2.12.54.0+f631c2365
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Python version 3.8.10
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: HWM version 2.12.0.0-422c9037c
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: NumPy version 1.24.4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Running on AMI ami-0fdb13d8e11515ea4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Running in region use1-az4
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]:
2023-12-26T08:37:54Z USER 26191 [neuronxcc.driver.CommandDriver]: Diagnostic logs stored in /home/ubuntu/dungtt/AI-Art/log-neuron-cc.txt
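
The `exitcode=-9` in the log is consistent with the `[F137]` message: by Python's subprocess convention a negative return code means the process was terminated by that signal number, and signal 9 is `SIGKILL`, which is what the Linux out-of-memory killer sends. A short sketch decoding it (the helper name is hypothetical):

```python
import signal

def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"

print(describe_exit(-9))
```

This suggests the host ran out of memory during compilation, so reducing the image dimension alone may not be enough on a small instance.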

Running Multiple Models on Different NeuronCores

Thank you for providing various samples. I have a specific question regarding the setup for running multiple models.

Context:

I have two models: ModelA and ModelB.
I want to run inference on both models to obtain outputs: OutputA and OutputB.
After obtaining the outputs, I plan to concatenate OutputA and OutputB and feed them into a third model, ModelC.

Questions:

Can I run ModelA and ModelB on different NeuronCores in parallel? If so, how can I specify a NeuronCore?
Is this setup ideal for achieving low latency?

FYI: I plan to experiment using an inf1 instance.
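
The data flow described above can be sketched with a thread pool: ModelA and ModelB run concurrently, then their outputs are concatenated for ModelC. The three functions below are hypothetical stand-ins for loaded Neuron models, and lists stand in for tensors. Which NeuronCore each model lands on is outside this sketch; the Neuron runtime exposes the `NEURON_RT_VISIBLE_CORES` environment variable for restricting a process to specific cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for models loaded with torch.jit.load.
def model_a(x):
    return [v * 2 for v in x]

def model_b(x):
    return [v + 1 for v in x]

def model_c(features):
    return sum(features)

def run_pipeline(x):
    """Run A and B in parallel, concatenate their outputs, then run C."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(model_a, x)
        future_b = pool.submit(model_b, x)
        output_a, output_b = future_a.result(), future_b.result()
    return model_c(output_a + output_b)

result = run_pipeline([1, 2, 3])
print(result)
```

Threads only help here if the model call releases the GIL while waiting on the device, which is an assumption worth measuring on the target instance.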

Is it possible to run YoloV8 on Neuron?

There are examples how to use yolov5-7. Does Neuron support yolov8?

Running llama13b on inf2.24x

The llama13b notebook runs fine on an inf2.48x instance. When running it on inf2.24x, I reduced the tp_degree from 24 to 12, but the code throws an error in the following step:

neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=12, amp='f16')
neuron_model.to_neuron()

Error
FileNotFoundError: [Errno 2] No such file or directory: 'neuronx-cc'

Is this notebook supported on a 24x instance? Or what else might be missing? The environment setup is the same in both cases.
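
`FileNotFoundError: ... 'neuronx-cc'` means the process could not find the compiler executable on its `PATH`, which points at the environment rather than the instance size. A quick check to run in the same kernel or shell that runs the notebook (a sketch; the helper name is hypothetical):

```python
import shutil
import sys

def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)

print("python:", sys.executable)
print("neuronx-cc:", find_executable("neuronx-cc"))
```

If this prints `None` while `pip` shows `neuronx-cc` installed, the Jupyter kernel may be running with a `PATH` that omits the virtual environment's `bin` directory.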

Decoder fails to compile the MarianMT example notebook

I am trying to compile the MarianMT language translation model for Inf1 instance.

kernel version = 5.4.228-131.415.amzn2.x86_64
Instance type on which the compilation was attempted = Inf1.2xlarge, Amazon Linux 2 AMI,

Following is my pip freeze output

# pip freeze
torch==1.7.1
torch-neuron==1.7.1.2.5.8.0
transformers==4.0.1
tensorflow==1.15.5
sentencepiece==0.1.97
absl-py==1.4.0
astor==0.8.1
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
decorator==5.1.1
dmlc-nnvm==1.13.0.0+0
dmlc-topi==1.13.0.0+0
dmlc-tvm==1.13.0.0+0
exceptiongroup==1.1.0
filelock==3.9.0
gast==0.2.2
google-pasta==0.2.0
grpcio==1.51.1
h5py==2.10.0
idna==3.4
importlib-metadata==6.0.0
inferentia-hwm==1.13.0.0+0
iniconfig==2.0.0
islpy==2021.1+aws2021.x.80.0.bld0
joblib==1.2.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.4.1
MarkupSafe==2.1.2
networkx==2.5
neuron-cc==1.13.5.0+7dcf000a6
numpy==1.18.5
opt-einsum==3.3.0
packaging==23.0
Pillow==9.4.0
pluggy==1.0.0
protobuf==3.20.1
pytest==7.2.1
regex==2022.10.31
requests==2.28.2
sacremoses==0.0.53
scipy==1.4.1
six==1.16.0
tensorboard==1.15.0
tensorflow-estimator==1.15.1
termcolor==2.2.0
tokenizers==0.9.4
tomli==2.0.1
tqdm==4.64.1
typing_extensions==4.5.0
urllib3==1.26.14
Werkzeug==2.2.3
wrapt==1.14.1
zipp==3.13.0

I followed the [PyTorch installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuron/setup/pytorch-install.html) guide, then followed the instructions in this notebook. I get the following compilation error:

aws neuron bug report compilation error.txt

On torch-neuronx 2.1 Beta import xla_backend fails

Error:

I am getting the following error when importing xla_backend on torch-neuronx 2.1:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'libneuronpjrt-path'
  File "/home/ubuntu/efs/git/gpt2-fsdp/policies/wrapping.py", line 5, in <module>
    import torch_xla.distributed.xla_backend
  File "/home/ubuntu/efs/git/gpt2-fsdp/policies/__init__.py", line 2, in <module>
    from .wrapping import *
  File "/home/ubuntu/efs/git/gpt2-fsdp/train_fsdp.py", line 32, in <module>
    import policies
FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'

Reproduce:

import torch_xla.distributed.xla_backend

Neuron OS packages

dpkg --list | grep neuron
ii  aws-neuronx-collectives                    2.20.11.0-c101c322e                      amd64        neuron_ccom built using CMake
ii  aws-neuronx-dkms                           2.15.9.0                                 amd64        aws-neuronx driver in DKMS format.
ii  aws-neuronx-oci-hook                       2.2.45.0                                 amd64        neuron_oci_hook built using CMake
ii  aws-neuronx-runtime-lib                    2.20.11.0-b7d33e68b                      amd64        neuron_runtime built using CMake
ii  aws-neuronx-tools                          2.17.0.0                                 amd64        Neuron profile and debug tools

Pip freeze

aws-neuronx-runtime-discovery==2.9
libneuronxla==2.0.755
neuronx-cc==2.12.68.0+4480452af
neuronx-hwm==2.12.0.0+422c9037c
torch-neuronx==2.1.1.2.0.1b0

Env

PJRT_DEVICE=NEURON

Hardware

trn1.32xlarge

OS

uname -a
Linux ip-172-31-53-77 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Long context inference overhead with Llama NeuronX

I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.

Issue

However, for my use case the input contexts are 500-3000 tokens. When I provide an example 3000 token context, there is a 10-30 second overhead before the first token is generated. After the first token, the inference speed is 50 tok/sec as expected.
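
To separate the two costs described above, it can help to time the first token apart from the steady-state rate. A minimal sketch with a hypothetical token stream standing in for the real streaming response; the delays are made up:

```python
import time

def fake_stream(n_tokens, prefill_delay=0.05, token_delay=0.001):
    """Hypothetical stand-in for a streaming response: slow first token, fast rest."""
    time.sleep(prefill_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(token_delay)
        yield f"tok{i}"

def measure(stream):
    """Return (time to first token, tokens per second after the first token, token count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    if first is None:              # empty stream: nothing to measure
        return None, 0.0, 0
    ttft = first - start
    rate = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, rate, count

ttft, rate, count = measure(fake_stream(20))
print(f"time to first token: {ttft:.3f}s, decode rate: {rate:.0f} tok/s, tokens: {count}")
```

The first-token time covers processing the whole prompt, so it is expected to grow with context length, while the per-token rate afterwards should stay roughly constant.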

Attempted fixes

I have tried the following to resolve the long context overhead:

Adjusted TorchServe config values for maxWorkers, maxBatchDelay, batchSize - no improvement
Increased max_length parameter to support longer sequences - no improvement
Tried different micro_batch_size and parallelism values - no improvement
Updated all NeuronX libraries to latest versions:

model-config.yaml

minWorkers: 2
maxWorkers: 8  #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help

handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 6
    max_length: 100

#did not help either
# micro_batching:
#     micro_batch_size: 8
#     parallelism:
#         preprocess: 4
#         inference: 1
#         postprocess: 4

pip list

torch 1.13.1+cpu
torch-model-archiver 0.9.0b20231026
torch-neuronx 1.13.1.1.12.1
torch-workflow-archiver 0.2.11b20231026
torch-xla 1.13.1+torchneuronc
transformers-neuronx 0.8.268

Log files after startup instance

torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$ WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208


... some time later when I call the API


2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk 
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs 
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about 
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false

.....

2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472

I also tested the manual inference mode from the tutorial and saw the same overhead.

Ask

Is there something I'm missing in the config or use of Llama NeuronX to remove the long context overhead? I would like sub-second initial token latency for 500-3000 token contexts.
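
To put a number on the overhead, the delay can be read off the TorchServe timestamps in the log above. The sketch below copies three timestamps from that log. It assumes the pause between the PREDICT flush and the burst of "sent a reply" lines is the time spent processing the prompt; that is an interpretation of the log, not something the log states.

```python
from datetime import datetime

# TorchServe log timestamps look like 2023-12-05T00:00:48,458
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two TorchServe log timestamps."""
    t0 = datetime.strptime(start, TS_FORMAT)
    t1 = datetime.strptime(end, TS_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps copied from the log above.
predict_sent = "2023-12-05T00:00:48,458"    # Flushing req.cmd PREDICT to backend
stream_resumes = "2023-12-05T00:01:08,523"  # first "sent a reply" after the pause
job_done = "2023-12-05T00:01:12,610"        # sent a reply, jobdone: true

wait_before_stream = seconds_between(predict_sent, stream_resumes)
total_time = seconds_between(predict_sent, job_done)

print(f"wait before streamed replies: {wait_before_stream:.3f} s")
print(f"total request time:           {total_time:.3f} s")
print(f"streaming phase:              {total_time - wait_before_stream:.3f} s")
```

Under that assumption, most of the request time falls before the first streamed reply, which is the part the sub-second target concerns.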

The alternative is to deploy with SageMaker, but I don't have that set up, because we want to rewrite inference.py to extract logits and limit Llama to constrained generation.

Let me know if any other details would be helpful in troubleshooting this. Thanks!

Llama 2 70 B on neurons

I could successfully compile Llama 2 7B on NeuronX.

I was following this notebook:
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

For Llama 2 70B, however, the LlamaForSampling.from_pretrained step fails with this error:

ValueError: Weight with shape torch.Size([8192, 1024]) cannot be sharded along dimension 1. This results in 21 weight partitions which cannot be distributed to 20 NeuronCores evenly. To fix this issue either the model parameters or the tp_degree must be changed to allow the weight to be evenly split

I downloaded the "meta-llama/Llama-2-70b-hf" model from Hugging Face and pointed the step above at the directory 'models--meta-llama--Llama-2-70b-hf/snapshots/'.
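
The error message says the weight must split evenly across NeuronCores. A minimal sketch of that divisibility condition follows, using the shape and dimension from the error. The candidate tp_degree list is hypothetical, and the real check in transformers-neuronx may apply further constraints that are not modelled here.

```python
def splits_evenly(shape, dim, tp_degree):
    """True when dimension `dim` of `shape` divides into tp_degree equal parts."""
    return shape[dim] % tp_degree == 0


def compatible_tp_degrees(shape, dim, candidates):
    """Filter the candidate tp_degree values down to those that split evenly."""
    return [tp for tp in candidates if splits_evenly(shape, dim, tp)]


# Shape and sharding dimension copied from the ValueError above.
weight_shape = (8192, 1024)
shard_dim = 1

# Hypothetical candidates; which values are usable also depends on how
# many NeuronCores the instance actually provides.
candidates = [1, 2, 4, 8, 16, 20, 24, 32]

print(compatible_tp_degrees(weight_shape, shard_dim, candidates))
```

Under this condition alone, 20 is ruled out because 1024 is not a multiple of 20, which is consistent with the error reporting 20 NeuronCores.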

Downloaded model structure does not correspond to the notebook instructions

Trying to execute: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

After cloning the LLama-13-b repo from Hugging Face, I get the following content

config.json is missing, and the code complains about it.
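
A quick way to confirm what is missing before running the notebook is to check the downloaded directory directly. This is a minimal sketch: the helper `missing_files` and the example path are hypothetical, and only config.json is known from this report to be required.

```python
from pathlib import Path


def missing_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not regular files in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical location of the cloned model repository.
    model_dir = "./Llama-2-13b"
    missing = missing_files(model_dir)
    if missing:
        print(f"{model_dir} is missing: {', '.join(missing)}")
    else:
        print(f"{model_dir} contains all required files")
```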

sample per sec

What are the samples per second and sequences per second of the BERT model when run on Trainium 32x?

Support for TR-OCR Base Printed for AWS Neuron(Inferentia)

Please provide support for converting Tr-OCR Base Printed with Neuron and for running its inference via JIT trace.
I get this error when trying it through the uploaded notebook. The process flow is as follows:
after matching the tensor shape to 768, the encoder compiles, but the decoder fails to compile with neuron command -9
@hyandell @mattmcclean @aws-maens @brunopistone

Is 13B chat model supported?

RuntimeError: init() expected at most 3 argument(s) but received 5 argument(s)

With the latest 0.5 version and Neuron SDK 2.12, some tutorials, such as https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/gpt-j-6b-sampling.ipynb, hit the error "RuntimeError: init() expected at most 3 argument(s) but received 5 argument(s)":

Downloading (…)lve/main/config.json: 100%|██████████| 930/930 [00:00<00:00, 264kB/s]
Downloading pytorch_model.bin: 100%|██████████| 24.2G/24.2G [04:59<00:00, 80.9MB/s]
.....
Compiler status PASS
....
Compiler status PASS
....
Compiler status PASS
....
Compiler status PASS
Traceback (most recent call last):
  File "gptj.py", line 28, in <module>
    neuron_model.to_neuron()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 72, in to_neuron
    self.program.setup(self.transformer.h, self.ln_lm_head)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/program.py", line 102, in setup
    kernel.load()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 376, in load
    self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
RuntimeError: __init__() expected at most 3 argument(s) but received 5 argument(s). Declaration: __init__(__torch__.torch.classes.neuron.ParallelModel _0, str _1, int _2) -> NoneType _0
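The declaration in the error shows that the loaded `ParallelModel` class accepts fewer arguments than `transformers_neuronx` passes to it. One possible explanation, which is an assumption and not confirmed in this thread, is that the installed packages come from different releases. A quick check is to print the installed versions side by side; the package list below is only a guess at what is relevant.

```python
from importlib import metadata

# Assumed set of relevant packages; adjust to the environment in question.
PACKAGES = ["torch", "torch-neuronx", "transformers-neuronx", "neuronx-cc"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version}")
```

Including this output in a bug report makes it easier to compare the environment against the combination a given tutorial was written for.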

Issue with compiling SD1.5 based model with neuronx

Hi!
I am trying to compile an SD1.5-based model with neuronx, following this example:
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_sd2_512_inference.ipynb

What I did:
1) Launched an AWS EC2 inf2.8xlarge instance.

2) Ran:
sudo apt-get install linux-headers-$(uname -r) -y
sudo apt-get install aws-neuronx-dkms --allow-change-held-packages -y
source /opt/aws_neuron_venv_pytorch/bin/activate

3) Followed the guide for SD2, but commented out the cross-attention modification and changed the shape of encoder_hidden_states to match SD1.5.

All parts except unet compile fine, but unet fails with an error.

Here is the code that fails and attached are the error log and traceback log:

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe
sample_1b = torch.randn([1, 4, 64, 64]).bfloat16()
timestep_1b = torch.tensor(999).bfloat16().expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 768]).bfloat16()
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b
unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference"],
)

traceback.log
error.log

Is this a bug in neuronx, or am I doing something wrong?
Thanks
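When adapting the SD2 notebook to another checkpoint, the example input shapes can be derived from the UNet config instead of being typed by hand. The sketch below assumes the config uses the diffusers field names `in_channels`, `sample_size`, and `cross_attention_dim`. The dictionary values are copied from the shapes in the post above, not read from a real checkpoint, and `unet_example_shapes` is a hypothetical helper.

```python
def unet_example_shapes(config, batch=1, seq_len=77):
    """Derive UNet example-input shapes from a config mapping."""
    size = config["sample_size"]
    return {
        "sample": (batch, config["in_channels"], size, size),
        "timestep": (batch,),
        "encoder_hidden_states": (batch, seq_len, config["cross_attention_dim"]),
    }


# Values taken from the post: 4 latent channels, 64x64 latents, 768-wide text states.
sd15_like = {"in_channels": 4, "sample_size": 64, "cross_attention_dim": 768}
shapes = unet_example_shapes(sd15_like)
print(shapes)
```

With a loaded pipeline, the same mapping would typically come from `pipe.unet.config`, which helps rule out a shape mismatch as the cause of a compile failure.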

SDXL Controlnet support

SDXL-base works perfectly on Inf2 chips, and the other SDXL pipelines (inpaint, img2img) work as well. However, as far as I have read and tried, there is no support for ControlNet or IPAdapter. Are these features on the development roadmap for future Neuron releases?

Why are the VAE encoder and tokenizer not traced in SD inference?

Hi team,
In this tutorial, I read that only the VAE_post_quant_conv and the VAE decoder are traced. Why are the VAE encoder and the tokenizer not traced as well?



Long context inference overhead with Llama NeuronX
