nvidia / tensorrt-llm

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Home Page: https://nvidia.github.io/TensorRT-LLM

License: Apache License 2.0

Python 0.51% CMake 0.01% C++ 99.32% C 0.01% Cuda 0.14% Makefile 0.01% Smarty 0.01% Shell 0.01% Dockerfile 0.01% PowerShell 0.01%

tensorrt-llm's Introduction

TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Results   |   Examples   |   Documentation


Latest News

TensorRT-LLM Overview

TensorRT-LLM is an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).

The TensorRT-LLM Python API architecture looks similar to the PyTorch API. It provides a functional module containing functions like einsum, softmax, matmul or view. The layers module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.
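
As a rough illustration of that PyTorch-like, graph-building style, here is a minimal sketch that composes an attention-score computation from functional primitives. It is only a sketch: the module paths follow the description above, but the exact constructors and signatures (Builder, net_guard, Tensor, matmul, softmax) are assumptions that may differ between TensorRT-LLM versions.

import tensorrt as trt
import tensorrt_llm
from tensorrt_llm.functional import matmul, softmax

builder = tensorrt_llm.Builder()
network = builder.create_network()

with tensorrt_llm.net_guard(network):
    # Tensors are symbolic: ops record TensorRT layers instead of executing eagerly.
    q = tensorrt_llm.Tensor(name='q', dtype=trt.float16, shape=(1, 16, 64))
    k = tensorrt_llm.Tensor(name='k', dtype=trt.float16, shape=(1, 16, 64))
    scores = softmax(matmul(q, k, transb=True), dim=-1)
    # The resulting network is then compiled into a TensorRT engine by the builder.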

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. Refer to the Support Matrix for a list of supported models.

To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (refer to the Support Matrix). TensorRT-LLM supports INT4 or INT8 weights (with FP16 activations, a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
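
For intuition on what "INT4/INT8 weight-only" means, the NumPy sketch below quantizes a weight matrix to INT8 with one scale per output channel and dequantizes it back at matmul time, while activations stay in FP16 throughout. This is purely a conceptual illustration of the storage/accuracy trade-off, not the fused kernel path TensorRT-LLM actually uses.

import numpy as np

# Conceptual sketch of INT8 weight-only quantization: weights are stored as INT8
# plus a per-output-channel scale; activations remain FP16.
rng = np.random.default_rng(0)
w_fp16 = rng.standard_normal((1024, 1024)).astype(np.float16)   # original weights
x_fp16 = rng.standard_normal((8, 1024)).astype(np.float16)      # FP16 activations

# Per-output-channel symmetric quantization to INT8.
scale = np.abs(w_fp16).max(axis=1, keepdims=True).astype(np.float32) / 127.0
w_int8 = np.clip(np.round(w_fp16.astype(np.float32) / scale), -127, 127).astype(np.int8)

# At inference time the INT8 weights are dequantized (in real kernels this is fused
# into the GEMM) and multiplied with the FP16 activations.
w_deq = (w_int8.astype(np.float32) * scale).astype(np.float16)
y = x_fp16 @ w_deq.T

print("stored weight bytes:", w_int8.nbytes, "vs FP16:", w_fp16.nbytes)
print("max abs error vs FP16 matmul:", np.abs(y - x_fp16 @ w_fp16.T).max())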

Getting Started

To get started with TensorRT-LLM, visit our documentation at https://nvidia.github.io/TensorRT-LLM.

Community

  • Model zoo (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)

tensorrt-llm's People

Contributors

a5hwinjs, basiccoder, byshiue, heandres, juney-nvidia, kaiyux, minwhoo, sam-india-007, shixiaowei02, sjbae1999, superjomn, tp5uiuc, whitelok


tensorrt-llm's Issues

TensorRT-LLM Releases

Hi TensorRT-LLM users,

We are very pleased to have released the very first public version of TensorRT-LLM. It has been an intense effort to create this project and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

Currently, there are two key branches in the project:

  • The release/0.5.0 branch contains what we'd call the stable code of the first release of TensorRT-LLM (version 0.5.0). It has been QA-ed and carefully tested.
  • The main branch contains what we'd call the dev code. It is more experimental.

We plan to update the main branch regularly with new features, bug fixes and performance optimizations. The stable branch(es) will be updated less frequently. The exact frequencies will depend on your feedback.

Thanks,
Julien (Eng. Lead for TensorRT-LLM)

Tactic running out of memory during Code Llama 34B build

On machines with either 8x A100-80GB or 8x H100, I'm getting many tactic out of memory issues during the build.

The tactic says it is requesting 530000 MB while the GPU has 80 GB, yet I only observe ~10 GB of GPU memory utilization during the build.

Here is my script:

python build.py --model_dir ./Phind-CodeLlama-34B-v2 \
                --dtype bfloat16 \
                --remove_input_padding \
                --use_gpt_attention_plugin bfloat16 \
                --enable_context_fmha \
                --use_gemm_plugin bfloat16 \
                --paged_kv_cache \
                --use_parallel_embedding \
                --use_inflight_batching \
                --max_input_len 14848 \
                --max_output_len 1536 \
                --vocab_size 32000 \
                --rotary_base 1000000 \
                --output_dir ./Phind/Phind-CodeLlama-34B-v2/trt-engines/bf16/8-gpu \
                --world_size 8 \
                --tp_size 8 \
                --parallel_build

The same issue happens with much smaller input and output lengths as well, which suggests the sequence lengths aren't the cause.

Phind-CodeLlama-34B is a standard 34B Code Llama that has been fine-tuned but is architecturally identical and is available here: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.

  1. Are these tactic errors resulting in a less optimized model? The model is still usable but it's slower than I expected.
  2. I also tried running with --builder_opt=5 for maximum optimizations, but that model completely fails to load in the Triton backend.

The documentation here could be improved -- I'd love to know what I can do to get the most optimized model possible @byshiue.

How to output intermediate result of model?

I am trying to support my own LLM in TRT-LLM. However, there is a significant difference between the TRT-LLM and PyTorch outputs. I want to dump the intermediate results of TRT-LLM (e.g., gpt_attention) when running prompts to figure out which layer introduces the precision error.
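
(Not an official answer, just one common debugging pattern.) A typical approach is to export the intermediate tensor from both runtimes for the same prompt and compare them layer by layer until the first mismatch. How the TensorRT-LLM tensor is exported (for example by marking it as an additional network output at build time) varies by version and is assumed here; the comparison itself is plain NumPy.

import numpy as np

# Hypothetical helper: compare an intermediate activation dumped from TensorRT-LLM
# against the PyTorch reference for the same prompt. File names are placeholders.
def compare_layer(trt_out: np.ndarray, ref_out: np.ndarray, name: str,
                  atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    trt_out = trt_out.astype(np.float32)
    ref_out = ref_out.astype(np.float32)
    abs_err = np.abs(trt_out - ref_out)
    rel_err = abs_err / (np.abs(ref_out) + 1e-6)
    ok = np.allclose(trt_out, ref_out, atol=atol, rtol=rtol)
    print(f"{name}: max_abs={abs_err.max():.4e} max_rel={rel_err.max():.4e} ok={ok}")
    return ok

# Walk the layers until the first mismatch (dump file names are hypothetical):
# for i in range(num_layers):
#     if not compare_layer(np.load(f"trt_attn_{i}.npy"), np.load(f"torch_attn_{i}.npy"),
#                          name=f"gpt_attention_{i}"):
#         break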

Failed to run benchmark llama-7b.

My code.

python benchmark.py \
    -m llama_7b \
    --mode plugin \
    --batch_size "1;10;30;60" \
    --input_output_len "128,512;128,1024;128,2048;128,4096"

The error.

[WARNING] check inlen(128) <= max_inlen(512) and outlen(512) <= max_outlen(200) failed, skipping.
[WARNING] check inlen(128) <= max_inlen(512) and outlen(1024) <= max_outlen(200) failed, skipping.
[WARNING] check inlen(128) <= max_inlen(512) and outlen(2048) <= max_outlen(200) failed, skipping.
[WARNING] check inlen(128) <= max_inlen(512) and outlen(4096) <= max_outlen(200) failed, skipping.

Build failure: libnvparsers not found in TensorRT 9.1.0.4

cmake configuration reports:

-- ========================= Importing and creating target nvuffparser ==========================
-- Looking for library nvparsers
-- Library that was found nvparsers_LIB_PATH-NOTFOUND
-- ==========================================================================================

and I cannot find libnvparsers.so in the TensorRT package; only libnvonnxparser exists:

./targets/x86_64-linux-gnu/lib/libnvonnxparser.so.9
./targets/x86_64-linux-gnu/lib/libnvonnxparser.so
./targets/x86_64-linux-gnu/lib/libnvonnxparser.so.9.1.0
./targets/x86_64-linux-gnu/lib/stubs/libnvonnxparser.so
./targets/x86_64-linux-gnu/lib/libnvonnxparser_static.a

Seeking Clarification on TensorRT-LLM Workflow for Extending Model Support

I have formed some impressions of the TensorRT-LLM workflow, and I'm seeking clarification to understand it better, especially as I aim to add support for models that the provided examples don't come close to covering. My understanding so far is that we need to define a new model (starting from, say, a PyTorch model) on top of the layers provided in this repository, and that this definition is eventually captured as a graph from which the TRT engine is generated. The idea is to map the weights from PyTorch into the TensorRT-LLM-compatible model, enabling creation of the TRT engine, which can then be used in run.py. So for any new model, we create a new model on top of the TensorRT-LLM layers and then copy the weights into it. If there are inaccuracies in my understanding, I'd appreciate any guidance to better grasp the underlying process.
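
That matches the general pattern in the bundled example converters: define the model from TensorRT-LLM layers, then copy the framework weights into its parameters before building the engine. Below is a rough sketch of the weight-copy step; all attribute and checkpoint key names are hypothetical placeholders, and the .value assignment is assumed from the example weight-loading scripts.

import numpy as np
import torch

# Hypothetical sketch of copying PyTorch weights into a TensorRT-LLM model object
# prior to engine build. Real models name their parameters and checkpoint keys
# differently; adapt the mapping to the architecture at hand.
def copy_weights(trt_llm_model, state_dict, num_layers: int) -> None:
    def to_np(t: torch.Tensor) -> np.ndarray:
        return t.detach().cpu().to(torch.float16).numpy()

    trt_llm_model.vocab_embedding.weight.value = to_np(state_dict["model.embed_tokens.weight"])
    for i in range(num_layers):
        layer = trt_llm_model.layers[i]
        # TensorRT-LLM parameters are populated through their .value attribute
        # (assumed here, following the example converter scripts in the repo).
        layer.attention.qkv.weight.value = to_np(state_dict[f"model.layers.{i}.attn.qkv.weight"])
        layer.mlp.fc.weight.value = to_np(state_dict[f"model.layers.{i}.mlp.fc.weight"])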

Any support for RWKV plz?

RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

Project Homepage: https://github.com/BlinkDL/RWKV-LM

Does TensorRT-LLM support such projects?

Baichuan V2 13B, INT4 weight-only: the output may be wrong

Baichuan V2 13B
--max_output_len=1024
a single GPU
INT4 weight-only quantization

With the configuration above, when I run run.py, it produces output like the following, i.e., no real output:
Input: "世界上第一高的山峰是哪座?" Output: "</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>

With some other combinations of settings, I have not seen this problem.

Mistral 7B support

Hi Nvidia, great work with this release. As you may be aware, Mistral is a new Apache 2.0-licensed model that beats most Llama-2 models. The architecture and layer names are nearly identical to Llama's except for the use of sliding-window attention (SWA).

Please add support for their models; it would be greatly appreciated, as they are MUCH better than Llama models.

Complex beam search

Hello! Thank you for this incredible project. It has been extremely useful.

I've noticed that while the current beam search method is quite effective, there are some variations like "group beam search" and "diverse beam search" that might provide improved results in some scenarios. Would it be possible to consider supporting these methods in future updates?

CMake Error at CMakeLists.txt:288 (file): file STRINGS file "/usr/local/tensorrt/include/NvInferVersion.h" cannot be read.

Hi there. I encountered an error while running python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt. The error message reads "CMake Error at CMakeLists.txt:288 (file): file STRINGS file "/usr/local/tensorrt/include/NvInferVersion.h" cannot be read.".
I installed TensorRT using pip3. It appears that TensorRT was installed in the directory ".local/lib/python3.9/site-packages", and the file "/include/NvInferVersion.h" is missing there. Should I install TensorRT from source instead? Can you please suggest a solution for this error?

tensorrt==9.1.0.post12.dev4
tensorrt-bindings==9.1.0.post12.dev4
tensorrt-libs==9.1.0.post12.dev4

I'm benchmarking llama-7b with batch size 8 on an A40, but OOM happened. I'm curious why a 7B model needs so much memory.

python benchmark.py -m llama_7b --batch_size "8" --mode plugin --input_output_len '2048,2048' --csv --max_input_len 2048 --max_output_len 2048

log error:
[TRT-LLM] [E] Exception CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 147.88 MiB is free. Process 1107750 has 44.16 GiB memory in use. Of the allocated memory 12.00 GiB is allocated by PyTorch, and 983.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF caught while allocating memory; skipping (8, 2048, 2048)

The A40 has 46 GB of memory, which I thought would be enough for batch size 8 with llama-7b.

Also, can you tell me how to estimate memory usage for different batch sizes and models? For a 7B model, how much memory do the weights take, and with an input/output length of 2k/2k, how much additional memory does each unit of batch size cost?
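
(A back-of-the-envelope estimate, not an authoritative answer.) Weight memory is roughly parameter count times bytes per element, and the KV cache grows with batch size, sequence length, layer count and hidden size. The sketch below uses commonly cited LLaMA-7B dimensions and ignores activations, TensorRT workspace and fragmentation, so actual usage will be higher.

# Rough FP16 memory estimate for llama-7b at batch 8 with 2048 input + 2048 output.
n_params    = 6.7e9          # ~6.7B parameters (commonly cited for LLaMA-7B)
bytes_fp16  = 2
n_layers    = 32
hidden_size = 4096
batch_size  = 8
seq_len     = 2048 + 2048    # max_input_len + max_output_len

weights_gib = n_params * bytes_fp16 / 1024**3
# KV cache: 2 tensors (K and V) per layer, each roughly [batch, seq_len, hidden] in FP16.
kv_cache_gib = 2 * n_layers * batch_size * seq_len * hidden_size * bytes_fp16 / 1024**3

print(f"weights  ~{weights_gib:.1f} GiB")
print(f"kv cache ~{kv_cache_gib:.1f} GiB")
print(f"total    ~{weights_gib + kv_cache_gib:.1f} GiB plus activations and workspace")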

AttributeError while executing build.py

I installed all the dependencies and tried to run the build.py file for llama2, but hit the following issue:

Traceback (most recent call last):
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/build.py", line 25, in <module>
    from weight import (get_scaling_factors, load_from_awq_llama, load_from_binary,
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/weight.py", line 25, in <module>
    import tensorrt_llm
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/__init__.py", line 15, in <module>
    import tensorrt_llm.functional as functional
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/functional.py", line 25, in <module>
    from . import graph_rewriting as gw
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/graph_rewriting.py", line 11, in <module>
    from .network import Network
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/network.py", line 26, in <module>
    from ._common import set_network
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/_common.py", line 21, in <module>
    from ._utils import str_dtype_to_trt
  File "/home/shavak/wk-atharv/WebScrappingChatBot/llm/udemy/TensorRT-LLM/examples/llama/tensorrt_llm/_utils.py", line 81, in <module>
    int64=trt.int64,
AttributeError: module 'tensorrt' has no attribute 'int64'

Can anyone please help?

How to benchmark offline throughput?

  1. The scripts in benchmarks only measure latency. So that we can compare with other LLM inference frameworks, is there a demo for offline throughput? I have compiled my LLM engine with the --use_inflight_batching option.
  2. Is the build option --use_inflight_batching compatible with --use_gpt_attention_plugin bfloat16? The README says: "Note that in-flight batching in C++ runtime works only with attention plugin --use_gpt_attention_plugin=float16, paged KV cache --paged_kv_cache and with packed data --remove_input_padding."

Llama 2 with LoRA

Is there any example of converting a Llama 2 model fine-tuned with LoRA to TensorRT-LLM?

Docker image

Could anyone say how long it will take to release the Docker image?

TensorRT 9 not available

Hello,

The release notes say that TensorRT-LLM requires TensorRT 9.1.0.4 and the 23.08 containers.
Where can I find TensorRT 9.1.0.4? This version does not seem to be available yet, unfortunately.

Thanks

wrong output in GPT2 example

Command:

root@r3sist-B450-AORUS-ELITE-release:/code/tensorrt_llm/examples/gpt# python3 run.py --max_output_len=8
Input: "Born in north-east France, Soyer trained as a"
Output: "!!!!!!!!"

How do I fix this bug?

segmentation fault : llama 2 with num_beams > 1

Running examples/llama/run.py with --num_beams 4 causes an exception:

Running the float16 engine ...
[9087e267f1dd:12236] *** Process received signal ***
[9087e267f1dd:12236] Signal: Segmentation fault (11)
[9087e267f1dd:12236] Signal code: Address not mapped (1)
[9087e267f1dd:12236] Failing at address: (nil)
[9087e267f1dd:12236] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f76f0772f10]
[9087e267f1dd:12236] [ 1] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x1b9b8)[0x7f76ec6589b8]
[9087e267f1dd:12236] [ 2] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0xcd)[0x7f76ec65965d]
[9087e267f1dd:12236] [ 3] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x17f0a)[0x7f76ec654f0a]
[9087e267f1dd:12236] [ 4] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x1cd)[0x7f76ec6568cd]
[9087e267f1dd:12236] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f76ea03a1a7]
[9087e267f1dd:12236] [ 6] /mnt0/TensorRT-9.1.0.4/lib/libnvinfer.so.9(+0xf9fcb5)[0x7f75ac8a8cb5]
[9087e267f1dd:12236] *** End of error message ***
Segmentation fault 

num_beams=1 is ok.

  • model: llama-7b-hf
  • Nvidia driver: 535.104.12
  • cuda: 12.2
  • cudnn: 8.9.4.25
  • tensorrt: 9.1.0.4

run tritonserver failed for chatglm2

For the chatglm2 model, I built the engine and run.py works well.
Then I built the docker image and launched tritonserver following
https://github.com/triton-inference-server/tensorrtllm_backend#launch-triton-server-within-ngc-container

and got this error:

+----------------+---------+--------------------------------------------------------------------------------------------------------------------+
| Model          | Version | Status                                                                                                             |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1       | READY                                                                                                              |
| preprocessing  | 1       | READY                                                                                                              |
| tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed:  |
|                |         | mpiSize == tp * pp (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:80)                                 |
|                |         | 1       0x7f923a86a645 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x17645) [0x7f923a86a645]  |
|                |         | 2       0x7f923a87748d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x2448d) [0x7f923a87748d]  |
|                |         | 3       0x7f923a8a9722 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56722) [0x7f923a8a9722]  |
|                |         | 4       0x7f923a8a4335 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51335) [0x7f923a8a4335]  |
|                |         | 5       0x7f923a8a221b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f21b) [0x7f923a8a221b]  |
|                |         | 6       0x7f923a885ec2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x32ec2) [0x7f923a885ec2]  |
|                |         | 7       0x7f923a885f75 TRITONBACKEND_ModelInstanceInitialize + 101                                                 |
|                |         | 8       0x7f93641a4116 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a0116) [0x7f93641a4116]                 |
|                |         | 9       0x7f93641a5356 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1356) [0x7f93641a5356]                 |
|                |         | 10      0x7f9364189bd5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x185bd5) [0x7f9364189bd5]                 |
|                |         | 11      0x7f936418a216 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186216) [0x7f936418a216]                 |
|                |         | 12      0x7f936419531d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19131d) [0x7f936419531d]                 |
|                |         | 13      0x7f9363807f68 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99f68) [0x7f9363807f68]                              |
|                |         | 14      0x7f9364181adb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17dadb) [0x7f9364181adb]                 |
|                |         | 15      0x7f936418f865 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b865) [0x7f936418f865]                 |
|                |         | 16      0x7f9364194682 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190682) [0x7f9364194682]                 |
|                |         | 17      0x7f9364277230 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x273230) [0x7f9364277230]                 |
|                |         | 18      0x7f936427a923 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x276923) [0x7f936427a923]                 |
|                |         | 19      0x7f93643c3e52 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3bfe52) [0x7f93643c3e52]                 |
|                |         | 20      0x7f9363a72253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9363a72253]                         |
|                |         | 21      0x7f9363802b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9363802b43]                              |
|                |         | 22      0x7f9363893bb4 clone + 68

Handling kv-cache in multi-modal GPT

Hey,

Firstly, a big thank you for the fantastic work! 🎊

I'm currently attempting to extend your GPT2 example to develop a multi-modal GPT. Essentially, I duplicated the tensorrt_llm/models/gpt and introduced a new embedding along with a head to cater to a new modality. I rely on examples/gpt/build.py for building my model and use examples/gpt/run.py for generation.

My main challenge lies in supporting the kv-cache, as there's a shift in model behavior between the initial and subsequent passes:

  1. First pass (context): All the hidden states are passed to the decoder blocks. The flow is as follows:

    gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))
  2. Subsequent passes (generation): Only the hidden state of the second modality gets directed to the decoder blocks. This is because the hidden states from the first modality are already stored in the kv-cache, and we are merely decoding for the second modality. In this case, the length of m2_inputs is 1, given the kv-cache. Here's the flow:

    gpt_decoder_layer(m2_embedding(m2_inputs))

I made an initial attempt with this logic:

if shape(input_ids.data, 1) == 1:
  # If we receive a single token, execute the kv-cache pass (generation)
  hidden_states = gpt_decoder_layer(m2_embedding(m2_inputs))
else:
  # If all tokens are received, execute a full pass (context)
  hidden_states = gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))

However, it didn't quite pan out as I had hoped.

Would you be able to guide me on how to design a model that can support the conditional flow I mentioned above?

Do let me know if you need any further clarification on any aspect. Thanks in advance!

Performance decay when using paged attention

Here is my benchmark result on A30 using llama-7b

[benchmark results were attached as a screenshot]

The performance with paged attention seems much worse than without it.
Also, overall performance is a little lower than FasterTransformer (FT).

Is that normal?

Failed to run batch inference

I'm running llama-7b on a single A30 GPU.
When the batch size is 1, it runs well.
When I try to run with input_ids of shape [2, 32] and input_lengths of tensor([32, 32]), it complains like this:

torch.Size([2, 32]) tensor([32, 32], device='cuda:0', dtype=torch.int32)
decoder.setup 2 32 50 1
[10/20/2023-10:18:29] [TRT] [E] 3: [executionContext.cpp::setInputShape::2257] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2257, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
Traceback (most recent call last):
  File "/data/trt-llm/TensorRT-LLM/examples/llama/run.py", line 299, in <module>
    generate(**vars(args))
  File "/data/trt-llm/TensorRT-LLM/examples/llama/run.py", line 259, in generate
    output_gen_ids = decoder.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 514, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1887, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1659, in decode_regular
    should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1430, in handle_per_step
    self.runtime._set_shape(context, ctx_shape)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 202, in _set_shape
    raise ValueError(
ValueError: Couldn't assign input_ids with shape torch.Size([2, 32]), engine supports [min, opt, max] = [(1, 1), (1, 8), (1, 16384)]

It seems the max_batch_size = 8 from the build phase is not correctly applied to the engine.

Here is my build script:

python build.py --model_dir /data/models/llama-7b-hf/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu/

Build Failure: libtensorrt_llm_batch_manager_static.a:1: syntax error

Hi guys, I am trying to install TensorRT-LLM following the Build Step-by-Step instructions.

I have successfully built the Docker image and run the container on my machine using:

make -C docker build
make -C docker run

After running python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt, I have:

[ 98%] Built target kernels_src
[ 98%] Linking CXX static library libtensorrt_llm_static.a
[ 98%] Built target tensorrt_llm_static
[100%] Linking CXX shared library libtensorrt_llm.so
/usr/bin/ld:/code/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a: file format not recognized; treating as linker script
/usr/bin/ld:/code/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a:1: syntax error
collect2: error: ld returned 1 exit status
gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:714: tensorrt_llm/libtensorrt_llm.so] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:677: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
gmake: *** [Makefile:179: tensorrt_llm] Error 2
Traceback (most recent call last):
  File "/code/tensorrt_llm/./scripts/build_wheel.py", line 248, in <module>
    main(**vars(args))
  File "/code/tensorrt_llm/./scripts/build_wheel.py", line 152, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 16 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.

Could you take a look at the issue above? Thanks so much.

x86_64-conda_cos6-linux-gnu-cc: command not found

Trying to install TensorRT-LLM in my own Docker image, I ran into this problem:

$ python3 scripts/build_wheel.py --trt_root="${TRT_ROOT}" -i -c && cd ..
...
building 'mpi4py.dl' extension
/opt/conda/envs/llama-trt/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/envs/llama-trt/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/envs/llama-trt/include -fPIC -DHAVE_DLFCN_H=1 -DHAVE_DLOPEN=1 -I/opt/conda/envs/llama-trt/include/python3.8 -c src/dynload.c -o build/temp.linux-x86_64-cpython-38/src/dynload.o
/opt/conda/envs/llama-trt/bin/x86_64-conda-linux-gnu-cc -shared -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-rpath,/opt/conda/envs/llama-trt/lib -L/opt/conda/envs/llama-trt/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-rpath,/opt/conda/envs/llama-trt/lib -L/opt/conda/envs/llama-trt/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/opt/conda/envs/llama-trt/lib -Wl,-rpath-link,/opt/conda/envs/llama-trt/lib -L/opt/conda/envs/llama-trt/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/envs/llama-trt/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/envs/llama-trt/include build/temp.linux-x86_64-cpython-38/src/dynload.o -Lbuild/temp.linux-x86_64-cpython-38 -ldl -o build/lib.linux-x86_64-cpython-38/mpi4py/dl.cpython-38-x86_64-linux-gnu.so
checking for MPI compile and link ...
/opt/conda/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/envs/llama-trt/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/envs/llama-trt/include -fPIC -I/opt/conda/envs/llama-trt/include/python3.8 -c _configtest.c -o _configtest.o
/opt/conda/bin/mpicc: line 301: x86_64-conda_cos6-linux-gnu-cc: command not found
failure.
removing: _configtest.c _configtest.o
error: Cannot compile MPI programs. Check your configuration!!!
[end of output]

which x86_64-conda_cos6-linux-gnu-cc outputs nothing, while:
which x86_64-conda_cos7-linux-gnu-cc
/opt/conda/envs/llama-trt/bin/x86_64-conda_cos7-linux-gnu-cc

An error when building tensorrt_llm

[ 98%] Built target layers_src
[ 98%] Built target common_src
[ 98%] Built target runtime_src
[ 98%] Built target kernels_src
[ 98%] Linking CXX static library libtensorrt_llm_static.a
[ 98%] Built target tensorrt_llm_static
[100%] Linking CXX shared library libtensorrt_llm.so
/usr/bin/ld:/workspace/base/TensorRT-LLM-release-0.5.0/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a: file format not recognized; treating as linker script
/usr/bin/ld:/workspace/base/TensorRT-LLM-release-0.5.0/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a:1: syntax error
collect2: error: ld returned 1 exit status
gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:714: tensorrt_llm/libtensorrt_llm.so] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:677: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
gmake: *** [Makefile:179: tensorrt_llm] Error 2

with the build command:

./scripts/build_wheel.py --clean --cuda_architectures "80-real" --trt_root /usr/local/tensorrt
  • libtensorrt_llm_batch_manager_static.a:1: syntax error

Error on docker file build

Got the following error when building the Docker image; I followed the instructions written here:

7145.6 [ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int4b.cu.o
7159.0 [ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
7175.4 [ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
7175.7 [ 98%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
7507.6 [ 98%] Built target kernels_src
7507.6 [ 98%] Linking CXX static library libtensorrt_llm_static.a
7523.5 [ 98%] Built target tensorrt_llm_static
7523.5 [100%] Linking CXX shared library libtensorrt_llm.so
7523.6 /usr/bin/ld:/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a: file format not recognized; treating as linker script
7523.6 /usr/bin/ld:/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a:1: syntax error
7523.7 collect2: error: ld returned 1 exit status
7523.7 gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:714: tensorrt_llm/libtensorrt_llm.so] Error 1
7523.7 gmake[2]: *** [CMakeFiles/Makefile2:677: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
7523.7 gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
7523.7 gmake: *** [Makefile:179: tensorrt_llm] Error 2
7523.7 Traceback (most recent call last):
7523.7   File "/src/tensorrt_llm/scripts/build_wheel.py", line 248, in <module>
7523.7     main(**vars(args))
7523.7   File "/src/tensorrt_llm/scripts/build_wheel.py", line 152, in main
7523.7     build_run(
7523.7   File "/usr/lib/python3.10/subprocess.py", line 526, in run
7523.7     raise CalledProcessError(retcode, process.args,
7523.7 subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 8 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.
------
Dockerfile.multi:48
--------------------
  46 |     
  47 |     ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt"
  48 | >>> RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}
  49 |     
  50 |     FROM devel as release
--------------------
ERROR: failed to solve: process "/bin/bash -c python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}" did not complete successfully: exit code: 1
make: *** [Makefile:47: release_build] Error 1
make: Leaving directory '/saeed/openai-whisper/TensorRT-LLM/docker'

Building in TensorRT docker container

Hello, I am using the TensorRT docker container 23.09 and building TensorRT-LLM inside the container.
I follow these steps:

apt-get update && apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull

Then I changed the TensorRT lib/include paths in scripts/build_wheel.py, since the TensorRT header files in the docker container are stored in /usr/include/x86_64-linux-gnu, so I had to change:

if trt_root is not None:
    trt_root = trt_root.replace("\\", "/")
    trt_lib_dir_candidates = (
        f"/usr/lib/x86_64-linux-gnu/",
        f"{trt_root}/lib")
    try:
        trt_lib_dir = next(
            filter(lambda x: Path(x).exists(), trt_lib_dir_candidates))
    except StopIteration:
        trt_lib_dir = trt_lib_dir_candidates[0]
    cmake_def_args.append(f"-DTRT_LIB_DIR={trt_lib_dir}")
    cmake_def_args.append(f"-DTRT_INCLUDE_DIR={trt_root}")

And then I run the wheel build:

python3 ./scripts/build_wheel.py --clean  --trt_root /usr/include/x86_64-linux-gnu/

But I face this error:

[ 91%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[ 91%] Built target kernels_src
gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
gmake: *** [Makefile:179: tensorrt_llm] Error 2
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/./scripts/build_wheel.py", line 248, in <module>
    main(**vars(args))
  File "/workspace/TensorRT-LLM/./scripts/build_wheel.py", line 152, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 6 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.

starcoder engine building failed

The error occurs when I convert the FT checkpoint to an engine.

Error info:

[10/21/2023-07:49:30] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception PLUGIN_V2 operation not supported within this graph.
[10/21/2023-07:49:30] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[GPTLMHeadModel/layers/0/attention/qkv/CONSTANT_1 + GPTLMHeadModel/layers/0/attention/qkv/SHUFFLE_0...GPTLMHeadModel/layers/0/attention/PLUGIN_V2_GPTAttention_0]}.
[10/21/2023-07:49:30] [TRT] [E] 10: [optimizer.cpp::computeCosts::4040] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[GPTLMHeadModel/layers/0/attention/qkv/CONSTANT_1 + GPTLMHeadModel/layers/0/attention/qkv/SHUFFLE_0...GPTLMHeadModel/layers/0/attention/PLUGIN_V2_GPTAttention_0]}.)
[10/21/2023-07:49:30] [TRT-LLM] [E] Engine building failed, please check the error log.

ERROR: This container was built for NVIDIA Driver Release 535.86 or later, compatibility mode is UNAVAILABLE.

ERROR: This container was built for NVIDIA Driver Release 535.86 or later, but
       version 510.85.02 was detected and compatibility mode is UNAVAILABLE.

I built the docker image on this machine:

 NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6

When I run the container, it complains as above. What does this mean? Can I still run TRT-LLM in this container?

FasterTransformer (FT) still seems to run fine in this container.

Methods to evaluate Throughput (tokens/s)

Hello,
I am wondering how you evaluate throughput (tokens/s). For the throughput figures you report, do you count only the generated tokens, or do you include the context (prefill) processing as well?
Thank you!
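
(One common convention, not necessarily the one behind the published numbers.) Throughput is usually reported as generated tokens divided by wall-clock time; whether the context (prefill) tokens are also counted changes the figure noticeably, as this small sketch shows.

import time

# Two common ways to report tokens/s; which one is used should always be stated.
def measure_throughput(generate_fn, prompt_tokens: int, max_new_tokens: int):
    start = time.perf_counter()
    generated = generate_fn(max_new_tokens)     # returns the number of generated tokens
    elapsed = time.perf_counter() - start
    decode_tps = generated / elapsed                      # generation-only throughput
    total_tps = (prompt_tokens + generated) / elapsed     # includes context processing
    return decode_tps, total_tps

# Usage with a dummy generator standing in for a real TensorRT-LLM session:
decode_tps, total_tps = measure_throughput(
    lambda n: (time.sleep(0.5), n)[1], prompt_tokens=128, max_new_tokens=512)
print(f"generation-only: {decode_tps:.1f} tok/s, including context: {total_tps:.1f} tok/s")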

Nvidia Jetson device Support

Dear Nvidia Team,

I would like to request support for running TensorRT-LLM on the Nvidia AGX Orin development kit.

Thank you!

Best regards,
Shakhizat

Failed to build with CUDA 11.8 due to change in cudaGraphExecUpdate parameter

In CUDA 11.8, the cudaGraphExecUpdate signature is:

__host__ cudaError_t cudaGraphExecUpdate ( cudaGraphExec_t hGraphExec, cudaGraph_t hGraph, cudaGraphNode_t* hErrorNode_out, cudaGraphExecUpdateResult* updateResult_out )

And in CUDA 12, the cudaGraphExecUpdate signature is:

__host__ cudaError_t cudaGraphExecUpdate ( cudaGraphExec_t hGraphExec, cudaGraph_t hGraph, cudaGraphExecUpdateResultInfo* resultInfo )

This change causes a compile error in tensorrt_llm/runtime.

gptManagerBenchmark std::bad_alloc error

machine: nv 4090 24GB
model: llama13B-gptq (the GPU memory should be enough)
Problem: std::bad_alloc error when starting GptManager.
Expected: runs successfully.

root@ubuntu-devel:/code/tensorrt_llm/cpp/build/benchmarks# CUDA_VISIBLE_DEVICES=0  ./gptManagerBenchmark     --model llama13b_gptq_compiled     --engine_dir /code/tensorrt_llm/models/llama13b_gptq_compiled     --type IFB     --dataset /code/tensorrt_llm/models/llama13b_gptq/preprocessed_dataset.json  --log_level verbose --kv_cache_free_gpu_mem_fraction 0.2
[TensorRT-LLM][INFO] Set logger level by TRACE
[TensorRT-LLM][DEBUG] Registered plugin creator Identity version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator BertAttention version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator GPTAttention version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Gemm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Send version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Recv version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator AllReduce version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator AllGather version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Layernorm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Rmsnorm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator SmoothQuantGemm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator LayernormQuantization version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator QuantizePerToken version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator QuantizeTensor version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator RmsnormQuantization version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator WeightOnlyGroupwiseQuantMatmul version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator WeightOnlyQuantMatmul version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Lookup version 1 in namespace tensorrt_llm
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][ERROR] std::bad_alloc

Build failures

Trying to build from source following instructions in
https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md
via docker. Getting this https://gist.github.com/divchenko/4bcd575954f3a6b5e3350fd2d2762002

Switched to C++ only build:
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;90-real" --cpp_only --clean
but then building the benchmarks fails with C++11 ABI linking issues. Looking at the symbols, it appears only pre-C++11 ABI symbols were built.
Something is seriously wrong with the setup here.
