
Comments (9)

byshiue avatar byshiue commented on August 17, 2024

Could you share the end to end reproduced steps?

Also, how do you make the initial GPU usage 100%? By running another program at the same time?

from tensorrt-llm.

pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

Hi,

We had a similar issue when looping and sending the same input data many times to stress test the GPU.

It gave us some NaNs after hundreds of thousands of inferences (again, same data every time, and synchronizing everywhere it made sense)! The model follows the RoBERTa architecture.

We noticed that another RoBERTa-based model, compiled the same way, had no such issue, so we suspected it was weight-related. We switched to BF16, keeping the fp16-optimized kernel for flash attention. During the build, TRT complains a lot that some nodes get both bf16 and fp16 inputs (I guess to let us know it is adding casts in many places), but end-to-end performance is almost the same. Output precision is slightly lower (compared to the fp32 AMP reference model), but the issue is gone, and we have stress tested it quite a lot since then.

Maybe something to test, @0xd8b?


0xd8b avatar 0xd8b commented on August 17, 2024

We converted the T5 model using the files in example/enc_dec/. The data type used for conversion is float16 (batch_size=1, strongly_typed=True, use_bert_plugin=True). Additionally, we truncated the output of hidden_states. However, during high GPU utilization, the encoder output becomes NaN.

Yes, we concurrently ran another model inference program to achieve 100% GPU utilization, yet only 1/10 of the total memory was utilized.

An intriguing observation is that NaN does not occur when converting with the float32 data type. Additionally, with the float16 type, if the GPU's initial utilization is 0%, inference proceeds normally. From this observation, it seems the issue is not solely related to data overflow.
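(A small helper for the kind of stress test described above — scanning each output tensor for non-finite values after every inference. This is an illustrative sketch, not code from the issue; the tensor name is hypothetical:)

```python
import numpy as np

def count_nonfinite(tensors: dict) -> dict:
    """Return {name: count of NaN/Inf entries} for each output tensor."""
    return {name: int((~np.isfinite(t)).sum()) for name, t in tensors.items()}

# Replaying the same input should give finite, identical outputs every time;
# a nonzero count under load points at a numerical (often overflow) problem.
outputs = {"encoder_output": np.array([[1.0, np.nan], [np.inf, 2.0]], dtype=np.float32)}
print(count_nonfinite(outputs))  # {'encoder_output': 2}
```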


0xd8b avatar 0xd8b commented on August 17, 2024

@pommedeterresautee Thanks for your reply! Are you referring to converting the model using the bfloat16 data type?


pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

Yes, the conversion to bf16 is done during the conversion step. We had to modify the part where it binds weights to the TRT engine. Have a look at _utils in the TRT-LLM library; there is some useful stuff for bf16.
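(One practical wrinkle when binding bf16 weights is that numpy has no native bfloat16 dtype, so fp32 weights are often carried as uint16 bit patterns. A sketch of that truncation trick — purely illustrative, not the TRT-LLM `_utils` implementation, and it ignores NaN/inf edge cases:)

```python
import numpy as np

def fp32_to_bf16_bits(w: np.ndarray) -> np.ndarray:
    """Truncate fp32 values to bf16 bit patterns (round-to-nearest-even),
    stored as uint16 since numpy lacks a bfloat16 dtype."""
    bits = w.astype(np.float32).view(np.uint32)
    # round-to-nearest-even on the 16 mantissa bits being dropped
    rounded = bits + np.uint32(0x7FFF) + ((bits >> 16) & 1)
    return (rounded >> 16).astype(np.uint16)

w = np.array([1.0, -2.5], dtype=np.float32)
print([hex(v) for v in fp32_to_bf16_bits(w)])  # ['0x3f80', '0xc020']
```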


pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

FWIW, a while back I wrote a bunch of custom Triton (the language, not the server) kernels, and T5 weights were very hard to manage in fp16: the largest flavors produced NaN from time to time depending on the input. It has been a long time since I touched it, but from what I remember, Google trained it in bf16. I know that at first glance this is not related to GPU occupancy, but it may be something to keep in mind.
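(A common mitigation for exactly this T5-in-fp16 problem is to clamp hidden states into fp16's representable range before they overflow. A minimal sketch, assuming the clamp is applied to intermediate activations; the margin of 1000 is an arbitrary illustrative choice:)

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

def clamp_fp16(hidden: np.ndarray) -> np.ndarray:
    """Clamp activations into fp16's representable range (with a margin)
    so a later cast to fp16 yields large-but-finite values instead of inf."""
    limit = FP16_MAX - 1000.0
    return np.clip(hidden, -limit, limit)

h = np.array([1e5, -2e5, 3.0], dtype=np.float32)
print(clamp_fp16(h).astype(np.float16))  # all finite
```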


0xd8b avatar 0xd8b commented on August 17, 2024

@pommedeterresautee OK, thanks for the suggestion, I will give it a try. However, I'm still curious why the model (float16) works fine at low GPU usage.


0xd8b avatar 0xd8b commented on August 17, 2024

We attempted to convert the model to bfloat16 and run inference, yet the issue persists under high GPU utilization. The problem seems to occur in the computation of the RMSNorm layer.
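(If RMSNorm is indeed the culprit, one frequent cause is accumulating the sum of squares in the low-precision dtype. A hedged sketch of the usual fix — accumulate in fp32, then cast back; this is an illustration of the technique, not the engine's actual kernel:)

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm with the mean-square accumulated in fp32, then cast back.
    Summing x**2 directly in fp16 is a classic overflow source."""
    x32 = x.astype(np.float32)
    ms = np.mean(x32 * x32, axis=-1, keepdims=True)
    out = x32 / np.sqrt(ms + eps) * weight.astype(np.float32)
    return out.astype(x.dtype)

# 1024 values of 200.0: the fp32 sum of squares is ~4.1e7, far past fp16's max,
# yet the fp32 accumulation keeps the normalized output finite.
x = np.full((1, 1024), 200.0, dtype=np.float16)
print(np.isfinite(rmsnorm(x, np.ones(1024, dtype=np.float16))).all())  # True
```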


0xd8b avatar 0xd8b commented on August 17, 2024

The issue is also caused by the encoder_input_length problem described in #1847. This issue can be closed.

