
Comments (9)

byshiue avatar byshiue commented on August 17, 2024

Could you share the end to end reproduced steps?

Also, how do you make the initial GPU usage 100%? By running another program at the same time?

from tensorrt-llm.

pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

Hi,

We had a similar issue when looping and sending the same input data many times to stress test the GPU.

It gave us some NaNs after hundreds of thousands of inferences (again, same data every time, and synchronizing everywhere it made sense)! The model follows the RoBERTa architecture.

We noticed that another RoBERTa-based model, compiled the same way, had no such issue, so we suspected it was weight-related. We switched to BF16, keeping the fp16-optimized kernel for flash attention. During the build, TRT complains a lot that some nodes get both bf16 and fp16 inputs (I guess to let us know it is adding casts in many places), but end-to-end performance is almost the same. Output precision is slightly lower (compared to the fp32 AMP reference model), but the issue is gone, and we have stress tested it quite a lot since then.

Maybe something to test, @0xd8b?


0xd8b avatar 0xd8b commented on August 17, 2024

We converted the T5 model using the files in example/enc_dec/. The data type used for conversion is float16 (batch_size=1, strongly_typed=True, use_bert_plugin=True). Additionally, we truncated the output of hidden_states. However, during high GPU utilization, the encoder output becomes NaN.

Yes, we concurrently ran another model inference program to achieve 100% GPU utilization, yet only 1/10 of the total memory was utilized.

An intriguing observation is that NaN does not occur when converting with the float32 data type. Additionally, with the float16 type, if the GPU's initial utilization is 0%, inference proceeds normally. From this observation, it seems the issue is not solely related to data overflow.
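(A small helper for the kind of stress test described above — scanning each output tensor for non-finite values after every inference. This is an illustrative sketch, not code from the issue; the tensor name is hypothetical:)

```python
import numpy as np

def count_nonfinite(tensors: dict) -> dict:
    """Return {name: count of NaN/Inf entries} for each output tensor."""
    return {name: int((~np.isfinite(t)).sum()) for name, t in tensors.items()}

# Replaying the same input should give finite, identical outputs every time;
# a nonzero count under load points at a numerical (often overflow) problem.
outputs = {"encoder_output": np.array([[1.0, np.nan], [np.inf, 2.0]], dtype=np.float32)}
print(count_nonfinite(outputs))  # {'encoder_output': 2}
```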


0xd8b avatar 0xd8b commented on August 17, 2024

@pommedeterresautee Thanks for your reply! Are you referring to converting the model using the bfloat16 data type?


pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

Yes, the conversion to bf16 is done during the conversion step. We had to modify the part where it binds weights to the TRT engine. Have a look at _utils in the TRT-LLM library; there is some useful stuff for bf16.
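(One practical wrinkle when binding bf16 weights is that numpy has no native bfloat16 dtype, so fp32 weights are often carried as uint16 bit patterns. A sketch of that truncation trick — purely illustrative, not the TRT-LLM `_utils` implementation, and it ignores NaN/inf edge cases:)

```python
import numpy as np

def fp32_to_bf16_bits(w: np.ndarray) -> np.ndarray:
    """Truncate fp32 values to bf16 bit patterns (round-to-nearest-even),
    stored as uint16 since numpy lacks a bfloat16 dtype."""
    bits = w.astype(np.float32).view(np.uint32)
    # round-to-nearest-even on the 16 mantissa bits being dropped
    rounded = bits + np.uint32(0x7FFF) + ((bits >> 16) & 1)
    return (rounded >> 16).astype(np.uint16)

w = np.array([1.0, -2.5], dtype=np.float32)
print([hex(v) for v in fp32_to_bf16_bits(w)])  # ['0x3f80', '0xc020']
```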


pommedeterresautee avatar pommedeterresautee commented on August 17, 2024

FWIW, a while back I wrote a bunch of custom Triton (the language, not the server) kernels, and T5 weights were very hard to manage in fp16: the largest flavors produced NaN from time to time depending on the input. It has been a long time since I touched it, but from what I remember, Google trained it in bf16. I know that at first glance this is not related to GPU occupancy, but it may be something to keep in mind.
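(A common mitigation for exactly this T5-in-fp16 problem is to clamp hidden states into fp16's representable range before they overflow. A minimal sketch, assuming the clamp is applied to intermediate activations; the margin of 1000 is an arbitrary illustrative choice:)

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

def clamp_fp16(hidden: np.ndarray) -> np.ndarray:
    """Clamp activations into fp16's representable range (with a margin)
    so a later cast to fp16 yields large-but-finite values instead of inf."""
    limit = FP16_MAX - 1000.0
    return np.clip(hidden, -limit, limit)

h = np.array([1e5, -2e5, 3.0], dtype=np.float32)
print(clamp_fp16(h).astype(np.float16))  # all finite
```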


0xd8b avatar 0xd8b commented on August 17, 2024

@pommedeterresautee OK, thanks for the suggestion, I will give it a try. However, I'm still curious why the model (float16) works fine at low GPU usage.


0xd8b avatar 0xd8b commented on August 17, 2024

We attempted to convert the model to bfloat16 and run inference, yet the issue persists under high GPU utilization. The problem seems to occur in the computation of the RMSNorm layer.
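(If RMSNorm is indeed the culprit, one frequent cause is accumulating the sum of squares in the low-precision dtype. A hedged sketch of the usual fix — accumulate in fp32, then cast back; this is an illustration of the technique, not the engine's actual kernel:)

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm with the mean-square accumulated in fp32, then cast back.
    Summing x**2 directly in fp16 is a classic overflow source."""
    x32 = x.astype(np.float32)
    ms = np.mean(x32 * x32, axis=-1, keepdims=True)
    out = x32 / np.sqrt(ms + eps) * weight.astype(np.float32)
    return out.astype(x.dtype)

# 1024 values of 200.0: the fp32 sum of squares is ~4.1e7, far past fp16's max,
# yet the fp32 accumulation keeps the normalized output finite.
x = np.full((1, 1024), 200.0, dtype=np.float16)
print(np.isfinite(rmsnorm(x, np.ones(1024, dtype=np.float16))).all())  # True
```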


0xd8b avatar 0xd8b commented on August 17, 2024

The issue is also caused by the encoder_input_length problem described in #1847. This issue can be closed.

