Comments (6)
@LIUKAI0815 Thanks for the feedback. Could you kindly tell me which model you are using? This requires using the official GPTQ-quantized checkpoints from HF.
from tensorrt-llm.
I have the same issue using a quantized Mistral model: TheBloke/Mistral-7B-v0.1-AWQ
@jershi425 I'm using Qwen1.5-14B-Chat.
Has this problem been solved? I hit the same error when using a quantized Mixtral model.
Hi @Mary-Sam, could you please share more details/logs on your issue so we can look into it?
Hi @nv-guomingz,
I ran the following command for the quantized model:
python3 /tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /model --output_dir /engine --load_model_on_cpu
I am using the latest version, tensorrt_llm==0.9.0.
My model has the following quantization configuration:
{
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": ["gate"],
  "quant_method": "awq",
  "version": "gemm",
  "zero_point": true
}
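For context, a quantization config like this lives in the checkpoint's config.json, and checking its quant_method is a quick way to confirm the checkpoint is already quantized before feeding it to a converter. A minimal, self-contained sketch (the JSON is inlined here; in practice you would read it from the model directory, e.g. a hypothetical /model/config.json):

```python
import json

# Inlined copy of the quantization_config shown above, so the example
# is self-contained; normally this would come from /model/config.json.
raw = """
{
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": ["gate"],
  "quant_method": "awq",
  "version": "gemm",
  "zero_point": true
}
"""

qcfg = json.loads(raw)

# A quant_method of "awq" (or "gptq") means the weights on disk are already
# packed low-bit tensors, not plain fp16/bf16 weights.
print(qcfg["quant_method"])                 # awq
print(qcfg["bits"], qcfg["group_size"])     # 4 128
```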
And I am getting the following error:
2024-06-03 12:56:17,367 utils.common INFO: [TensorRT-LLM] TensorRT-LLM version: 0.9.0
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.42it/s]
Traceback (most recent call last):
  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 446, in <module>
    main()
  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 438, in main
    convert_and_save_hf(args)
  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 375, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 397, in execute
    f(args, rank)
  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 362, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 244, in from_hugging_face
    llama = convert.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1192, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1296, in load_weights_from_hf
    weights = convert_hf_llama(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 964, in convert_hf_llama
    convert_layer(l)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 646, in convert_layer
    q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 399, in get_weight
    if config[prefix + '.weight'].dtype != dtype:
KeyError: 'model.layers.0.self_attn.q_proj.weight'
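The KeyError is consistent with how AWQ checkpoints are laid out: the converter's get_weight looks up a dense `...self_attn.q_proj.weight` tensor, while an AWQ (gemm) shard typically stores packed `qweight`/`qzeros`/`scales` tensors per projection and no plain `.weight`. A minimal sketch of that mismatch, with a toy dict standing in for the loaded state dict (the key layout is an assumption based on the AutoAWQ format, not taken from TensorRT-LLM itself):

```python
# Toy slice of an AWQ-quantized checkpoint's state dict: each projection has
# packed int4 weights plus per-group zeros/scales, but no dense '.weight'.
awq_state_dict = {
    "model.layers.0.self_attn.q_proj.qweight": "packed int4 tensor",
    "model.layers.0.self_attn.q_proj.qzeros": "per-group zero points",
    "model.layers.0.self_attn.q_proj.scales": "per-group scales",
}

def get_weight(params, prefix):
    # Mirrors the failing lookup in convert.py, which assumes an unquantized
    # HF checkpoint where every projection has a dense '.weight' entry.
    return params[prefix + ".weight"]

try:
    get_weight(awq_state_dict, "model.layers.0.self_attn.q_proj")
except KeyError as e:
    print("KeyError:", e)  # the same key the traceback above reports
```

In other words, this conversion path expects an unquantized HF checkpoint, which matches the maintainer's note above that only specific official quantized checkpoint layouts are supported.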