Comments (4)
This seems to be a download error. Can you check whether git and git-lfs are installed properly in your environment?
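For reference, a quick way to verify both from the command prompt (these are standard git commands, nothing MLC-specific):

git --version
git lfs version
git lfs install

The last command initializes git-lfs for your user account if it has not been set up yet.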
@tqchen Thanks for your help; installing git and git-lfs solved the problem! Now I have another one:
(llmENV) C:\Users\mypc>mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:35] INFO auto_device.py:88: Not found device: cuda:0
[2024-05-16 10:27:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-05-16 10:27:39] INFO auto_device.py:88: Not found device: metal:0
[2024-05-16 10:27:44] INFO auto_device.py:79: Found device: vulkan:0
[2024-05-16 10:27:47] INFO auto_device.py:88: Not found device: opencl:0
[2024-05-16 10:27:47] INFO auto_device.py:35: Using device: vulkan:0
[2024-05-16 10:27:47] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO download.py:133: Weights already downloaded: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO chat_module.py:781: Now compiling model lib on device...
[2024-05-16 10:27:47] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-05-16 10:27:47] INFO jit.py:120: Compiling using commands below:
[2024-05-16 10:27:47] INFO jit.py:121: 'C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\python.exe' -m mlc_llm compile 'C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll'
[2024-05-16 10:27:50] INFO auto_config.py:69: Found model configuration: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-05-16 10:27:50] INFO auto_target.py:84: Detecting target device: vulkan:0
[2024-05-16 10:27:50] INFO auto_target.py:86: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
[2024-05-16 10:27:50] INFO auto_target.py:103: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-05-16 10:27:50] INFO auto_target.py:104: Found host LLVM CPU: skylake-avx512
[2024-05-16 10:27:50] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type llama
--target {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll
--overrides context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-05-16 10:27:50] INFO config.py:106: Overriding context_window_size from 8192 to 8192
[2024-05-16 10:27:50] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
[2024-05-16 10:27:50] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-05-16 10:27:50] INFO compile.py:138: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-05-16 10:27:50] INFO compile.py:157: Exporting the model to TVM Unity compiler
[2024-05-16 10:27:57] INFO compile.py:163: Running optimizations using TVM Unity
[2024-05-16 10:27:57] INFO compile.py:182: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-05-16 10:27:59] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-05-16 10:29:49] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-05-16 10:29:59] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-05-16 10:30:25] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-05-16 10:30:28] INFO pipeline.py:52: Lowering to VM bytecode
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 8.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 11.56 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode_to_last_hidden_states`: 12.19 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 148.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_select_last_hidden_states`: 0.62 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 148.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.14 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 8.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `get_logits`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 148.01 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-05-16 10:30:36] INFO pipeline.py:52: Compiling external modules
[2024-05-16 10:30:36] INFO pipeline.py:52: Compilation complete! Exporting to disk
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 56, in <module> main()
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 25, in main
cli.main(sys.argv[2:])
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\compile.py", line 128, in main compile(
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 240, in compile
_compile(args, model_config)
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 185, in _compile
args.build_func(
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\support\auto_target.py", line 284, in build
relax.build(
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 341, in build return _vmlink(
^^^^^^^^
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 247, in _vmlink
lib = tvm.build(
^^^^^^^^^^
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\driver\build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\tvm\src\target\spirv\ir_builder.cc", line 566
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability. If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
cli.main(sys.argv[2:])
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\chat.py", line 42, in main
chat(
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\chat.py", line 134, in chat
cm = ChatModule(model, device, chat_config=config, model_lib=model_lib)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\chat_module.py", line 784, in __init__
self.model_lib = jit.jit(
^^^^^^^^
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 166, in jit
_run_jit(
File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 126, in _run_jit
raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed
(llmENV) C:\Users\mypc>
I suspect it has to do with the capabilities of the video card. Is there any way to bypass this check or manually enable the float16 feature? Or is it related to something else?
The error message says your device does not support f16, so please try a q4f32 variant of the model.
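(If you want to double-check what your driver actually reports, the Vulkan SDK ships a vulkaninfo tool that lists the device feature bits; this is a generic check, not something MLC requires, and it assumes vulkaninfo is on your PATH. On Windows:

vulkaninfo | findstr /i "shaderFloat16 storageBuffer16BitAccess"

If shaderFloat16 comes back false, the q4f32 route below is the way to go.)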
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC
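(For context: q4f32_1 keeps the weights quantized to 4 bits but runs computation in float32, so it avoids the Vulkan Float16 capability that the q4f16_1 variant requires.)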