
Comments (4)

tqchen commented on June 14, 2024

This seems to be a download error. Can you check whether git and git-lfs are installed properly in your environment?
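
For reference, one way to get both tools into a conda environment on Windows is via the conda-forge channel (a sketch, not the only route; a system-wide Git for Windows install also works):

(llmENV) C:\Users\mypc>conda install -c conda-forge git git-lfs
(llmENV) C:\Users\mypc>git lfs install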

BeytoA commented on June 14, 2024

@tqchen Thanks for your help; installing git and git-lfs solved the problem! Now I have run into another one.

(llmENV) C:\Users\mypc>mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

[2024-05-16 10:27:35] INFO auto_device.py:88: Not found device: cuda:0
[2024-05-16 10:27:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-05-16 10:27:39] INFO auto_device.py:88: Not found device: metal:0
[2024-05-16 10:27:44] INFO auto_device.py:79: Found device: vulkan:0
[2024-05-16 10:27:47] INFO auto_device.py:88: Not found device: opencl:0
[2024-05-16 10:27:47] INFO auto_device.py:35: Using device: vulkan:0
[2024-05-16 10:27:47] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO download.py:133: Weights already downloaded: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO chat_module.py:781: Now compiling model lib on device...
[2024-05-16 10:27:47] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-05-16 10:27:47] INFO jit.py:120: Compiling using commands below:
[2024-05-16 10:27:47] INFO jit.py:121: 'C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\python.exe' -m mlc_llm compile 'C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll'
[2024-05-16 10:27:50] INFO auto_config.py:69: Found model configuration: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-05-16 10:27:50] INFO auto_target.py:84: Detecting target device: vulkan:0
[2024-05-16 10:27:50] INFO auto_target.py:86: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
[2024-05-16 10:27:50] INFO auto_target.py:103: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-05-16 10:27:50] INFO auto_target.py:104: Found host LLVM CPU: skylake-avx512
[2024-05-16 10:27:50] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll
  --overrides       context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-05-16 10:27:50] INFO config.py:106: Overriding context_window_size from 8192 to 8192
[2024-05-16 10:27:50] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
[2024-05-16 10:27:50] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-05-16 10:27:50] INFO compile.py:138: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-05-16 10:27:50] INFO compile.py:157: Exporting the model to TVM Unity compiler
[2024-05-16 10:27:57] INFO compile.py:163: Running optimizations using TVM Unity
[2024-05-16 10:27:57] INFO compile.py:182: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-05-16 10:27:59] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-05-16 10:29:49] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-05-16 10:29:59] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-05-16 10:30:25] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-05-16 10:30:28] INFO pipeline.py:52: Lowering to VM bytecode
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 8.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 11.56 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode_to_last_hidden_states`: 12.19 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 148.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_select_last_hidden_states`: 0.62 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 148.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.14 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 8.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `get_logits`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 148.01 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-05-16 10:30:36] INFO pipeline.py:52: Compiling external modules
[2024-05-16 10:30:36] INFO pipeline.py:52: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 56, in <module>    main()
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 25, in main
    cli.main(sys.argv[2:])
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\compile.py", line 128, in main    compile(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 240, in compile
    _compile(args, model_config)
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 185, in _compile
    args.build_func(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\support\auto_target.py", line 284, in build
    relax.build(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 341, in build    return _vmlink(
           ^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 247, in _vmlink
    lib = tvm.build(
          ^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\driver\build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\target\spirv\ir_builder.cc", line 566
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\chat.py", line 42, in main
    chat(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\chat.py", line 134, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib=model_lib)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\chat_module.py", line 784, in __init__
    self.model_lib = jit.jit(
                     ^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 166, in jit
    _run_jit(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 126, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed

(llmENV) C:\Users\mypc>

I suspect it has to do with the capabilities of the video card. Is there any way to bypass this, or to enable the float16 feature manually? Or is it related to something else?

[Screenshot: Vulkan device capabilities]
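
One way to double-check what the driver actually reports (a sketch; it assumes the Vulkan SDK's vulkaninfo tool is installed and on PATH):

(llmENV) C:\Users\mypc>vulkaninfo | findstr /i "shaderFloat16"

If shaderFloat16 comes back as false (or the feature is absent), the GPU/driver genuinely lacks 16-bit float support and a float32 model variant would be needed.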

tqchen commented on June 14, 2024

The error message says your device does not support f16, so please try a q4f32 variant of the model.

tqchen commented on June 14, 2024

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC
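
For context: in the MLC naming scheme, the q4f32_1 suffix denotes 4-bit group-quantized weights with float32 compute, so it should avoid the Float16 capability that the Vulkan driver reports as missing.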
