Comments (5)
The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.
> The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.
Thanks for the reply! But when I specify the arch, it still outputs:
```
[10:40:22] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```
Could you please provide some help?
@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.
> @chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.
I didn't use prepare_libs.sh; I just used mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json ... to generate llama3-cuda.so.
I think the above problem is probably due to my GPU, a Quadro RTX 6000 (SM75), which lacks Flash kernels.
Likely this was due to an older variant of the GPU; you can try to build TVM and MLC from source without FlashInfer/Thrust: https://llm.mlc.ai/docs/install/tvm.html
Related Issues (20)
- [Question] Single forward pass through ChatModule HOT 1
- [Question] How do you convert .bin files to wasm. Also where are TVM_HOME and MLC_HOME located? HOT 4
- When running the mlc_chat command, it reports that the tvm module cannot be found. HOT 1
- [Model Request] Yi-1.5 HOT 2
- [Bug] Mistral MultiRound Chat Bug HOT 1
- [Bug] java.lang.NullPointerException: Attempt to invoke virtual method 'org.apache.tvm.TVMValue org.apache.tvm.Function.invoke()' on a null object reference HOT 2
- [Question] How to generate conversational template with more than one input HOT 3
- [Bug] mlc_llm.serve server mode Error when multiple(>=4) concurrent requests HOT 4
- [Model Request] Phi-3-Vision HOT 4
- [Bug] Fail to build tvm-unity from source on Orin
- [Bug] AttributeError: 'Namespace' object has no attribute 'mlc_source_dir' HOT 1
- [Bug] phi 2 q4 model doesn't work
- Phi-2 q4f16_1 runs faster when compiled without `tvm.relax.transform.FuseOps()` and `tvm.relax.transform.FuseTIR()` transformations HOT 2
- [Question] KVCache: 0.00 MB when compiling models HOT 3
- [Bug] mlc_llm chat not working: ValueError: Cannot find global var "multinomial_from_uniform1" in the Module HOT 7
- [Feature Request] phi-3 small released -> performs two times better than Phi-3 mini
- [Bug]: Error on "The block is 1-time referenced by other blocks, thus cannot accept new KV values." HOT 2
- [Doc] Python API KV/memory reset details absent HOT 6
- [Question] Why `Chat`and `Completion` have different structure in `engine.py`? HOT 2
- Unable to serve Mistral-7B-Instruct-v0.3 HOT 3