
[Bug] LlaMa-3 doesn't work (mlc-llm, open, 5 comments)

chongkuiqi commented on June 6, 2024

Comments (5)

vinx13 commented on June 6, 2024

The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.
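
Something along these lines may work (a sketch only: "cuda -arch=sm_75" is standard TVM target syntax, but whether your version's --device flag accepts a raw target string, and the exact model path, are assumptions; check mlc_llm compile --help):

    # Hypothetical invocation pinning the CUDA arch in the target;
    # replace sm_75 with your card's compute capability and adjust the path.
    mlc_llm compile ./dist/model/mlc-chat-config.json \
        --device "cuda -arch=sm_75" -o model-cuda.so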

chongkuiqi commented on June 6, 2024

> The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.

Thanks for the reply! But when I specify the arch, it still outputs:

[10:40:22] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
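
For reference, the compute capability the driver reports can be checked like this (compute_cap is a documented nvidia-smi query field on recent drivers; older drivers may not support it):

    # Prints the GPU name and compute capability, e.g. "7.5" for an SM75 card.
    # If this value is not among the arches the model library was compiled for,
    # CUDA raises exactly "no kernel image is available for execution on the device".
    nvidia-smi --query-gpu=name,compute_cap --format=csv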

Could you please provide some help?

NSTiwari commented on June 6, 2024

@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

chongkuiqi commented on June 6, 2024

> @chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

I didn't use prepare_libs.sh; I just used mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json ... to generate llama3-cuda.so.
I think the above problem is probably due to my GPU, a Quadro RTX 6000 (SM75), which lacks Flash kernels.
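
One way to confirm a mismatch would be to inspect which SM architectures the generated library actually embeds (cuobjdump ships with the CUDA toolkit; whether it can locate TVM's embedded device blobs is an assumption, so an empty listing here is not conclusive):

    # List the embedded device ELF images and their arches, e.g. sm_80 vs. sm_75.
    cuobjdump --list-elf llama3-cuda.so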

tqchen commented on June 6, 2024

This is likely due to an older variant of the GPU; you can try building TVM and MLC from source without flashinfer/thrust: https://llm.mlc.ai/docs/install/tvm.html
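
Roughly like this (a sketch: USE_THRUST and USE_CUDA are standard options in TVM's config.cmake, and USE_FLASHINFER is the analogous switch in mlc-llm's own cmake config; flag names are assumed from current sources, so see the linked doc for the full clone/build procedure):

    # In TVM's build/ directory, after copying cmake/config.cmake in per the docs:
    echo "set(USE_CUDA ON)" >> config.cmake
    echo "set(USE_THRUST OFF)" >> config.cmake   # avoids the thrust inclusive_scan path that failed above
    cmake .. && make -j"$(nproc)"

    # Likewise in mlc-llm's config.cmake when building it from source (name assumed):
    #   set(USE_FLASHINFER OFF)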
