
Comments (16)

tqchen avatar tqchen commented on June 1, 2024

I think we could update the engine to have a runtime compatibility check, so we can leverage the FlashInfer wheel. However, looking a bit more: because there are so many compute capability variants, each of which increases binary size, we had to cut the older compute capabilities from the single wheel. So a separate build is likely needed, and building from source is probably the best course of action as of now. We can also consider a runtime option to turn off FlashInfer.
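The runtime compatibility check mentioned above could look something like the sketch below (hypothetical helper names, not mlc-llm's actual API; the compute capability string would come from the driver, e.g. TVM's device query or `nvidia-smi`). FlashInfer's kernels require SM80 (Ampere) or newer:

```python
# Sketch of a runtime FlashInfer compatibility check (hypothetical helpers,
# not part of mlc-llm). FlashInfer requires compute capability 8.0+.

FLASHINFER_MIN_SM = (8, 0)  # SM80, i.e. Ampere (A100) and newer

def parse_compute_version(version: str) -> tuple[int, int]:
    """Parse a compute capability string like "7.5" into (major, minor)."""
    major, minor = version.split(".")
    return int(major), int(minor)

def flashinfer_supported(version: str) -> bool:
    """True if this GPU's compute capability can run FlashInfer kernels."""
    return parse_compute_version(version) >= FLASHINFER_MIN_SM

# At engine startup, an engine could fall back to non-FlashInfer attention
# kernels on pre-SM80 GPUs instead of failing.
for cc in ("6.1", "7.5", "8.0", "8.9"):
    backend = "flashinfer" if flashinfer_supported(cc) else "fallback"
    print(cc, "->", backend)
```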

from mlc-llm.

bayley avatar bayley commented on June 1, 2024

Looks like the problem is in MLCEngine - this is a minimal reproducer (using the latest nightlies):

from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run a chat completion through the OpenAI-style API, streaming the response.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

bayley avatar bayley commented on June 1, 2024

I did confirm the script works on an A4000 RunPod instance, so this is definitely a bug related to pre-SM80 GPUs. I'm happy to help fix it (chat works and performs well, so this is clearly possible...) if someone gives me some guidance on where to start.

I also tried rebuilding mlc-llm from source (but using the prebuilt mlc-ai nightly wheel) and it didn't help.

tqchen avatar tqchen commented on June 1, 2024

Likely we need to build without FlashInfer (it does not support pre-SM80 GPUs).

@Hzfengsy has some follow-up replacements that might help.

bayley avatar bayley commented on June 1, 2024

Thanks. What exactly do I need to rebuild without flashinfer? I tried explicitly disabling flashinfer (and cutlass) during model lib compilation but it didn't help.

tqchen avatar tqchen commented on June 1, 2024

For now, we might need to build mlc-llm from source with FlashInfer set to OFF:

https://llm.mlc.ai/docs/install/mlc_llm.html#option-2-build-from-source
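For reference, the build-from-source flow reads compile options from a config.cmake file. Assuming option names matching those docs (verify against your checkout), a pre-SM80 build would disable FlashInfer, and likely CUTLASS as well:

```cmake
# Excerpt of a config.cmake for a pre-SM80 (e.g. Pascal/Turing) build.
# Option names assumed to match the mlc-llm build docs.
set(USE_CUDA ON)
set(USE_FLASHINFER OFF)   # FlashInfer requires SM80+
set(USE_CUTLASS OFF)      # CUTLASS kernels also target newer architectures
```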

bayley avatar bayley commented on June 1, 2024

I tried that and it didn't help; I can go back and double-check my build settings to make sure, though. I did use the prebuilt mlc-ai wheel; could that be a problem? (mlc-llm built from source, mlc-ai from the prebuilt wheel.)

tqchen avatar tqchen commented on June 1, 2024

Ah yes, we also need to build TVM from source, since FlashInfer is packaged into the TVM runtime there.

bayley avatar bayley commented on June 1, 2024

I see, thanks; I'll give it a try. Does it make sense to provide prebuilt wheels that are built without FlashInfer? Pascal users especially could benefit (the used P40 is a popular hobbyist GPU for LLMs).

bayley avatar bayley commented on June 1, 2024

Success! I rebuilt TVM from source following the instructions in the docs (I had to install libzstd-dev through apt), and now MLCEngine works.

bayley avatar bayley commented on June 1, 2024

When FlashInfer is disabled, what prefill algorithm is used? I noticed a pretty long prompt-processing time on Llama-70B and was wondering if it internally uses memory-efficient attention (xformers/PyTorch SDPA) or the naive algorithm.
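For context on why the prefill algorithm matters: naive attention materializes the full n×n score matrix for an n-token prompt, so prefill cost grows quadratically with prompt length, while FlashAttention-style kernels tile the computation to avoid that. A toy single-head illustration in plain Python (not how mlc-llm implements attention):

```python
import math

def naive_attention(q, k, v):
    """Toy single-head attention that materializes full score rows.

    q, k, v are lists of equal-length float vectors. Prefilling an n-token
    prompt does O(n^2 * d) work and holds O(n^2) scores, which is why long
    prompts are slow without a tiled (FlashAttention-style) kernel.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # One full row of the n x n score matrix per query token.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out

# With identical keys the weights are uniform, so the output is the mean of
# the value vectors: [2.0, 1.0] here.
print(naive_attention([[1.0, 0.0]],
                      [[1.0, 0.0], [1.0, 0.0]],
                      [[0.0, 2.0], [4.0, 0.0]]))
```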

tqchen avatar tqchen commented on June 1, 2024

We use a TensorIR variant of FlashInfer that normally reaches 80 to 90 percent of FlashInfer's efficiency. Note this is for decode; we still need to confirm prefill.

bayley avatar bayley commented on June 1, 2024

OK. Does the REST server store stats like the chat interface does? It would be useful to check prefill tokens per second, etc., for benchmarking.
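In the meantime, prefill throughput can be estimated client-side: in streaming mode, the time to the first streamed chunk approximates prefill latency. A rough sketch (the helper and the usage names are illustrative, not part of the mlc_llm API):

```python
import time

def throughput(num_tokens: int, seconds: float) -> float:
    """Tokens per second over a measured interval."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return num_tokens / seconds

# Usage sketch against a streaming engine (engine and num_prompt_tokens are
# stand-ins for real objects/values):
#   t0 = time.perf_counter()
#   first_chunk_at = None
#   for response in engine.chat.completions.create(..., stream=True):
#       if first_chunk_at is None:
#           first_chunk_at = time.perf_counter()  # prefill roughly done here
#   prefill_tps = throughput(num_prompt_tokens, first_chunk_at - t0)
```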

tqchen avatar tqchen commented on June 1, 2024

The stats are still WIP, but indeed that is a great suggestion.

Nero10578 avatar Nero10578 commented on June 1, 2024

I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? Seems like especially Pascal users could benefit (used P40 is a popular hobby GPU for LLMs)

Also interested in this for running on Tesla P40s.

bayley avatar bayley commented on June 1, 2024

I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? Seems like especially Pascal users could benefit (used P40 is a popular hobby GPU for LLMs)

Also interested in this for running on Tesla P40s.

It's pretty easy to run the build yourself; just make sure you have the right version of LLVM when building TVM, or else you'll get confusing errors.
