GithubHelp home page GithubHelp logo

fluder-paradyne / vllm Goto Github PK

View Code? Open in Web Editor NEW

This project forked from vllm-project/vllm

0.0 0.0 0.0 3.4 MB

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://vllm.readthedocs.io

License: Apache License 2.0

Shell 0.55% C++ 0.52% Python 82.34% C 0.08% Cuda 16.28% Dockerfile 0.23%

vllm's Introduction

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Latest News ๐Ÿ”ฅ

  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan (baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Chat, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi-1.5 (microsoft/phi-1_5, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

vllm's People

Contributors

woosukkwon avatar zhuohan123 avatar yard1 avatar liuxiaoxuanpku avatar hermitsun avatar sanster avatar gesanqiu avatar esmeetu avatar wrran avatar twaka avatar beginlner avatar ftgreat avatar codethazine avatar michaelvll avatar zxdvd avatar wangruohui avatar oliver-ss avatar suquark avatar simon-mo avatar moeeddar avatar eltociear avatar blahblahasdf avatar tmm1 avatar wanmok avatar wenfeiy-db avatar cauyxy avatar yhpeter avatar ymwangg avatar yunfeng-scale avatar zhangir-azerbayev avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.