Comments (4)
Specific optimizations for smaller models (~100M parameters).
- Improve sampling efficiency.
- We may need to merge more models.
This should not be prioritized: CacheFlow's core technique (memory saving) is not helpful for small models at all, though they may still benefit from iteration-level scheduling.
from vllm.
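The iteration-level scheduling mentioned above can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual scheduler: the `Request` class, the `step` placeholder, and the `max_batch_size` parameter are all hypothetical names chosen for this example. The point is that the batch is re-formed at every decoding iteration, so finished requests leave immediately and waiting requests join without waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def step(batch):
    # Placeholder for one model forward pass: each running request
    # receives exactly one new token per iteration.
    for req in batch:
        req.tokens.append("<tok>")

def schedule(requests, max_batch_size=8):
    """Iteration-level scheduling: rebuild the running batch on every
    decoding step instead of once per batch of requests."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit waiting requests up to the batch-size limit.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step(running)  # one decoding iteration for the whole batch
        # Retire requests that hit their token budget this iteration;
        # the freed slots are refilled at the top of the next loop.
        finished += [r for r in running if len(r.tokens) >= r.max_new_tokens]
        running = [r for r in running if len(r.tokens) < r.max_new_tokens]
    return finished

done = schedule([Request("short", 2), Request("long", 5)])
```

With request-level scheduling, the short request would be held in the batch until the long one finished; here it exits after its second iteration. For small models the per-step compute is cheap, so this scheduling win matters more than KV-cache memory savings.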
After the C++ version, we might need to rerun all the experiments with the new implementation.
@zhuohan123 can this work be considered complete?
If you're interested: I've been building a C++-native deep learning framework for the past few years that I hope to open-source soon. It aims for optimal performance; here is the framework training AlexNet, where most of the kernels are cuBLASLt and cuDNN:
I'd certainly like for it to be part of vLLM. Is this something there'd be interest in? If so, I can make sure I support (or add support for) all of the pieces you need and get it connected. I can provide access to the private repo on request.
Related Issues (20)
- [Bug]: When enforce_eager is True or False, the paged_attention version used is inconsistent
- VRAM usage when loading the model
- [Bug]: Segmentation fault encountered while running model
- [New Model]: fastspeech2_conformer (just needs a new attention mechanism: RelPositionMultiHeadedAttention)
- [New Model]: Blip2 support required
- [Bug]: Why the logits differ between 0.4.1 and 0.4.2
- [Bug]: SqueezeLLM with sparse could not work
- [Bug]: ValueError when using LoRA with CohereForCausalLM model
- [RFC]: Support specifying quant_config details in the LLM or Server entrypoints
- [Usage]: vLLM AutoAWQ with 4 GPUs doesn't utilize GPU
- [Usage]: How to batch requests to chat models with OpenAI server?
- [Usage]: prompt_logprompt from endpoint
- Regression in support of customized "role" in OpenAI-compatible API (v0.4.2)
- [Bug]: CUDA error when running mistral-7b + LoRA with tensor_para=8
- [Performance]: Why the avg. throughput generation is low?
- [Feature]: Support W4A8KV4 quantization (QServe/QoQ)
- [Feature]: Could paged_attention_v1 support parameter 'attn_bias'
- [Feature]: CI: Test on NVLink-enabled machine
- [Feature]: Host CPU Docker image on Docker Hub
- [Bug]: Unexpected special tokens in prompt_logprobs output for Llama3 prompt