Comments (2)
APC improves TPOT because there is less overhead on the system from processing prefills.
vLLM has a central LLMEngine, which runs a step each timestep. A step can be either a prefill or a decode. So a good mental model for vLLM is that it is constantly running decodes for all active requests, with "pauses" to process prefills from new requests and add them to the batch.
As a result:
TPOT = (n_tokens_generated * time_per_decode_step + n_prefills_processed * time_per_prefill_step) / n_tokens_generated
Since APC reduces time_per_prefill_step, TPOT is reduced. Additionally, APC indirectly reduces (a) n_prefills_processed and (b) time_per_decode_step: because it reduces E2E latency, (a) there are on average fewer prefills to process while a specific request is running and (b) the average batch size is lower.
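A minimal sketch of this mental model, assuming made-up timings rather than measured ones. The function name `tpot` and all the numbers are hypothetical. The total generation time is divided by the number of generated tokens to get a per-token figure.

```python
def tpot(n_tokens_generated: int,
         time_per_decode_step: float,
         n_prefills_processed: int,
         time_per_prefill_step: float) -> float:
    """Per-token latency under the 'decodes with prefill pauses' model."""
    # Time this request spends in its own decode steps.
    decode_time = n_tokens_generated * time_per_decode_step
    # Time it spends waiting while other requests' prefills are processed.
    prefill_pause_time = n_prefills_processed * time_per_prefill_step
    return (decode_time + prefill_pause_time) / n_tokens_generated


# Illustrative numbers only (seconds); not taken from any benchmark.
without_apc = tpot(20, 0.030, 5, 0.200)
# Assumption for illustration: APC makes each interleaved prefill cheaper.
with_apc = tpot(20, 0.030, 5, 0.050)

print(f"TPOT without APC: {without_apc * 1000:.1f} ms")
print(f"TPOT with APC:    {with_apc * 1000:.1f} ms")
```

In this toy setup only time_per_prefill_step changes; the indirect effects on n_prefills_processed and time_per_decode_step would lower the result further.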
from vllm.
@robertgshaw2-neuralmagic Thank you for your reply~
But in my nsys analysis, the only kernel that got faster in the decoding stage is the paged attention kernel; the other kernels take about the same time. Is this reasonable? Why?
kernel name:
void vllm::paged_attention_v1_kernel<unsigned short, unsigned short, (int)128, (int)16, (int)128, (bool)0>(T1 *, const T1 *, const T2 *, const T2 *, int, float, const int *, const int *, int, const float *, int, int, int, float)
Time cost with APC vs. without APC:
252.511 us vs. 382.047 us
Setup: 13B model, input_len 300 / output_len 20, batch size 100, on A100 with TP=1
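For reference, the relative change implied by these two kernel timings can be computed as in the sketch below. The helper name `relative_reduction` is hypothetical; the two numbers are the ones quoted above.

```python
def relative_reduction(baseline_us: float, new_us: float) -> float:
    """Fraction of the baseline time that was saved."""
    return (baseline_us - new_us) / baseline_us


without_apc_us = 382.047  # paged_attention_v1_kernel time without APC
with_apc_us = 252.511     # paged_attention_v1_kernel time with APC

reduction = relative_reduction(without_apc_us, with_apc_us)
speedup = without_apc_us / with_apc_us

print(f"{reduction:.1%} less kernel time, {speedup:.2f}x faster")
```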