Comments (8)
I am looking for a better method of measuring the prefill/context stage. The prefill numbers are unfortunately not accurate; as I noted in the pull request, they come out roughly 3x better under the old measurement method.
Additionally, the claimed 2x speedup assumes a strong GPU and CPU combination; speed will vary with both. It can also vary with batch size and context size when using the `model.generate` functionality, which adds overhead from the transformers library.
My conclusion is that we should rework how we run the benchmark to avoid all of this overhead. That will be a future improvement.
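For reference, here is a rough sketch of how the two stages could be timed separately, with explicit CUDA syncs so the prefill number is not polluted by generation-loop overhead. This is not AutoAWQ's shipped benchmark; it assumes a transformers-style causal LM on GPU and greedy decoding.

```python
import time
import torch

# Rough sketch (not AutoAWQ's actual benchmark): time prefill and decode
# separately. Assumes a transformers-style causal LM already on the GPU.
@torch.inference_mode()
def time_stages(model, input_ids, n_generate):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)  # prefill: whole context in one pass
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    past = out.past_key_values
    token = out.logits[:, -1:].argmax(dim=-1)  # greedy next token, shape (batch, 1)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_generate):  # decode: one token per step, reusing the KV cache
        out = model(token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        token = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    decode_s = time.perf_counter() - t0

    batch, n_context = input_ids.shape
    return (batch * n_context) / prefill_s, (batch * n_generate) / decode_s  # tokens/s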
Yeah, the prefill numbers can't be accurate, since the quantized model is not expected to be slower than the FP16 model.
The decode-stage numbers also seem counterintuitive at small batch sizes, where there is no significant improvement.
Hoping for better performance in the next version!
There can be many reasons for a slow-down as batch size increases. It could be overhead from `model.generate` or from the CUDA kernel. Soon, I will experiment with other standard generation methods, like the one from nanoGPT, and check all the numbers again.
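To illustrate, here is a hedged sketch of such a nanoGPT-style loop, adapted to drive a transformers-style model with a KV cache; the function name and everything beyond the standard transformers forward-call signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a nanoGPT-style sampling loop. Step 0 is the prefill over
# the whole prompt; every later step feeds one token and reuses the KV cache.
@torch.inference_mode()
def simple_generate(model, input_ids, max_new_tokens, temperature=1.0, top_k=None):
    tokens, inp, past = input_ids, input_ids, None
    for _ in range(max_new_tokens):
        out = model(inp, past_key_values=past, use_cache=True)
        past = out.past_key_values
        logits = out.logits[:, -1, :] / temperature
        if top_k is not None:  # nanoGPT-style top-k filtering
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")
        probs = F.softmax(logits, dim=-1)
        inp = torch.multinomial(probs, num_samples=1)  # sampled next token, shape (batch, 1)
        tokens = torch.cat([tokens, inp], dim=1)
    return tokens
```

Timing this bare loop against `model.generate` on the same inputs should show how much of the slow-down is library overhead versus kernel time.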
Looking forward to seeing that!
I ran tests today on Vicuna 7B with an A30 GPU + AMD EPYC 7402P CPU (a weak GPU+CPU combination). I'm still working on better speed measurement! The speedup will be higher when you use BOTH a strong GPU and a strong CPU.
| Batch Size | n_generate | n_context | FP16 (tokens/s) | INT4 (tokens/s) | Speedup |
|---|---|---|---|---|---|
| 1 | 32 | 64 | 34 | 67 | 1.97 |
| 1 | 64 | 128 | 34 | 68 | 2.00 |
| 1 | 128 | 256 | 35 | 74 | 2.11 |
| 1 | 256 | 512 | 35 | 69 | 1.97 |
| 1 | 512 | 1024 | 35 | 64 | 1.83 |
| 1 | 1024 | 2048 | 33 | 53 | 1.61 |
| 2 | 32 | 64 | 68 | 125 | 1.84 |
| 2 | 64 | 128 | 68 | 130 | 1.91 |
| 2 | 128 | 256 | 64 | 132 | 2.06 |
| 2 | 256 | 512 | 65 | 120 | 1.85 |
| 2 | 512 | 1024 | 64 | 100 | 1.56 |
| 2 | 1024 | 2048 | 54 | 75 | 1.39 |
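To be explicit about my reading of the columns: tokens/s counts generated tokens per second of decode time with batch size already folded in, and the Speedup column is just the INT4 throughput over the FP16 throughput, as in this small sketch:

```python
# Assumed relationship between the columns above (tokens/s already includes
# the batch dimension, i.e. batch_size * n_generate / decode_seconds):
def speedup(fp16_tok_s: float, int4_tok_s: float) -> float:
    return int4_tok_s / fp16_tok_s

print(round(speedup(34, 67), 2))  # 1.97 -- matches the first row
```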
I have run the full benchmark again with llama-13b.
- The prefill-stage numbers can't be accurate; the measurement method should be improved.
- In some cases, GPU memory usage does not decrease significantly, and in some cases it even increases.
New kernels and more fused layers are coming, making the models even faster. Prefill stage measurement will be optimized at a later stage.
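On the memory observation above: peak usage is easy to misread if the allocator's high-water mark isn't reset between the FP16 and INT4 runs. Here is a minimal sketch using PyTorch's allocator statistics, not AutoAWQ's actual measurement; `run_generation` is a placeholder for one benchmark pass.

```python
import torch

# Sketch: capture peak GPU memory for a single benchmark run. This only sees
# allocations made through PyTorch's caching allocator; memory grabbed
# directly by external CUDA extensions may not be counted.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

run_generation()  # placeholder: one FP16 or INT4 generation pass

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```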
Updated the performance data based on version d76125b, released on Sep 13.
![Updated performance data for d76125b](https://private-user-images.githubusercontent.com/24476563/269256679-d5530cda-d841-4cb0-8a30-6f31da8be369.png)