Benchmarks of different devices I have come across. This repo was migrated from this gist: https://gist.github.com/chsasank/407df67ac0c848d6259f0340887648a9#file-benchmark-py
I will keep adding interesting benchmarks of the devices I come across.
I have written a quick PyTorch script to benchmark GPUs and CPUs. It uses fp32 matrix multiplication to measure FLOPS (floating point operations per second) and copies a large tensor to measure memory bandwidth. These two are the most important metrics for LLM inference. Read this blog for more details.
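The measurement idea can be sketched roughly as below. This is not the repo's `benchmark.py`: the names `time_call`, `device_sync`, `matmul_tflops` and `copy_bandwidth_gbs`, the repeat count, and the default sizes are all assumptions made for illustration. The sketch also assumes an n×n matmul costs 2n³ floating point operations and that a copy moves every byte twice (one read, one write).

```python
import time


def time_call(fn, sync=lambda: None, repeats=10):
    """Average wall-clock seconds per call of fn over `repeats` runs."""
    fn()    # warm-up call so one-time setup cost is not measured
    sync()  # wait for queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()  # wait for the timed work to actually finish
    return (time.perf_counter() - start) / repeats


def device_sync(device):
    """Return a callable that blocks until `device` is idle."""
    import torch  # imported lazily; assumes PyTorch is installed

    if str(device).startswith("cuda"):
        return torch.cuda.synchronize
    # Assumption: CPU needs no sync. Other backends (xpu, mps) would
    # need their own synchronize call here.
    return lambda: None


def matmul_tflops(device="cpu", n=1024):
    """fp32 matmul throughput in TFLOPS, counting 2 * n**3 operations."""
    import torch

    a = torch.randn(n, n, dtype=torch.float32, device=device)
    b = torch.randn(n, n, dtype=torch.float32, device=device)
    elapsed = time_call(lambda: a @ b, device_sync(device))
    return 2 * n**3 / elapsed / 1e12


def copy_bandwidth_gbs(device="cpu", n_bytes=2**26):
    """Copy bandwidth in GB/s, counting each byte as one read plus one write."""
    import torch

    src = torch.empty(n_bytes // 4, dtype=torch.float32, device=device)
    dst = torch.empty_like(src)
    elapsed = time_call(lambda: dst.copy_(src), device_sync(device))
    return 2 * n_bytes / elapsed / 1e9
```

Calling `matmul_tflops("cpu", 2048)` or `copy_bandwidth_gbs("cpu")` should give ballpark figures for a CPU. The explicit sync step matters on GPUs, where kernels are launched asynchronously and the clock would otherwise stop before the work is done.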
Here's an example run of `benchmark.py` (`elapsed_time` is in seconds, `flops` in TFLOPS, `bandwidth` in GB/s):
```
(intel) sasank@ubuntu-22-04:~/code/device-benchmarks$ python benchmark.py --device xpu
benchmarking xpu
size, elapsed_time, flops
256, 0.011420178413391113, 0.00293817055963457
304, 0.0003251314163208008, 0.1728191284491545
362, 0.00033059120178222654, 0.28698844823613445
430, 0.0003793954849243164, 0.4191246504468045
512, 0.00037815570831298826, 0.7098543010167228
608, 0.008894515037536622, 0.05053804756110619
724, 0.0004009723663330078, 1.8929156014947033
861, 0.0005517244338989258, 2.3137542649304907
1024, 0.0006966352462768555, 3.0826514441770736
1217, 0.001168060302734375, 3.0862881116333902
1448, 0.001726818084716797, 3.516325684645487
1722, 0.0028204917907714844, 3.620800503449297
2048, 0.016068482398986818, 1.0691656347760172
2435, 0.008600807189941407, 3.35728090542124
2896, 0.013591170310974121, 3.5741173983212615
3444, 0.024279212951660155, 3.3649980718346795
4096, 0.03385140895843506, 4.060065967734354
4870, 0.06302995681762695, 3.6649653222576513
5792, 0.10398786067962647, 3.737085306267269
6888, 0.17345609664916992, 3.7680776333042645
size (GB), elapsed_time, bandwidth
0.004194304, 0.0003708839416503906, 22.61787868914374
0.00593164, 0.0004174232482910156, 28.42026659648161
0.008388608, 0.000445866584777832, 37.62833226975242
0.01186328, 0.0003901243209838867, 60.81794628994683
0.016777216, 0.00044062137603759763, 76.15252873509442
0.023726564, 0.0005816459655761719, 81.58421240486638
0.033554432, 0.0007857322692871094, 85.40932659019785
0.047453132, 0.0010800123214721679, 87.87516782274575
0.067108864, 0.0014967203140258789, 89.67455492000445
0.094906264, 0.002076077461242676, 91.42844211910285
0.134217728, 0.0029109954833984376, 92.21431552570304
0.189812528, 0.004096579551696777, 92.66878653504035
0.268435456, 0.005767607688903808, 93.0838123808032
0.37962506, 0.008129024505615234, 93.39990542229731
```
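To make the columns easier to read, here is a small check of how they relate to each other. The formulas are inferred from the numbers in the run above rather than taken from `benchmark.py`, and the function names are made up for this example. The `flops` column is consistent with 2n³ operations per n×n matmul, reported in TFLOPS. The `bandwidth` column is consistent with counting each copied byte twice (one read, one write), reported in GB/s.

```python
def tflops(n, elapsed_s):
    """TFLOPS for an n x n matmul, counting 2 * n**3 operations."""
    return 2 * n**3 / elapsed_s / 1e12


def bandwidth_gbs(size_gb, elapsed_s):
    """GB/s for a copy of size_gb, counting each byte as read once and written once."""
    return 2 * size_gb / elapsed_s


# Rows copied from the example run above.
print(tflops(1024, 0.0006966352462768555))                # about 3.08
print(bandwidth_gbs(0.004194304, 0.0003708839416503906))  # about 22.6
```

Small sizes are dominated by launch and setup overhead, so the figures near the end of each series (around 3.7 to 4 TFLOPS and 93 GB/s here) are the ones worth comparing across devices.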
Here's a summary of the data I have collected for different devices:
Device | Device Type | TFLOPS (FP32) | Memory Bandwidth (GB/s) |
---|---|---|---|
Apple M1 CPU | CPU | 0.8 | 46 |
Apple M1 GPU | GPU | 1.4 | 56 |
Apple M2 CPU | CPU | 1 | 60 |
Apple M2 GPU | GPU | 2 | 90 |
Apple M2 Ultra CPU | CPU | 4 | 311 |
Apple M2 Ultra GPU (76 Core) | GPU | 20 | 636 |
SteamDeck CPU | CPU | 0.17 | 20 |
SteamDeck GPU | GPU | 1.22 | 69 |
Samsung Exynos 2100 | CPU | 0.1 | 16 |
Intel i7-8559U | CPU | 0.2 | 10 |
Intel i7-1360P | CPU | 0.4 | 24 |
Intel i5-12400 | CPU | 0.7 | 26 |
Intel i9-13900K (WSL2) | CPU | 1.2 | 49 |
Intel Xeon Silver 4116 | CPU | 0.5 | 20 |
AMD Ryzen Threadripper 3960X 24-Cores | CPU | 1.4 | 44 |
AMD Ryzen Threadripper PRO 5975WX 32-Cores | CPU | 1.5 | 28 |
AMD Ryzen 5 4600HS | CPU | 0.4 | 22 |
AMD Ryzen 9 5900X | CPU | 1.3 | 29 |
AMD Ryzen 9 7950X | CPU | 1.1 | 28 |
AMD Epyc 7763 Engineering Sample | CPU | 3.2 | 115 |
AMD Epyc 7262 | CPU | 0.5 | 80 |
Nvidia T4 | GPU | 4 | 240 |
Nvidia GeForce GTX 1650 Ti Mobile | GPU | 3 | 172 |
Intel Arc 770 16GB | GPU | 15 | 452 |
Intel Arc 370m | GPU | 4 | 93 |
Nvidia 3090 | GPU | 27 | 831 |
Nvidia 4090 | GPU | 58 | 912 |
Nvidia 4090 (WSL2) | GPU | 53 | 885 |
Nvidia 4090 Super | GPU | 23 | 411 |
Radeon RX 7900 XTX | GPU | 26 | 792 |