camel-cdr / rvv-bench


A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code

License: MIT License

Makefile 1.33% C 44.83% Assembly 53.34% Shell 0.20% C++ 0.31%

benchmark risc-v rvv

rvv-bench's Issues

Testing on Canaan K230

Hi, I really like your project. I'm writing some benchmarks myself to test the RVV capabilities of the K230 for a project.

First, I wanted to compile the memcpy example provided by the spec. When I run it following the instructions in the K230 docs, it fails to find the functions for some reason, even though I'm using the riscv64 option in the cross-compiler and the v flag too.

This is the error for more context:

Latency vs throughput

The current benchmarks mix latency and throughput tests (e.g. vadd.vv v8,v16,v24 measures throughput, but vadd.vv v8,v16,v24,v0.t measures latency, due to the chained dependency on v8). It would be useful to have separate tests for throughput and latency, especially for the reductions. It's by no means trivial, though: measuring throughput for destructive instructions means cycling the output registers, and doing that well depends on LMUL, while latency tests need a manual choice of which operand(s) to test.

(Also, a side note: the Sipeed results page says it has a 256-bit ALU, but from the benchmarks it looks like a 128-bit ALU?)
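
To make the register-cycling point concrete, here is a minimal sketch of how a generator could pick operands for an unrolled test body. The helper names (`num_dest_groups`, `dest_group`, `format_insn`) and the register layout (group 0 skipped because it overlaps the mask register v0, top group used as the shared source) are assumptions for illustration, not how rvv-bench does it.

```c
#include <assert.h>
#include <stdio.h>

/* Destination groups left when group 0 (overlaps v0, the mask register)
 * and the top group (used as the shared source) are reserved. */
static int num_dest_groups(int lmul) {
  assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
  return 32 / lmul - 2;
}

/* Destination register number for the i-th unrolled instruction.
 * latency != 0: always the same group, so each instruction waits on the last.
 * latency == 0: rotate over every free LMUL-aligned group. */
static int dest_group(int lmul, int i, int latency) {
  assert(i >= 0);
  if (latency)
    return lmul;
  return lmul * (1 + i % num_dest_groups(lmul));
}

/* Write the i-th instruction as text; returns the snprintf result. */
static int format_insn(char *buf, size_t n, const char *mnemonic, int lmul,
                       int i, int latency) {
  int src = 32 - lmul; /* top group: shared source operand */
  int dst = dest_group(lmul, i, latency);
  if (latency) /* destination is also a source: one dependency chain */
    return snprintf(buf, n, "%s v%d, v%d, v%d", mnemonic, dst, dst, src);
  return snprintf(buf, n, "%s v%d, v%d, v%d", mnemonic, dst, src, src);
}
```

With this layout, LMUL=1 leaves 30 destination groups to rotate over but LMUL=8 leaves only two (v8 and v16), which may not be enough independent instructions to hide the latency of a destructive instruction. That is one way to see why the rotation scheme has to depend on LMUL.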

Missing valid instructions

Widening reductions (vfwredosum.vs, vfwredusum.vs, vwredsum.vs, vwredsumu.vs) should allow LMUL=8
vrgatherei16.vv should only disallow LMUL=8 for e8
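
The vrgatherei16.vv rule follows from the index operand always using EEW=16, so its register group has EMUL = (16 / SEW) * LMUL, which must stay within 1/8 to 8. A minimal sketch of that check is below; the function names are made up for illustration, and the check covers only the index EMUL, not whether the SEW/LMUL combination itself is supported on a given core. The widening reductions write their result to a single register, which is why LMUL=8 sources do not run into a similar limit.

```c
#include <assert.h>
#include <stdbool.h>

/* log2 of a power-of-two SEW in {8, 16, 32, 64}. */
static int sew_log2(int sew) {
  assert(sew == 8 || sew == 16 || sew == 32 || sew == 64);
  int l = 0;
  while ((1 << l) < sew)
    l++;
  return l;
}

/* vrgatherei16.vv: the index operand uses EEW=16, so
 * EMUL = (16 / SEW) * LMUL, i.e. log2(EMUL) = 4 - log2(SEW) + log2(LMUL).
 * lmul_log2 ranges from -3 (mf8) to 3 (m8). */
static bool vrgatherei16_emul_ok(int sew, int lmul_log2) {
  assert(lmul_log2 >= -3 && lmul_log2 <= 3);
  int emul_log2 = 4 - sew_log2(sew) + lmul_log2;
  return emul_log2 >= -3 && emul_log2 <= 3;
}
```

For integer LMUL, the only combination this rejects is SEW=8 with LMUL=8 (index EMUL would be 16), matching the report above.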

Issue with 'illegal instruction' when using bench with spike

Hello,
I am trying to run the benchmarks on Spike. After running 'make all', I hit the following problem when executing the generated binary with Spike:

li@h107:~/rvv-bench/bench$ ~/tools/riscv-isa-sim/build/spike --isa=rv64gcv1p0 -l --log-commits --log="memcpy.spike" `which pk` memcpy
bbl loader
z  0000000000000000 ra 0000000000000000 sp 0000003ffffffb40 gp 0000000000000000
tp 0000000000000000 t0 0000000000000000 t1 0000000000000000 t2 0000000000000000
s0 0000000000000000 s1 0000000000000000 a0 0000000000014048 a1 0000000000014000
a2 0000000000000000 a3 0000000000000000 a4 0000000000000000 a5 0000000000000000
a6 0000000000000000 a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
pc 000000000001134c va/inst 00000000c00025f3 sr 8000000200006620
An illegal instruction was executed!

The relevant portion in the log file is as follows:

   core   0: 0x000000000001134c (0xc00025f3) csrr    a1, cycle
   core   0: exception trap_illegal_instruction, epc 0x000000000001134c
   core   0:           tval 0x00000000c00025f3
   core   0: >>>>  trap_vector

What could be the cause of this issue, and do you have any suggestions for resolving it?
By the way, I'm using the following version of the clang compiler:

li@h107:~/rvv-bench/bench$ clang -v
clang version 15.0.0 (https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi.git 142ea58f56d9622cc03d43e6ecffd9634d801546)
Target: riscv64-unknown-linux-gnu
Thread model: posix
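
The trap in the log above is raised by `csrr a1, cycle`, so one possible cause is that the cycle counter is not readable from user mode in this setup (for example, the counter is not enabled for lower privilege levels, or the simulator's ISA string does not include the counter extension). As a hedged workaround sketch, a timer helper could fall back to the OS clock when the CSR cannot be used. `BENCH_USE_CLOCK` and `bench_ticks` are hypothetical names, not options of rvv-bench, and whether `clock_gettime` is available under pk is not confirmed here.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Returns a monotonically increasing tick count.
 * On RISC-V (RV64 assumed) it reads the cycle CSR unless the hypothetical
 * BENCH_USE_CLOCK switch is defined; elsewhere it uses nanoseconds from
 * clock_gettime(CLOCK_MONOTONIC). */
static inline uint64_t bench_ticks(void) {
#if defined(__riscv) && !defined(BENCH_USE_CLOCK)
  unsigned long c;
  /* Traps with an illegal-instruction exception if the counter
   * is not accessible at the current privilege level. */
  __asm__ volatile("csrr %0, cycle" : "=r"(c));
  return (uint64_t)c;
#else
  struct timespec ts;
  int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
  assert(rc == 0);
  (void)rc;
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}
```

The two sources measure different things (cycles versus wall-clock nanoseconds), so results taken with the fallback are not directly comparable to cycle counts.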

work together to benchmark the K230?

I think we are in the same timezone? I received mine yesterday. Feel free to DM me on Fediverse (link in my profile) or michael.crusoe@fu-berlin.de

Question about RVV instruction throughput

Hi, I saw each RVV instruction throughput result here:
https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html

If I want to test the execution throughput of each RVV instruction on another RISC-V board, could you give me some guidance?

I would also like to know how you measure the execution throughput.

Thanks,

Benchmarks errors - byteswap and mergelines

Hello, I noticed errors in two benchmarks.
I'm using Clang 17 and QEMU user-mode v8.2.2.

Byteswap works with vlen=128, but fails with vlen=512

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=128,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on byteswap
title: "byteswap32",
labels: ["0","scalar","scalar_autovec","SWAR_rev8","rvv_gather_m1","rvv_gather_m2","rvv_gather_m4","rvv_gather_m8","rvv_m1_gathers_m2","rvv_m1_gathers_m4","rvv_m1_gathers_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,168,195,225,260,300,345,397,456,524,601,689,790,905,],
[0.0046500,0.0165016,0.0266615,0.0376712,0.0468530,0.0555709,0.0631552,0.0694055,0.0767754,0.0821501,0.0878033,0.0923689,0.0972099,0.1010213,0.1019727,0.1057261,0.1086183,0.1115167,0.1137125,0.1153668,0.1175035,0.1192748,0.1211652,0.1222046,0.1239340,0.1251552,0.1258427,0.1256531,0.1279714,0.1278302,],
[0.0014713,0.0026027,0.0030008,0.0038378,0.0037826,0.0042062,0.0046242,0.0043641,0.0044947,0.0045521,0.0046614,0.0048028,0.0048593,0.0048084,0.0049773,0.0048881,0.0050663,0.0049632,0.0050065,0.0050328,0.0050012,0.0050286,0.0050001,0.0050195,0.0049937,0.0049708,0.0050973,0.0050261,0.0050519,0.0049986,],
[0.0049825,0.0173197,0.0304546,0.0428432,0.0559805,0.0669568,0.0754147,0.0847689,0.0916214,0.0970464,0.1064653,0.1108458,0.1155115,0.1173663,0.1197001,0.1215657,0.1259664,0.1288047,0.1309339,0.1355176,0.1373662,0.1383923,0.1427802,0.1434300,0.1458803,0.1467129,0.1455434,0.1488442,0.1468497,0.1478202,],
[0.0014785,0.0044116,0.0062943,0.0053854,0.0065697,0.0069725,0.0072769,0.0080157,0.0083903,0.0087342,0.0090036,0.0091369,0.0094278,0.0096013,0.0097712,0.0100587,0.0100582,0.0107297,0.0104014,0.0104191,0.0105666,0.0105055,0.0107629,0.0107234,0.0108763,0.0108005,0.0108411,0.0108356,0.0109395,0.0108678,],
[0.0013551,0.0040066,0.0058245,0.0075257,0.0084834,0.0071947,0.0079648,0.0083950,0.0086504,0.0091219,0.0095271,0.0097441,0.0102191,0.0105451,0.0106993,0.0108772,0.0109936,0.0111641,0.0112588,0.0114477,0.0116144,0.0117276,0.0118299,0.0118097,0.0119177,0.0120204,0.0119781,0.0120790,0.0121374,0.0122073,],
[0.0011696,0.0036005,0.0053128,0.0069343,0.0078912,0.0091112,0.0097462,0.0105503,0.0092220,0.0095850,0.0103388,0.0100930,0.0104965,0.0110938,0.0112841,0.0117006,0.0117801,0.0119972,0.0120910,0.0123152,0.0124819,0.0125629,0.0127869,0.0128397,0.0129573,0.0130012,0.0130398,0.0131275,0.0131873,0.0132576,],
[0.0003554,0.0013160,0.0021040,0.0030942,0.0038266,0.0047222,0.0055169,0.0062209,0.0070523,0.0078089,0.0084536,0.0082295,0.0086859,0.0092035,0.0098196,0.0103544,0.0104557,0.0108863,0.0112735,0.0115949,0.0118787,0.0121087,0.0123095,0.0125833,0.0128137,0.0129303,0.0130291,0.0132421,0.0132818,0.0134574,],
[0.0012097,0.0038954,0.0059402,0.0077870,0.0087719,0.0073405,0.0083107,0.0089218,0.0087027,0.0093048,0.0094767,0.0096255,0.0102275,0.0105066,0.0107003,0.0108150,0.0108046,0.0111121,0.0111394,0.0112618,0.0113717,0.0114867,0.0116322,0.0116001,0.0117121,0.0118493,0.0118084,0.0117919,0.0118795,0.0119491,],
[0.0009552,0.0031055,0.0048302,0.0065824,0.0075910,0.0089389,0.0099298,0.0109254,0.0089792,0.0096450,0.0103281,0.0099143,0.0105348,0.0112607,0.0112215,0.0117596,0.0116259,0.0117241,0.0118388,0.0118858,0.0121052,0.0122979,0.0125464,0.0125380,0.0125672,0.0127383,0.0127788,0.0128276,0.0129405,0.0129474,],
[0.0006662,0.0023483,0.0037179,0.0052322,0.0062513,0.0075045,0.0084042,0.0094338,0.0104652,0.0112006,0.0121532,0.0098240,0.0104670,0.0111270,0.0118574,0.0124812,0.0116400,0.0123124,0.0120734,0.0125895,0.0124900,0.0129648,0.0129154,0.0129544,0.0130211,0.0131163,0.0132427,0.0133981,0.0133794,0.0134595,],
]
}

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=512,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on byteswap
{
title: "byteswap32",
labels: ["0","scalar","scalar_autovec","SWAR_rev8","rvv_gather_m1","rvv_gather_m2","rvv_gather_m4","rvv_gather_m8","rvv_m1_gathers_m2","rvv_m1_gathers_m4","rvv_m1_gathers_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,168,195,225,260,300,345,397,456,524,601,689,790,905,],
[0.0046232,0.0168279,0.0268096,0.0379506,0.0470662,0.0558815,0.0634759,0.0702072,0.0775668,0.0827635,0.0885739,0.0931699,0.0978647,0.1009484,0.1025591,0.1051613,0.1078708,0.1106209,0.1133490,0.1154349,0.1173523,0.1188730,0.1212525,0.1226179,0.1244558,0.1254218,0.1249857,0.1270209,0.1287231,0.1269480,],
[0.0014588,0.0025883,0.0030218,0.0038137,0.0036593,0.0041891,0.0045090,0.0043190,0.0044974,0.0046147,0.0046874,0.0048142,0.0047680,0.0049211,0.0048735,0.0048466,0.0049891,0.0049948,0.0050898,0.0049800,0.0050794,0.0049472,0.0049827,0.0051025,0.0049873,0.0050706,0.0050288,0.0049810,0.0049940,0.0050172,],
[0.0050505,0.0189259,0.0317676,0.0443906,0.0556173,0.0658761,0.0750750,0.0853406,0.0906596,0.0972207,0.1059424,0.1118953,0.1166136,0.1203942,0.1146101,0.1224769,0.1248546,0.1278587,0.1348547,0.1250243,0.1375479,0.1412761,0.1405266,0.1403893,0.1442034,0.1423313,0.1468252,0.1395810,0.1498383,0.1492131,],
[0.0013756,0.0040914,0.0059164,0.0077383,0.0085336,0.0072066,0.0080392,0.0085229,0.0086000,0.0090740,0.0095342,0.0097117,0.0103074,0.0104799,0.0106513,0.0107848,0.0108269,0.0111189,0.0112184,0.0113854,0.0114687,0.0115362,0.0117546,0.0118154,0.0118665,0.0119207,0.0118742,0.0118940,0.0120250,0.0120864,],
[0.0011532,0.0034171,0.0051336,0.0069887,0.0078001,0.0089849,0.0098609,0.0105338,0.0091402,0.0095520,0.0101833,0.0101041,0.0104680,0.0109363,0.0113258,0.0115639,0.0117853,0.0119617,0.0120667,0.0122801,0.0123838,0.0125980,0.0126651,0.0128561,0.0129542,0.0130202,0.0129671,0.0129660,0.0131470,0.0132038,],
[0.0003504,0.0012902,0.0021331,0.0031063,0.0038752,0.0047633,0.0055105,0.0063310,0.0071059,0.0078610,0.0085404,0.0082978,0.0088599,0.0093152,0.0099135,0.0104550,0.0106189,0.0110978,0.0114681,0.0117524,0.0120230,0.0123103,0.0125244,0.0127897,0.0129674,0.0131182,0.0132822,0.0134193,0.0135214,0.0136688,],
[0.0002018,0.0007586,0.0012931,0.0019323,0.0024856,0.0031519,0.0037317,0.0044095,0.0050598,0.0057171,0.0064036,ERROR: rvv_gather_m8 in byteswap32 at 65

mergelines 2/3 fails after latest commit - "don't do e8 mf8 operations on Zve32x"

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=128,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on mergelines
{
title: "mergelines 2/3",
labels: ["0","scalar","rvv_vslide_m1","rvv_vslide_m2","rvv_vslide_m4","rvv_vslide_m8","rvv_vslide_skip_m1","rvv_vslide_skip_m2","rvv_vslide_skip_m4","rvv_vslide_skip_m8","rvv_mshift_m1","rvv_mshift_m2","rvv_mshift_m4","rvv_mshift_m8","rvv_mshift_skip_m1","rvv_mshift_skip_m2","rvv_mshift_skip_m4","rvv_mshift_skip_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,],
[0.0047551,0.0134748,0.0185185,0.0251055,0.0301537,0.0335598,0.0359453,0.0384973,0.0412013,0.0444315,0.0447318,0.0456444,0.0465439,0.0490486,0.0487127,0.0510381,0.0510096,],
[0.0047281,0.0006678,0.0011083,0.0016468,0.0021347,0.0015323,0.0018439,0.0022009,0.0018855,0.0022044,0.0020336,0.0023467,0.0022326,0.0022193,0.0022440,0.0022822,0.0023522,],
[0.0046544,0.0006635,0.0010994,0.0016342,0.0021172,0.0026722,0.0031885,0.0037434,0.0026113,0.0030026,0.0034305,0.0039428,0.0033048,0.0037234,0.0034429,0.0038392,0.0036572,],
[0.0047370,0.0006642,0.0010978,0.0016327,0.0021207,0.0026706,0.0031768,0.0037392,0.0042681,0.0048114,0.0052925,0.0058214,0.0042709,0.0047567,0.0052225,0.0056493,0.0049711,],
[0.0047003,0.0006641,0.0011000,0.0016393,0.0021290,0.0026776,0.0031858,0.0037369,0.0042869,0.0047973,0.0052971,0.0057497,0.0063244,0.0069388,0.0074540,0.0079079,0.0062214,],
[0.0041186,0.0011234,0.0012765,0.0016857,0.0020868,0.0017808,0.0018895,0.0021860,0.0020133,0.0021946,0.0021100,0.0023387,0.0022261,0.0022489,0.0022427,0.0022843,0.0023447,],
[0.0045787,0.0009839,0.0012597,0.0016954,0.0020838,0.0026026,0.0031062,0.0036344,0.0028346,0.0029712,0.0033538,0.0038688,0.0033025,0.0036474,0.0033883,0.0037620,0.0035885,],
[0.0047472,0.0009467,0.0012836,0.0016210,0.0020907,0.0026206,0.0031297,0.0036581,0.0041799,0.0046632,0.0051307,0.0054933,0.0042561,0.0046345,0.0050946,0.0055013,0.0048412,],
[0.0047950,0.0010649,0.0014533,0.0018101,0.0021539,0.0026203,0.0031245,0.0036500,0.0041553,0.0046539,0.0051120,0.0055509,0.0060875,0.0066366,0.0071343,0.0075295,0.0060216,],
[0.0044603,0.0009954,0.0016426,0.0023894,0.0030604,0.0022044,0.0026296,0.0030785,0.0027111,0.0031162,0.0029306,0.0033595,0.0032028,0.0032000,0.0032369,0.0032999,0.0033976,],
[0.0047517,0.0010020,0.0016359,0.0023997,0.0030359,ERROR: rvv_mshift_m2 in mergelines 2/3 at 20(

EDIT: I've now seen the comment saying that byteswap only works on <=256 bits.
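
A plausible explanation for that limit, assuming the rvv_gather variants use vrgather.vv with 8-bit indices (an assumption; the report does not say): an 8-bit index can only name elements 0..255, so the gather cannot reach every byte once VLMAX exceeds 256. The sketch below just does that arithmetic; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* VLMAX = VLEN / SEW * LMUL, for integer LMUL. */
static unsigned vlmax(unsigned vlen_bits, unsigned lmul, unsigned sew_bits) {
  assert(sew_bits != 0 && vlen_bits % sew_bits == 0);
  return vlen_bits / sew_bits * lmul;
}

/* With SEW=8, gather indices are 8 bits wide and address elements 0..255,
 * so every element is reachable only while VLMAX <= 256. */
static bool e8_gather_reaches_all(unsigned vlen_bits, unsigned lmul) {
  return vlmax(vlen_bits, lmul, 8) <= 256;
}
```

This is consistent with the logs above: at vlen=512 the m4 variant (VLMAX 256) completes, while m8 (VLMAX 512) fails, and at vlen=128 or 256 even m8 stays within 256 elements.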

how to measure vector load performance

Hi,
I find your benchmark very valuable. Do you have any ideas or suggestions for testing the performance (throughput or latency) of the various vector load instructions? I would like to explore vector load performance on the K1 and K230.

Thanks

Add benches for strided load/store with different strides

I just found an issue on the K230 while running some auto-vectorization tests on https://github.com/UoB-HPC/TSVC_2.
The vectorized s1115 loop looks like this:

.LBB9_7:                                # %vector.ph
    andi    a6, s6, 256
    vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                                # %vector.body
    vl2re32.v   v8, (a4)
    vlse32.v    v10, (a5), s11          # s11 = 1024
    vl2re32.v   v12, (a2)
    vfmacc.vv   v12, v8, v10
    vs2r.v  v12, (a4)
    add a4, a4, s0
    add a2, a2, s0
    sub a3, a3, s9
    add a5, a5, s2
    bnez    a3, .LBB9_8

It seems that strided loads/stores with strides in [1024, 4096] perform worse.
A simple probe program:

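#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Defines vlse_<LMUL>(): one strided byte load (vlse8.v) at the given LMUL.
 * The stride is in bytes. The load also overwrites the vector register group
 * starting at v0, which is not declared as clobbered here. */
#define DEFINE_VLSE(LMUL)                                                      \
  static inline __attribute__((always_inline)) void vlse_##LMUL(int *base,     \
                                                                int stride) {  \
    __asm__ volatile("vsetvli t0, zero, e8, " #LMUL ", ta, ma\n"               \
                     "vlse8.v v0, (%0), %1"                                    \
                     :                                                         \
                     : "r"(base), "r"(stride)                                  \
                     : "t0", "memory");                                        \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s <stride> <times>\n", argv[0]);
    return 1;
  }
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);
  if (stride <= 0 || times <= 0) {
    fprintf(stderr, "stride and times must be positive\n");
    return 1;
  }

  // __attribute__((aligned(64)))
  int data[64 * stride];

/* Times `times` back-to-back loads and prints the elapsed clock() ticks. */
#define BENCH_VLSE(LMUL)                                                       \
  {                                                                            \
    clock_t start = clock();                                                   \
    for (int i = 0; i < times; i++)                                            \
      vlse_##LMUL(data, stride);                                               \
    clock_t end = clock();                                                     \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,                \
           (long)(end - start));                                               \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
  return 0;
}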

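The results are as follows (rows: stride, columns: LMUL, values: elapsed clock() ticks). The abnormal results are the ones that jump sharply relative to the small-stride baseline: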

stride	MF8	MF4	MF2	M1	M2	M4	M8
4	38479	51332	76931	128148	230645	435399	844990
8	38521	51333	76922	128128	230579	435395	844891
16	38530	51323	76962	128129	230566	435341	845195
32	38511	51373	76932	128150	230656	435388	845083
64	38529	51322	76947	128205	230624	435417	23954097
128	38517	51338	76926	128128	230608	12351222	31148420
256	38487	51288	76945	128152	5824701	15177587	34006290
512	38526	51292	76943	2855170	7439032	16828930	35689412
1024	38511	51324	1152269	3424329	7957662	17053724	35144136
2048	38520	224200	709725	1396708	4226251	8330476	16689498
4096	38507	317053	640199	1507778	3093916	6358825	12725241
8192	38499	51349	76956	128285	1255252	2483829	4943195
16384	38525	51329	76975	128337	1255245	2484334	4975494

It's odd that performance improves again once the stride exceeds 4096, so this issue may not be related to crossing cache lines or pages. It may be a hardware prefetcher issue.

So my request is: can we add some benchmarks for this kind of scenario?
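
As a small aid for reading the table, here is a sketch that flags entries that are much slower than the smallest-stride entry of the same column. The helper name `flag_slow` and the 2x threshold are arbitrary choices for illustration; the data is the M1 column copied from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Sets flags[i] to 1 when ticks[i] exceeds `factor` times ticks[0]
 * (the smallest-stride entry), else 0. Returns the number of flagged entries. */
static size_t flag_slow(const long *ticks, size_t n, long factor, int *flags) {
  assert(n > 0 && factor > 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++) {
    flags[i] = ticks[i] > factor * ticks[0];
    count += (size_t)flags[i];
  }
  return count;
}

/* M1 column of the table above, for strides 4, 8, ..., 16384. */
static const long m1_ticks[13] = {128148, 128128,  128129,  128150,  128205,
                                  128128, 128152,  2855170, 3424329, 1396708,
                                  1507778, 128285, 128337};
```

Calling `flag_slow(m1_ticks, 13, 2, flags)` marks the four entries for strides 512 through 4096, which is the slow band for M1; other columns can be checked the same way.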

RVV Poly1305 Error

Hello, really interesting benchmarks here!
I encountered an error when running the poly1305 vector benchmark:

I'm using GCC 13.1 and QEMU v8.0.

Edit: I just noticed that the rvv-chacha-poly submodule points to an older commit; changing it to the latest one works.

Bare metal support?

We are working on an open-source multi-lane RVV implementation for the HPC market, https://github.com/chipsalliance/t1, with intensive chaining support to provide as much memory bandwidth as possible. We don’t support an MMU or coherency for now, so there is no Linux support.
We would like to contribute bare-metal support to this repo. Would that be acceptable?

Adding more instructions like scalar?

Hello @camel-cdr,

I recently started using your benchmarks to measure RVV performance. They're very easy to use, and I really appreciate your work :)

I'd also like to add some tests for scalar instructions to evaluate their performance. Could I contribute them back upstream? Since this repository is called rvv-bench, I don't know whether you'd like to keep it specific to RVV.

I'd also like to add a different but simple pattern that creates a strong dependency, like:

add t0, t0, t0

Let me know your thoughts! Thanks.

Zen
