camel-cdr / rvv-bench

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code

License: MIT License

Makefile 1.33% C 44.83% Assembly 53.34% Shell 0.20% C++ 0.31%
benchmark risc-v rvv

rvv-bench's People

Contributors

camel-cdr, furuame, omaghiarimg

rvv-bench's Issues

Testing on Canaan K230

Hi, I really like your project. I'm writing some benchmarks myself to test the RVV capabilities of the K230 for a project.

First of all, I just wanted to compile the memcpy example provided by the spec. When I run it following the instructions in the K230 docs, for some reason it doesn't find the functions, even though I'm using the riscv64 option of the cross-compiler and the v flag too.

This is the error for more context:
[image: screenshot of the error, not preserved]

Latency vs throughput

The current benchmarks mix latency and throughput measurements. For example, vadd.vv v8,v16,v24 measures throughput, but vadd.vv v8,v16,v24,v0.t measures latency, because of the dependency chain through v8. It would be useful to have separate tests for throughput and latency, especially for the reductions. This is by no means trivial, though: measuring throughput for destructive instructions means cycling the output registers, and doing that well depends on LMUL, while latency tests need a manual choice of which operand(s) to test.

(also, a side-note - the Sipeed results page says it has a 256-bit ALU, but from the benchmarks it seems like a 128-bit ALU?)
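As a rough illustration of the register-cycling idea, the sketch below generates the text of a vadd.vv sequence in either style. The helper name emit_vadd and its register allocation (group 0 skipped because v0 holds masks, last group used as the only source) are assumptions made for this example, not something taken from rvv-bench.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `count` vadd.vv lines into `buf` and returns the
   number of characters written (excluding the terminating NUL), or -1 if the
   buffer is too small.
   dependent = 1: every instruction reads the previous result (latency style).
   dependent = 0: destinations rotate over LMUL-aligned register groups while
                  the source group stays fixed (throughput style). */
static int emit_vadd(char *buf, size_t cap, int lmul, int count, int dependent)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    int groups = 32 / lmul;          /* LMUL-aligned groups in v0..v31 */
    int src = (groups - 1) * lmul;   /* last group is the fixed source */
    int ndst = groups - 2;           /* skip group 0 (masks) and the source */
    size_t used = 0;

    for (int i = 0; i < count; i++) {
        int dst = dependent ? lmul : (1 + i % ndst) * lmul;
        int in = dependent ? dst : src;
        int n = snprintf(buf + used, cap - used,
                         "vadd.vv v%d, v%d, v%d\n", dst, in, in);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;               /* output buffer too small */
        used += (size_t)n;
    }
    return (int)used;
}
```

At LMUL=8 only two destination groups (v8 and v16) remain in this scheme, which shows why the rotation has to be chosen per LMUL.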

Missing valid instructions

  • Widening reductions (vfwredosum.vs, vfwredusum.vs, vwredsum.vs, vwredsumu.vs) should allow LMUL=8
  • vrgatherei16.vv should only disallow LMUL=8 for e8
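A minimal sketch of the two legality rules implied above, as I read the vector spec: vrgatherei16.vv needs its index EMUL, (16 / SEW) * LMUL, to stay within [1/8, 8], and a widening reduction only needs 2 * SEW to fit in ELEN because its scalar operand and result each occupy a single register. The function names are made up for this example, and other constraints (for instance on fractional LMUL) are deliberately left out.

```c
#include <assert.h>

/* LMUL is passed in eighths so fractional values stay integral:
   mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8, m2 = 16, m4 = 32, m8 = 64. */

/* vrgatherei16.vv: index EMUL = (16 / SEW) * LMUL must lie in [1/8, 8]. */
static int vrgatherei16_emul_ok(int sew, int lmul8)
{
    assert(sew == 8 || sew == 16 || sew == 32 || sew == 64);
    int emul8_times_sew = 16 * lmul8;  /* index EMUL in eighths, scaled by SEW */
    return emul8_times_sew >= sew && emul8_times_sew <= 64 * sew;
}

/* Widening reductions: only the source group uses LMUL, so the constraint
   modelled here is just 2 * SEW <= ELEN. */
static int widening_reduction_ok(int sew, int lmul8, int elen)
{
    (void)lmul8;  /* any supported LMUL, including m8, passes this check */
    return 2 * sew <= elen;
}
```

With these checks, vrgatherei16_emul_ok(8, 64) is false (e8 at m8 would need an index EMUL of 16), while vrgatherei16_emul_ok(16, 64) is true.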

Issue with 'illegal instruction' when using bench with spike

Hello,
I am trying to run the benchmarks on spike. After running 'make all', I hit the following problem when executing the generated executable with spike:

li@h107:~/rvv-bench/bench$ ~/tools/riscv-isa-sim/build/spike --isa=rv64gcv1p0 -l --log-commits --log="memcpy.spike" `which pk` memcpy
bbl loader
z  0000000000000000 ra 0000000000000000 sp 0000003ffffffb40 gp 0000000000000000
tp 0000000000000000 t0 0000000000000000 t1 0000000000000000 t2 0000000000000000
s0 0000000000000000 s1 0000000000000000 a0 0000000000014048 a1 0000000000014000
a2 0000000000000000 a3 0000000000000000 a4 0000000000000000 a5 0000000000000000
a6 0000000000000000 a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
pc 000000000001134c va/inst 00000000c00025f3 sr 8000000200006620
An illegal instruction was executed!

The relevant portion in the log file is as follows:

   core   0: 0x000000000001134c (0xc00025f3) csrr    a1, cycle
   core   0: exception trap_illegal_instruction, epc 0x000000000001134c
   core   0:           tval 0x00000000c00025f3
   core   0: >>>>  trap_vector

What could be the cause of this issue, and do you have any suggestions for resolving it?
By the way, I'm using the following version of the clang compiler:

li@h107:~/rvv-bench/bench$ clang -v
clang version 15.0.0 (https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi.git 142ea58f56d9622cc03d43e6ecffd9634d801546)
Target: riscv64-unknown-linux-gnu
Thread model: posix
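To make the trap in the log easier to read, here is a small sketch that splits the faulting word 0xc00025f3 into its CSR-instruction fields. The struct and function names are invented for this example. It only shows that the instruction is a read of CSR 0xC00 (cycle) into x11 (a1); that the trap comes from user-mode access to the cycle counter not being enabled in this spike/pk setup is a plausible guess, not something the log proves.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an I-type SYSTEM instruction (CSR access). */
struct csr_insn {
    unsigned opcode;  /* bits 6:0, 0x73 for SYSTEM    */
    unsigned rd;      /* bits 11:7                    */
    unsigned funct3;  /* bits 14:12, 2 means CSRRS    */
    unsigned rs1;     /* bits 19:15                   */
    unsigned csr;     /* bits 31:20, 0xC00 is `cycle` */
};

/* Hypothetical helper: plain bit-field extraction, no validation. */
static struct csr_insn decode_csr_insn(uint32_t insn)
{
    struct csr_insn d;
    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7) & 0x1f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;
    d.csr    = (insn >> 20) & 0xfff;
    return d;
}
```

For 0xc00025f3 this yields opcode 0x73, funct3 2, rs1 0, rd 11 and csr 0xc00, which matches the `csrr a1, cycle` disassembly in the log.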

work together to benchmark the K230?

I think we are in the same timezone? I received mine yesterday. Feel free to DM me on Fediverse (link in my profile) or michael.crusoe@fu-berlin.de

Benchmarks errors - byteswap and mergelines

Hello, I noticed errors in two benchmarks.
I'm using Clang 17 and QEMU user-mode v8.2.2.

Byteswap works with vlen=128, but fails with vlen=512

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=128,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on byteswap
title: "byteswap32",
labels: ["0","scalar","scalar_autovec","SWAR_rev8","rvv_gather_m1","rvv_gather_m2","rvv_gather_m4","rvv_gather_m8","rvv_m1_gathers_m2","rvv_m1_gathers_m4","rvv_m1_gathers_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,168,195,225,260,300,345,397,456,524,601,689,790,905,],
[0.0046500,0.0165016,0.0266615,0.0376712,0.0468530,0.0555709,0.0631552,0.0694055,0.0767754,0.0821501,0.0878033,0.0923689,0.0972099,0.1010213,0.1019727,0.1057261,0.1086183,0.1115167,0.1137125,0.1153668,0.1175035,0.1192748,0.1211652,0.1222046,0.1239340,0.1251552,0.1258427,0.1256531,0.1279714,0.1278302,],
[0.0014713,0.0026027,0.0030008,0.0038378,0.0037826,0.0042062,0.0046242,0.0043641,0.0044947,0.0045521,0.0046614,0.0048028,0.0048593,0.0048084,0.0049773,0.0048881,0.0050663,0.0049632,0.0050065,0.0050328,0.0050012,0.0050286,0.0050001,0.0050195,0.0049937,0.0049708,0.0050973,0.0050261,0.0050519,0.0049986,],
[0.0049825,0.0173197,0.0304546,0.0428432,0.0559805,0.0669568,0.0754147,0.0847689,0.0916214,0.0970464,0.1064653,0.1108458,0.1155115,0.1173663,0.1197001,0.1215657,0.1259664,0.1288047,0.1309339,0.1355176,0.1373662,0.1383923,0.1427802,0.1434300,0.1458803,0.1467129,0.1455434,0.1488442,0.1468497,0.1478202,],
[0.0014785,0.0044116,0.0062943,0.0053854,0.0065697,0.0069725,0.0072769,0.0080157,0.0083903,0.0087342,0.0090036,0.0091369,0.0094278,0.0096013,0.0097712,0.0100587,0.0100582,0.0107297,0.0104014,0.0104191,0.0105666,0.0105055,0.0107629,0.0107234,0.0108763,0.0108005,0.0108411,0.0108356,0.0109395,0.0108678,],
[0.0013551,0.0040066,0.0058245,0.0075257,0.0084834,0.0071947,0.0079648,0.0083950,0.0086504,0.0091219,0.0095271,0.0097441,0.0102191,0.0105451,0.0106993,0.0108772,0.0109936,0.0111641,0.0112588,0.0114477,0.0116144,0.0117276,0.0118299,0.0118097,0.0119177,0.0120204,0.0119781,0.0120790,0.0121374,0.0122073,],
[0.0011696,0.0036005,0.0053128,0.0069343,0.0078912,0.0091112,0.0097462,0.0105503,0.0092220,0.0095850,0.0103388,0.0100930,0.0104965,0.0110938,0.0112841,0.0117006,0.0117801,0.0119972,0.0120910,0.0123152,0.0124819,0.0125629,0.0127869,0.0128397,0.0129573,0.0130012,0.0130398,0.0131275,0.0131873,0.0132576,],
[0.0003554,0.0013160,0.0021040,0.0030942,0.0038266,0.0047222,0.0055169,0.0062209,0.0070523,0.0078089,0.0084536,0.0082295,0.0086859,0.0092035,0.0098196,0.0103544,0.0104557,0.0108863,0.0112735,0.0115949,0.0118787,0.0121087,0.0123095,0.0125833,0.0128137,0.0129303,0.0130291,0.0132421,0.0132818,0.0134574,],
[0.0012097,0.0038954,0.0059402,0.0077870,0.0087719,0.0073405,0.0083107,0.0089218,0.0087027,0.0093048,0.0094767,0.0096255,0.0102275,0.0105066,0.0107003,0.0108150,0.0108046,0.0111121,0.0111394,0.0112618,0.0113717,0.0114867,0.0116322,0.0116001,0.0117121,0.0118493,0.0118084,0.0117919,0.0118795,0.0119491,],
[0.0009552,0.0031055,0.0048302,0.0065824,0.0075910,0.0089389,0.0099298,0.0109254,0.0089792,0.0096450,0.0103281,0.0099143,0.0105348,0.0112607,0.0112215,0.0117596,0.0116259,0.0117241,0.0118388,0.0118858,0.0121052,0.0122979,0.0125464,0.0125380,0.0125672,0.0127383,0.0127788,0.0128276,0.0129405,0.0129474,],
[0.0006662,0.0023483,0.0037179,0.0052322,0.0062513,0.0075045,0.0084042,0.0094338,0.0104652,0.0112006,0.0121532,0.0098240,0.0104670,0.0111270,0.0118574,0.0124812,0.0116400,0.0123124,0.0120734,0.0125895,0.0124900,0.0129648,0.0129154,0.0129544,0.0130211,0.0131163,0.0132427,0.0133981,0.0133794,0.0134595,],
]
}

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=512,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on byteswap
{
title: "byteswap32",
labels: ["0","scalar","scalar_autovec","SWAR_rev8","rvv_gather_m1","rvv_gather_m2","rvv_gather_m4","rvv_gather_m8","rvv_m1_gathers_m2","rvv_m1_gathers_m4","rvv_m1_gathers_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,168,195,225,260,300,345,397,456,524,601,689,790,905,],
[0.0046232,0.0168279,0.0268096,0.0379506,0.0470662,0.0558815,0.0634759,0.0702072,0.0775668,0.0827635,0.0885739,0.0931699,0.0978647,0.1009484,0.1025591,0.1051613,0.1078708,0.1106209,0.1133490,0.1154349,0.1173523,0.1188730,0.1212525,0.1226179,0.1244558,0.1254218,0.1249857,0.1270209,0.1287231,0.1269480,],
[0.0014588,0.0025883,0.0030218,0.0038137,0.0036593,0.0041891,0.0045090,0.0043190,0.0044974,0.0046147,0.0046874,0.0048142,0.0047680,0.0049211,0.0048735,0.0048466,0.0049891,0.0049948,0.0050898,0.0049800,0.0050794,0.0049472,0.0049827,0.0051025,0.0049873,0.0050706,0.0050288,0.0049810,0.0049940,0.0050172,],
[0.0050505,0.0189259,0.0317676,0.0443906,0.0556173,0.0658761,0.0750750,0.0853406,0.0906596,0.0972207,0.1059424,0.1118953,0.1166136,0.1203942,0.1146101,0.1224769,0.1248546,0.1278587,0.1348547,0.1250243,0.1375479,0.1412761,0.1405266,0.1403893,0.1442034,0.1423313,0.1468252,0.1395810,0.1498383,0.1492131,],
[0.0013756,0.0040914,0.0059164,0.0077383,0.0085336,0.0072066,0.0080392,0.0085229,0.0086000,0.0090740,0.0095342,0.0097117,0.0103074,0.0104799,0.0106513,0.0107848,0.0108269,0.0111189,0.0112184,0.0113854,0.0114687,0.0115362,0.0117546,0.0118154,0.0118665,0.0119207,0.0118742,0.0118940,0.0120250,0.0120864,],
[0.0011532,0.0034171,0.0051336,0.0069887,0.0078001,0.0089849,0.0098609,0.0105338,0.0091402,0.0095520,0.0101833,0.0101041,0.0104680,0.0109363,0.0113258,0.0115639,0.0117853,0.0119617,0.0120667,0.0122801,0.0123838,0.0125980,0.0126651,0.0128561,0.0129542,0.0130202,0.0129671,0.0129660,0.0131470,0.0132038,],
[0.0003504,0.0012902,0.0021331,0.0031063,0.0038752,0.0047633,0.0055105,0.0063310,0.0071059,0.0078610,0.0085404,0.0082978,0.0088599,0.0093152,0.0099135,0.0104550,0.0106189,0.0110978,0.0114681,0.0117524,0.0120230,0.0123103,0.0125244,0.0127897,0.0129674,0.0131182,0.0132822,0.0134193,0.0135214,0.0136688,],
[0.0002018,0.0007586,0.0012931,0.0019323,0.0024856,0.0031519,0.0037317,0.0044095,0.0050598,0.0057171,0.0064036,ERROR: rvv_gather_m8 in byteswap32 at 65

mergelines 2/3 fails after latest commit - "don't do e8 mf8 operations on Zve32x"

$ qemu-riscv64 -L /path/sysroot -cpu rv64,v=true,vext_spec=v1.0,vlen=128,elen=64,rvv_ma_all_1s=on,rvv_ta_all_1s=on mergelines
{
title: "mergelines 2/3",
labels: ["0","scalar","rvv_vslide_m1","rvv_vslide_m2","rvv_vslide_m4","rvv_vslide_m8","rvv_vslide_skip_m1","rvv_vslide_skip_m2","rvv_vslide_skip_m4","rvv_vslide_skip_m8","rvv_mshift_m1","rvv_mshift_m2","rvv_mshift_m4","rvv_mshift_m8","rvv_mshift_skip_m1","rvv_mshift_skip_m2","rvv_mshift_skip_m4","rvv_mshift_skip_m8",],
data: [
[1,4,7,11,15,20,25,31,38,46,55,65,77,91,107,125,145,],
[0.0047551,0.0134748,0.0185185,0.0251055,0.0301537,0.0335598,0.0359453,0.0384973,0.0412013,0.0444315,0.0447318,0.0456444,0.0465439,0.0490486,0.0487127,0.0510381,0.0510096,],
[0.0047281,0.0006678,0.0011083,0.0016468,0.0021347,0.0015323,0.0018439,0.0022009,0.0018855,0.0022044,0.0020336,0.0023467,0.0022326,0.0022193,0.0022440,0.0022822,0.0023522,],
[0.0046544,0.0006635,0.0010994,0.0016342,0.0021172,0.0026722,0.0031885,0.0037434,0.0026113,0.0030026,0.0034305,0.0039428,0.0033048,0.0037234,0.0034429,0.0038392,0.0036572,],
[0.0047370,0.0006642,0.0010978,0.0016327,0.0021207,0.0026706,0.0031768,0.0037392,0.0042681,0.0048114,0.0052925,0.0058214,0.0042709,0.0047567,0.0052225,0.0056493,0.0049711,],
[0.0047003,0.0006641,0.0011000,0.0016393,0.0021290,0.0026776,0.0031858,0.0037369,0.0042869,0.0047973,0.0052971,0.0057497,0.0063244,0.0069388,0.0074540,0.0079079,0.0062214,],
[0.0041186,0.0011234,0.0012765,0.0016857,0.0020868,0.0017808,0.0018895,0.0021860,0.0020133,0.0021946,0.0021100,0.0023387,0.0022261,0.0022489,0.0022427,0.0022843,0.0023447,],
[0.0045787,0.0009839,0.0012597,0.0016954,0.0020838,0.0026026,0.0031062,0.0036344,0.0028346,0.0029712,0.0033538,0.0038688,0.0033025,0.0036474,0.0033883,0.0037620,0.0035885,],
[0.0047472,0.0009467,0.0012836,0.0016210,0.0020907,0.0026206,0.0031297,0.0036581,0.0041799,0.0046632,0.0051307,0.0054933,0.0042561,0.0046345,0.0050946,0.0055013,0.0048412,],
[0.0047950,0.0010649,0.0014533,0.0018101,0.0021539,0.0026203,0.0031245,0.0036500,0.0041553,0.0046539,0.0051120,0.0055509,0.0060875,0.0066366,0.0071343,0.0075295,0.0060216,],
[0.0044603,0.0009954,0.0016426,0.0023894,0.0030604,0.0022044,0.0026296,0.0030785,0.0027111,0.0031162,0.0029306,0.0033595,0.0032028,0.0032000,0.0032369,0.0032999,0.0033976,],
[0.0047517,0.0010020,0.0016359,0.0023997,0.0030359,ERROR: rvv_mshift_m2 in mergelines 2/3 at 20(

EDIT: I've since seen the comment saying that byteswap only works for VLEN <= 256 bits.
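One plausible reading of that limit, assuming the gather variants use 8-bit indices: an e8 index can only address elements 0..255, while VLMAX for e8 at m8 with VLEN=512 is 512. This would also fit the log above, where m4 still passes at VLEN=512 and m8 first fails at size 65 (65 * 4 = 260 bytes, if the size counts 32-bit elements). The helper names below are made up for this sketch.

```c
#include <assert.h>

/* VLMAX = VLEN * LMUL / SEW (integer LMUL only in this sketch). */
static int vlmax(int vlen_bits, int sew_bits, int lmul)
{
    assert(vlen_bits > 0 && sew_bits > 0 && lmul > 0);
    return vlen_bits * lmul / sew_bits;
}

/* An 8-bit index can only name elements 0..255, so an e8 gather can reach
   every element of the register group only if VLMAX <= 256. */
static int e8_indices_cover_group(int vlen_bits, int lmul)
{
    return vlmax(vlen_bits, 8, lmul) <= 256;
}
```

Here e8_indices_cover_group(256, 8) and e8_indices_cover_group(512, 4) hold, while e8_indices_cover_group(512, 8) does not.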

how to measure vector load performance

Hi,
I find your benchmark to be very valuable. Do you have any good ideas or suggestions for testing the performance (throughput or latency) of various vector load instructions? I would like to explore the vector load performance on the K1 and K230.

Thanks

Add benches for strided load/store with different strides

I just found an issue on the K230 while doing some auto-vectorization tests on https://github.com/UoB-HPC/TSVC_2.
The vectorized s1115 loop looks like this:

.LBB9_7:                                # %vector.ph
    andi    a6, s6, 256
    vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                                # %vector.body
    vl2re32.v   v8, (a4)
    vlse32.v    v10, (a5), s11          # s11 = 1024
    vl2re32.v   v12, (a2)
    vfmacc.vv   v12, v8, v10
    vs2r.v  v12, (a4)
    add a4, a4, s0
    add a2, a2, s0
    sub a3, a3, s9
    add a5, a5, s2
    bnez    a3, .LBB9_8

It seems that strided loads/stores with strides in the range [1024, 4096] perform worse.
A simple probe:

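#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Issue a single strided byte load (vlse8.v) at the given LMUL; the stride is in bytes.
// t0 is clobbered by vsetvli; the vector registers starting at v0 are overwritten too.
#define DEFINE_VLSE(LMUL)                                                      \
  static inline __attribute__((always_inline)) void vlse_##LMUL(int *base,     \
                                                                int stride) {  \
    __asm__ volatile("vsetvli t0, zero, e8, " #LMUL ", ta, ma\n\t"             \
                     "vlse8.v v0, (%0), %1"                                    \
                     :                                                         \
                     : "r"(base), "r"(stride)                                  \
                     : "t0", "memory");                                        \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s <stride> <times>\n", argv[0]);
    return 1;
  }
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);
  if (stride <= 0 || times <= 0)
    return 1;

  // __attribute__((aligned(64)))
  // Stack VLA of 64 * stride ints; its contents are never initialized.
  int data[64 * stride];

// Time `times` back-to-back loads and print the elapsed clock() ticks.
#define BENCH_VLSE(LMUL)                                                       \
  {                                                                            \
    clock_t start = clock();                                                   \
    for (int i = 0; i < times; i++)                                            \
      vlse_##LMUL(data, stride);                                               \
    clock_t end = clock();                                                     \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,                \
           (long)(end - start));                                               \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
  return 0;
}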

The results look like this (rows are the stride passed to the probe, columns the LMUL, values the elapsed clock() ticks; the abnormal results, highlighted in the original post, are the entries far above the rest of their column):

stride       MF8       MF4       MF2        M1        M2        M4        M8
     4     38479     51332     76931    128148    230645    435399    844990
     8     38521     51333     76922    128128    230579    435395    844891
    16     38530     51323     76962    128129    230566    435341    845195
    32     38511     51373     76932    128150    230656    435388    845083
    64     38529     51322     76947    128205    230624    435417  23954097
   128     38517     51338     76926    128128    230608  12351222  31148420
   256     38487     51288     76945    128152   5824701  15177587  34006290
   512     38526     51292     76943   2855170   7439032  16828930  35689412
  1024     38511     51324   1152269   3424329   7957662  17053724  35144136
  2048     38520    224200    709725   1396708   4226251   8330476  16689498
  4096     38507    317053    640199   1507778   3093916   6358825  12725241
  8192     38499     51349     76956    128285   1255252   2483829   4943195
 16384     38525     51329     76975    128337   1255245   2484334   4975494

It's odd that performance improves again once the stride exceeds 4096, so this issue may not be related to crossing cache lines or pages. It may instead be related to the hardware prefetcher.

So, my request is: could we add benchmarks for this kind of scenario?
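One way to look at the table is to compute VL * stride, the distance in bytes covered by a single load. The sketch below does that under the assumption that the K230's VLEN is 128 bits (this is not stated in the report); the function names are made up for the example. Under that assumption, the first slow entry in each of the MF4 to M8 columns falls at VL * stride = 8192 bytes, while MF8 never slows down, so this is an observation about the numbers rather than an explanation.

```c
#include <assert.h>

/* Elements per e8 load: VL = VLEN/8 * LMUL, with LMUL given in eighths
   (mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8, m2 = 16, m4 = 32, m8 = 64). */
static long vl_e8(int vlen_bits, int lmul8)
{
    assert(vlen_bits % 64 == 0 && lmul8 >= 1 && lmul8 <= 64);
    return (long)vlen_bits / 8 * lmul8 / 8;
}

/* VL * stride: roughly the byte distance covered by one strided load. */
static long span_bytes(int vlen_bits, int lmul8, long stride)
{
    return vl_e8(vlen_bits, lmul8) * stride;
}
```

For example, span_bytes(128, 64, 64) for m8 at stride 64 and span_bytes(128, 8, 512) for m1 at stride 512 both give 8192.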

RVV Poly1305 Error

Hello, really interesting benchmarks here!
I encountered an error when running the poly1305 vector benchmark:
[image: screenshot of the error, not preserved]
I'm using GCC 13.1 and QEMU v8.0.

Edit: just noticed the rvv-chacha-poly submodule points to an older commit, changing to latest works.

Bare metal support?

We are working on an open-source multi-lane RVV implementation for the HPC market, https://github.com/chipsalliance/t1, with intensive chaining support to provide as much memory bandwidth as possible. We don't support an MMU or coherency for now, so there is no Linux support.
We would like to add bare-metal support to this repo. Would that be acceptable?

Adding more instructions like scalar?

Hello @camel-cdr,

I recently started using your benchmarks to measure RVV performance. They are very easy to use, and I really appreciate your work :)

I'd also like to add some tests for scalar instructions to evaluate their performance. Could I contribute them back upstream? Since this repository is called rvv-bench, I don't know whether you'd like to keep it specific to RVV.

I'd also like to add a different but simple pattern that creates a strong dependency, like:

add t0, t0, t0

Let me know your thoughts! Thanks.

Zen
