GithubHelp home page GithubHelp logo

hengwang / gpumembench Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ekondis/gpumembench

0.0 2.0 0.0 47 KB

A GPU benchmark suite for assessing on-chip GPU memory bandwidth

License: GNU General Public License v2.0

Makefile 6.65% Cuda 30.58% C 18.88% C++ 33.91% Cool 9.98%

gpumembench's Introduction

gpumembench benchmark suite

In this repository a GPU benchmark tool is hosted regarding the evaluation of on-chip GPU memories from a memory bandwidth perspective. In particular, 3 benchmark tools are provided for the assessment of L1-L2-texture caches, shared memory and constant memory cache, respectively.

CUDA and OpenCL implementations are provided.

To build this tools use the provided Makefile. If it is needed you have to set the CUDA_INSTALL_PATH, OPENCL_INSTALL_PATH & OPENCL_LIBRARY_PATH variables in "common.mk" to point to the proper CUDA/OpenCL directories.

Execution results

Here some indicative output results of executions on a GTX-480 are provided:

An extract of the cachebench output follows:

CUDA cachebench (repeated memory cached operations microbenchmark)
------------------------ Device specifications ------------------------
Device:              GeForce GTX 480
CUDA driver version: 8.0
GPU clock rate:      1550 MHz
Memory clock rate:   950 MHz
Memory bus width:    384 bits
WarpSize:            32
L2 cache size:       768 KB
Total global mem:    1530 MB
ECC enabled:         No
Compute Capability:  2.0
Total SPs:           480 (15 MPs x 32 SPs/MP)
Compute throughput:  1488.00 GFlops (theoretical single precision FMAs)
Memory bandwidth:    182.40 GB/sec
-----------------------------------------------------------------------
Total GPU memory 1605042176, free 1481957376
Buffer size: 512MB
Whole cache hierarchy benchmark (L1 & L2 caches)

Read only benchmark
EXCEL header:
Element size,Grid size, Parameters,   , Data size,        ,Execution time,Instr.thr/put,Memory b/w, Ops/sec,Ops/cycle
     (bytes),(threads),(step),(idx/cl),(elements), (bytes),       (msecs),      (GIOPS),  (GB/sec),  (10^9),   per SM
           4,    23040,     1,     512,       512,    2048,         0.624,      302.707,  1210.827, 302.707,   13.020
           4,    23040,     1,    1024,      1024,    4096,         0.549,      343.840,  1375.362, 343.840,   14.789
           4,    23040,     1,    2048,      2048,    8192,         0.549,      343.840,  1375.362, 343.840,   14.789
           4,    23040,     1,    4096,      4096,   16384,         0.549,      343.820,  1375.282, 343.820,   14.788
           4,    23040,     1,    8192,      8192,   32768,         0.548,      344.141,  1376.566, 344.141,   14.802
           4,    23040,     1,       0,     23040,   92160,         0.547,      345.027,  1380.109, 345.027,   14.840
           4,    23040,     2,       0,     46080,  184320,         0.684,      275.851,  1103.403, 275.851,   11.865
           4,    23040,     3,       0,     69120,  276480,         0.949,      198.902,   795.608, 198.902,    8.555
           4,    23040,     4,       0,     92160,  368640,         0.960,      196.641,   786.563, 196.641,    8.458
           4,    23040,     5,       0,    115200,  460800,         1.993,       94.725,   378.900,  94.725,    4.074
           4,    23040,     6,       0,    138240,  552960,         2.758,       68.444,   273.776,  68.444,    2.944
           4,    23040,     7,       0,    161280,  645120,         2.793,       67.583,   270.332,  67.583,    2.907
           4,    23040,     8,       0,    184320,  737280,         2.875,       65.648,   262.590,  65.648,    2.824
           4,    23040,     9,       0,    207360,  829440,         3.107,       60.751,   243.006,  60.751,    2.613
           4,    23040,    10,       0,    230400,  921600,         3.316,       56.924,   227.696,  56.924,    2.448
           4,    23040,    11,       0,    253440, 1013760,         3.665,       51.501,   206.005,  51.501,    2.215
           4,    23040,    12,       0,    276480, 1105920,         4.146,       45.524,   182.096,  45.524,    1.958
           4,    23040,    13,       0,    299520, 1198080,         4.795,       39.364,   157.454,  39.364,    1.693
           4,    23040,    14,       0,    322560, 1290240,         4.455,       42.370,   169.481,  42.370,    1.822
           4,    23040,    15,       0,    345600, 1382400,         4.634,       40.733,   162.933,  40.733,    1.752
           4,    23040,    16,       0,    368640, 1474560,         4.979,       37.906,   151.624,  37.906,    1.630
           4,    23040,    18,       0,    414720, 1658880,         4.761,       39.646,   158.585,  39.646,    1.705
           4,    23040,    20,       0,    460800, 1843200,         4.626,       40.802,   163.206,  40.802,    1.755
           4,    23040,    22,       0,    506880, 2027520,         4.888,       38.615,   154.460,  38.615,    1.661
           4,    23040,    24,       0,    552960, 2211840,         4.882,       38.660,   154.641,  38.660,    1.663
           4,    23040,    28,       0,    645120, 2580480,         4.508,       41.871,   167.483,  41.871,    1.801
           4,    23040,    32,       0,    737280, 2949120,         4.883,       38.657,   154.628,  38.657,    1.663
           4,    23040,    40,       0,    921600, 3686400,         4.863,       38.813,   155.253,  38.813,    1.669
           4,    23040,    48,       0,   1105920, 4423680,         4.864,       38.807,   155.229,  38.807,    1.669
           4,    23040,    56,       0,   1290240, 5160960,         4.513,       41.818,   167.273,  41.818,    1.799
           4,    23040,    64,       0,   1474560, 5898240,         4.879,       38.684,   154.735,  38.684,    1.664
...

Peak bandwidth measurements per element size and access type
	Read only accesses:
		int1:    1380.11 GB/sec
		int2:    1479.92 GB/sec
		int4:    1458.74 GB/sec
		max:     1479.92 GB/sec
	Read-write accesses:
		int1:     423.37 GB/sec
		int2:     419.64 GB/sec
		int4:     342.76 GB/sec
		max:      423.37 GB/sec

shmembench execution output:

CUDA shmembench (shared memory bandwidth microbenchmark)
------------------------ Device specifications ------------------------
Device:              GeForce GTX 480
CUDA driver version: 8.0
GPU clock rate:      1550 MHz
Memory clock rate:   950 MHz
Memory bus width:    384 bits
WarpSize:            32
L2 cache size:       768 KB
Total global mem:    1530 MB
ECC enabled:         No
Compute Capability:  2.0
Total SPs:           480 (15 MPs x 32 SPs/MP)
Compute throughput:  1488.00 GFlops (theoretical single precision FMAs)
Memory bandwidth:    182.40 GB/sec
-----------------------------------------------------------------------
Total GPU memory 1605042176, free 1481957376
Buffer sizes: 3x8MB
Kernel execution time
	benchmark_shmem  (32bit):    57.964 msecs
	benchmark_shmem  (64bit):    57.943 msecs
	benchmark_shmem (128bit):    87.491 msecs
Total operations executed
	shared memory traffic    :          86 GB
	shared memory operations : 21487419392 operations (32bit)
	shared memory operations : 10743709696 operations (64bit)
	shared memory operations :  5371854848 operations (128bit)
Memory throughput
	using  32bit operations   : 1482.81 GB/sec (370.70 billion accesses/sec)
	using  64bit operations   : 1483.35 GB/sec (185.42 billion accesses/sec)
	using 128bit operations   :  982.38 GB/sec ( 61.40 billion accesses/sec)
	peak operation throughput :  370.70 Giga ops/sec
Normalized per SM
	shared memory operations per clock (32bit) :  239.16 (per SM 15.94)
	shared memory operations per clock (64bit) :  119.63 (per SM  7.98)
	shared memory operations per clock (128bit):   39.61 (per SM  2.64)

constbench execution output:

constbench (constant memory bandwidth microbenchmark)
------------------------ Device specifications ------------------------
Device:              GeForce GTX 480
CUDA driver version: 8.0
GPU clock rate:      1550 MHz
Memory clock rate:   950 MHz
Memory bus width:    384 bits
WarpSize:            32
L2 cache size:       768 KB
Total global mem:    1530 MB
ECC enabled:         No
Compute Capability:  2.0
Total SPs:           480 (15 MPs x 32 SPs/MP)
Compute throughput:  1488.00 GFlops (theoretical single precision FMAs)
Memory bandwidth:    182.40 GB/sec
-----------------------------------------------------------------------
Total GPU memory 1605042176, free 1480908800
Kernel execution time
	benchmark_constant  (32bit):   12.3310 msecs
	benchmark_constant  (64bit):    7.9485 msecs
	benchmark_constant (128bit):    9.9482 msecs
Total operations executed
	constant memory array size :        4096 bytes
	constant memory traffic    :       17180 MB
	constant memory operations :  4294967296 operations (32bit)
	constant memory operations :  2147483648 operations (64bit)
	constant memory operations :  1073741824 operations (128bit)
Memory throughput
	using  32bit operations : 1393.23 GB/sec (348.31 billion accesses/sec)
	using  64bit operations : 2161.40 GB/sec (270.18 billion accesses/sec)
	using 128bit operations : 1726.93 GB/sec (107.93 billion accesses/sec)
Normalized per SM
	Constant memory operations per clock (32bit) :  224.71 (per SM 14.98)
	Constant memory operations per clock (64bit) :  174.31 (per SM 11.62)
	Constant memory operations per clock (128bit):   69.63 (per SM  4.64)
Compute overhead
	Addition operations per constant memory operation  (32bit): 1
	Addition operations per constant memory operation  (64bit): 2
	Addition operations per constant memory operation (128bit): 4

Publications

If you find this benchmark tool useful for your research please don't forget to provide citation to the following paper:

Konstantinidis, E.; Cotronis, Y., "A quantitative performance evaluation of fast on-chip memories of GPUs", 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Heraklion, Crete, Greece, pp. 448-455, 2016
doi: 10.1109/PDP.2016.56

gpumembench's People

Contributors

ekondis avatar

Watchers

James Cloos avatar Heng Wang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.