cnugteren / clblast

Tuned OpenCL BLAS

License: Apache License 2.0

CMake 0.84% C++ 81.62% C 10.16% Python 3.98% Cython 3.40%
blas opencl blas-libraries clblas matrix-multiplication gemm gpu


clblast's Issues

Provide Visual Studio 2010 support (for Python 3.4)

Hi,

I have a couple of neural net projects for OpenCL, i.e. https://github.com/hughperkins/DeepCL and https://github.com/hughperkins/clnn. They're currently using clBLAS. I'm tentatively interested in investigating using CLBlast instead, since e.g. psyhtest recommends it here. I notice you require C++11. I currently support C++0x and Visual Studio 2010 as standard across my projects, for a few reasons:

  • many people still use old versions of gcc, like 4.4 or so
  • many people are using Python 3.4 (which means Visual Studio 2010) or Python 2.7 (which means Visual Studio 2008; 2008 is basically approximately compatible with 2010, in either direction; both are approximately C++0x, in my experience, with a couple of smallish discrepancies)

I know that C++11 gives you a few conveniences as a developer, which are hard to resist :-) To what extent might you consider dropping your compiler requirement from C++11 down to C++0x / Visual Studio 2010?

Hugh

Exposing the transpose kernel to clients

Hi Cedric,

Is there a reason, other than its not being in the BLAS standard, not to expose the transpose functionality to the clients of CLBlast?

Since CLBlast has this, and probably other useful utility functions that are already implemented, would you consider making those available through a public API?

Power-of-2 kernels *much* slower than random-dimension kernels

I got this from multiple builds of CLBlast during the past week or so. For reference, my hand-written sgemm for 8192x8192 runs in 600 ms. As you can see, on an R9 290X with 4 GB of RAM, the 8000x8000 kernel runs expectedly quickly, while the 8192x8192 kernel is surprisingly slow. It is not due to particular database options, since with the defaults it runs a couple of times slower.

(with-default
    (with-engine clblast-single *command-queue*
      (facts
       "Matrix-matrix multiplication. Matrices of 8192x8192 (268 MB) are usually
demanding enough."
       (let [cnt 8000]
         (with-release [host-a (sge cnt cnt (range (* cnt cnt)))
                        host-b (sge cnt cnt (range (* cnt cnt)))
                        host-c (sge cnt cnt (range (* cnt cnt)))
                        gpu-a (transfer! host-a (clge cnt cnt))
                        gpu-b (transfer! host-a (clge cnt cnt))
                        gpu-c (transfer! host-a (clge cnt cnt))]

           (println "CPU:")
           (time (mm! 3 host-a host-b 2 host-c)) => host-c
           (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))
           (println "GPU:")
           (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))) => truthy)))))
CPU:
"Elapsed time: 16537.271501 msecs"
GPU:
"Elapsed time: 280.819371 msecs"
true
u.n.e.g.tutorial-opencl-test>
(with-default
    (with-engine clblast-single *command-queue*
      (facts
       "Matrix-matrix multiplication. Matrices of 8192x8192 (268 MB) are usually
demanding enough."
       (let [cnt 8192]
         (with-release [host-a (sge cnt cnt (range (* cnt cnt)))
                        host-b (sge cnt cnt (range (* cnt cnt)))
                        host-c (sge cnt cnt (range (* cnt cnt)))
                        gpu-a (transfer! host-a (clge cnt cnt))
                        gpu-b (transfer! host-a (clge cnt cnt))
                        gpu-c (transfer! host-a (clge cnt cnt))]

           (println "CPU:")
           (time (mm! 3 host-a host-b 2 host-c)) => host-c
           (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))
           (println "GPU:")
           (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))) => truthy)))))
CPU:
"Elapsed time: 17800.345095 msecs"
GPU:
"Elapsed time: 1453.541557 msecs"
true

TestVectorX and relatives in routine.cc are broken for subvectors and submatrices

This can be seen in xgemv.cc. Let's say I have a 2x3 general matrix of floats. Its size is 6*4 = 24 bytes.
Suppose I take the second row of that matrix. It is a vector of dimension 3 with x_offset=1 and x_inc=2. Note that the buffer size for that vector is also 24 bytes, since it is actually the buffer of the matrix.
Now, if I try to use that vector in sgemv, it will reach TestVectorX, https://github.com/blueberry/CLBlast/blob/development/src/routine.cc#L215.

That line demands that required_size be (n*inc+offset)*data_size, which is (3*2+1)*4 = 7*4 = 28. The routine then fails with kInsufficientMemoryX.

Actually, the correct formula should be ((n-1)*inc + 1 + offset)*data_size, since everything after the last element is not needed by the routines that use the vector.

The same problem is present in TestVectorY, TestMatrixA, TestMatrixB, TestMatrixC, and (probably) TestMatrixAP (I didn't use the last one and did not have an opportunity to test it).

The TestMatrixXXX required_size should be:
(ld*(two-1) + one + offset)*data_size
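A hedged sketch of the proposed checks (my own helper names, not CLBlast's actual code; one and two are the matrix dimensions as in routine.cc). For the 3-element row above this gives ((3-1)*2 + 1 + 1)*4 = 24 bytes, exactly the size of the matrix buffer:

// Hypothetical helpers illustrating the corrected formulas:
size_t RequiredVectorSize(const size_t n, const size_t inc,
                          const size_t offset, const size_t data_size) {
  return ((n - 1) * inc + 1 + offset) * data_size;  // up to the last element only
}
size_t RequiredMatrixSize(const size_t one, const size_t two, const size_t ld,
                          const size_t offset, const size_t data_size) {
  return (ld * (two - 1) + one + offset) * data_size;
}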

I would have created a pull request, but I have a strange issue with seeing the changes I make to routine.cc in the built library: I change src/routine.cc, go to build, make clean, make, sudo make install, then go to JOCLBlast and recompile and reinstall it. I still get error -1008, even if I comment out the only call to it... If I intentionally introduce a syntax error in some CLBlast file, the compilation fails, so the build definitely sees the changes. Maybe I am not linking properly to the updated code from Java, but I cannot find where the problem is.

Anyway, I hope you can reproduce this subtle bug and fix it upstream, so I can do a full rebuild, since it is practically a few minor typo fixes.

CLBlast development fails for xGEMM with "Matrix A's OpenCL buffer is too small" on Mali

I'm testing my CLBlast integration with the OpenCL branch of Caffe. I am observing multiple failures when using the CLBlast development branch on at least two Mali-T628 based platforms (Odroid XU3 with driver v4.0; Chromebook 2 with driver v6.0).

I don't understand the error "Matrix A's OpenCL buffer is too small", since the same program works fine with CLBlast built from the 0.6.0 release.

@CNugteren
To reproduce, please log into dividiti's Odroid XU3 and run e.g.:

$ LD_LIBRARY_PATH=/data/install/lib-clblast-dev/lib:/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-dvdt-clblast#55/build/test/test_all.testbin \
--gtest_filter=NetTest/3.TestBackwardWithAccuracyLayer

Setting to use device 0
Note: Google Test filter = NetTest/3.TestBackwardWithAccuracyLayer
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from NetTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NetTest/3.TestBackwardWithAccuracyLayer
F0505 22:12:46.621851 16035 greentea_math_functions.cpp:275] Check failed: static_cast<int>(status) == static_cast<int>(clblast::StatusCode::kSuccess) (-1011 vs. 0) GREENTEA ERROR: CLBlast error
*** Check failure stack trace: ***
    @ 0xb6ee3060  (unknown)
    @ 0xb6ee2f5c  (unknown)
    @ 0xb6ee2b78  (unknown)
    @ 0xb6ee4f98  (unknown)
    @ 0xb55d4630  caffe::greentea_gpu_gemm<>()
    @ 0xb562d30c  caffe::InnerProductLayer<>::Forward_gpu()
    @ 0xb5500e84  caffe::Net<>::ForwardFromTo()
    @ 0xb55010c8  caffe::Net<>::Forward()
    @   0x302c34  caffe::NetTest_TestBackwardWithAccuracyLayer_Test<>::TestBody_Impl()
    @   0x3e98dc  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @   0x3e4cfa  testing::Test::Run()
    @   0x3e4d8a  testing::TestInfo::Run()
    @   0x3e4e62  testing::TestCase::Run()
    @   0x3e5060  testing::internal::UnitTestImpl::RunAllTests()
    @   0x3e525c  testing::UnitTest::Run()
    @    0xe2f5c  main
    @ 0xb525e632  (unknown)
Aborted

Exactly the same command referring to the 0.6.0 release works absolutely fine:

$ LD_LIBRARY_PATH=/data/install/lib-clblast-0.6.0/lib:/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-dvdt-clblast#55/build/test/test_all.testbin \
--gtest_filter=NetTest/3.TestBackwardWithAccuracyLayer

Setting to use device 0
Note: Google Test filter = NetTest/3.TestBackwardWithAccuracyLayer
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from NetTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NetTest/3.TestBackwardWithAccuracyLayer
[       OK ] NetTest/3.TestBackwardWithAccuracyLayer (2447 ms)
[----------] 1 test from NetTest/3 (2447 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2448 ms total)
[  PASSED  ] 1 test.

cl_events are not set properly

The functions in CLBlast receive a pointer to cl_event as their last parameter. This pointer is not properly filled during the actual kernel launches.

(I tried to analyze the path that the given pointer takes through the CLBlast C++ infrastructure before it ends up in the Launch function in clpp11.h, but I cannot pinpoint "the" line that causes this error. Somewhere, the cl_event pointer may be copied accidentally.)

Steps to reproduce: In the https://github.com/CNugteren/CLBlast/blob/master/samples/sgemm.c sample, after the CLBlastSgemm call, change/extend the clWaitForEvents call as follows:

printf("The event is %p\n", event);
int result = clWaitForEvents(1, &event);
printf("Result of waiting: %d\n", result);

This prints

The event is 0000000000000000
Result of waiting: -58

(where -58 is the CL error code for CL_INVALID_EVENT)

I'd expect the cl_event pointer that is passed to the API function to be properly filled with the pointer that was returned from the internal clEnqueueNDRangeKernel call.

Note: in most OpenCL functions and libraries, the cl_event is optional. This means that it should also be possible to pass a nullptr as the event parameter to the CLBlast functions. Currently, this is not possible (it causes a nasty crash/segfault). I assume that this can be fixed together with this issue - otherwise, I'd open it as a dedicated RFE. A sketch of the expected behaviour follows below.
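A hedged sketch of that expected behaviour (C API names as in the current clblast_c.h; the buffers, sizes and queue are assumed to be set up as in the sample):

/* Sketch: both usages below should be valid. */
void gemm_and_wait(cl_command_queue queue, cl_mem a, cl_mem b, cl_mem c,
                   size_t m, size_t n, size_t k) {
  cl_event event = NULL;
  CLBlastStatusCode status = CLBlastSgemm(
      CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
      m, n, k, 1.0f, a, 0, k, b, 0, n, 0.0f, c, 0, n, &queue, &event);
  if (status == CLBlastSuccess) {
    clWaitForEvents(1, &event);  /* expected: event is a valid cl_event here */
    clReleaseEvent(event);
  }
  /* ...and, as in most OpenCL APIs, NULL should also be accepted: */
  CLBlastSgemm(CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
               m, n, k, 1.0f, a, 0, k, b, 0, n, 0.0f, c, 0, n, &queue, NULL);
}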

Calling CLBlast functions and choosing non-default device

I am trying to use CLBlast from Clojure through JOCLBlast. Quite exotic setup. Now, the problem that I have is that I have a Clojure BLAS library that already does lots of opencl stuff (that I would like to reuse) AND it is OpenCL 2.0 only. This is important because of my 3 GPUs, the 1st (default) is OpenCL 1.2 and 2nd and 3rd are OpenCL 2.0. This led me to (maybe) discover an issue with CLBlast:

I create a queue for the 2nd GPU, and I am trying to call a CLBlast method with that queue. I get error -2048, which indicates that the kernel failed to enqueue. I suspect that this is because CLBlast tries to use a program that was compiled for the 1st GPU, so the queue that I supplied is invalid in that context.

Now, is there a way to choose the GPU (context, device) that CLBlast should use, or does it default to the 1st one? I can't see any setup-related methods in clblast_c.h or clblast.h. Does it matter at all? I am not sure about the inner workings of CLBlast, but it is the only thing that I could think of...

EDIT: I know that there is an option to choose the device in the tuning phase during the build. I need to choose the device(s) in the client program, once libclblast has been built and is used as a provided binary.
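For what it's worth, CLBlast has no global setup call by design: it derives the context and device from the queue that is passed in, so device selection happens entirely on the OpenCL side. A hedged sketch (error handling omitted; the index 1 picks the 2nd GPU):

#include <CL/cl.h>

cl_command_queue MakeQueueForDevice(size_t device_index) {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id devices[8];
  cl_uint num_devices = 0;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);
  cl_device_id device = devices[device_index];  // e.g. 1 for the 2nd GPU
  cl_int err;
  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  return clCreateCommandQueue(context, device, 0, &err);
}
// Any CLBlast call made with this queue should compile its kernels for that
// device; if it does not, that would indeed be a caching bug as suspected above.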

The future of CLBlast

Hi Cedric,

CLBlast is great work! I also learned a lot from your blog posts about matrix multiplication.
I've noticed that CLBlast development has stalled a bit in the last couple of months, after a period of rapid development.
Do you plan to develop/support it further, or is it on hold due to your other commitments?

SGEMM failures in CLBlast tests

Two failures in the SGEMM test, using -full_test, on an NVIDIA 940M, with libopenblas installed, on Ubuntu 64-bit:

* Testing 'regular behaviour' for '1 (col-major) 1 (transposed) 0 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ........................................::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::................
   ........................................................::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ........................................::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::................
   ........................................................::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ..................::::::::::::::::::..................::::::::::
   ::::::::..................::::::::::::::::::..................::
   ::::::::::::::::..................::::::::::::::::::............
   ......::::::::::::::::::..................::::::::::::::::::....
   ..............::::::::::::::::::..................::::::::::::::
   ::::..................::::::::::::::::::..................::::::
   ::::::::::::..................::::::::::::::::::................
   ..::::::::::::::::::..................::::::::::::::::::........
   ..........::::::::::::::::::..................::::::::::::::::::
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ..........................................................::::::
   :X::::::::::..................::::::::::::::::::................
   ................................................................
   ..........::::::::::::::::::..................::::::::::::::::::
   ..................::::::::::::::::::..................::::::::::
   ::::::::..................::::::::::::::::::..................::
   ::::::::::::::::..................::::::::::::::::::............
   ......::::::::::::::::::..................::::::::::::::::::....
   ..............::::::::::::::::::..................::::::::::::::
   ::::..................::::::::::::::::::..................::::::
   ::::::::::::..................::::::::::::::::::................
   ..::::::::::::::::::..................::::::::::::::::::........
   ..........::::::::::::::::::..................::::::::::::::::::
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ................................................................
   ..........................................................::::::
   :X::::::::::..................::::::::::::::::::................
   ................................................................
   ..........::::::::::::::::::..................::::::::::::::::::
   Error rate 0.2%: m=64 n=7 k=64 layout=1 transA=1 transB=0 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 
   Error rate 0.0%: m=64 n=64 k=64 layout=1 transA=1 transB=0 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 
   Pass rate 100.0%: 4606 passed / 0 skipped / 2 failed

BLAS packages installed:

 dpkg -l '*blas*' | grep ii
ii  libblas-common   1.2.20110419-10 amd64        Dependency package for all BLAS implementations
ii  libblas-dev      1.2.20110419-10 amd64        Basic Linear Algebra Subroutines 3, static library
ii  libblas3         1.2.20110419-10 amd64        Basic Linear Algebra Reference implementations, shared library
ii  libopenblas-base 0.2.14-1ubuntu1 amd64        Optimized BLAS (linear algebra) library (shared library)
ii  libopenblas-dev  0.2.14-1ubuntu1 amd64        Optimized BLAS (linear algebra) library (development files)

DDOT failures on Mali-T628

With CLBlast 0.6.0, I observe many DDOT failures on Mali-T628 (Odroid-XU3 with Mali driver v4.0; Chromebook 2 with Mali driver v6.0):

* Running on OpenCL device 'Mali-T628'.
* Starting tests for the 'DDOT' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
* Testing 'regular behaviour' for 'default':
   ///////////////////////////
   Status code -55 (expected 0): n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=7 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Status code -55 (expected 0): n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Pass rate   0.0%: 0 passed / 0 skipped / 27 failed
* Testing 'invalid buffer sizes' for 'default':
   Invalid Size for X
.Invalid Size for X
.Invalid Size for X
.Invalid Size for X
.Invalid Size for X
.Invalid Size for X
.Invalid Size for Y
.Invalid Size for Y
./
   Status code -55 (expected 0): n=64 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Pass rate  88.9%: 8 passed / 0 skipped / 1 failed
* Completed all test-cases for this routine. Results:
   8 test(s) passed
   0 test(s) skipped
   28 test(s) failed

Tuning to 2nd (non-default) device

I have 3 GPUs:

  1. (default) AMD Radeon R9 270X (Pitcairn)
  2. AMD Radeon R9 290X (Hawaii)
  3. AMD Radeon R9 290X (Hawaii)

When I start tuning, it automatically uses the default card. I would like to choose one of the two Hawaii chips, which are more powerful and more important to me. How do I do that?
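For reference, the tuner binaries accept the same -platform and -device options that show up in their output, so something like the following should target the 2nd GPU (the binary name here is assumed from a -DTUNERS=ON build):

./clblast_tuner_xgemm -platform 0 -device 1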

mesa cl

Hi, thanks for creating this project!

I'm the author of https://github.com/fommil/netlib-java and I'd really like to get this working for my Radeon R9 290x on ArchLinux.

I'm trying to do this using purely free/libre software, so I'm installing the mesa implementation of OpenCL. Have you ever looked into it or know if I'd need to do anything special?

Ideally, tests should not depend on clBLAS library

Ideally, tests should not depend on clBLAS library.

For example, you could either compare with ATLAS/OpenBLAS, or use some simple slow-but-probably-correct CPU code, perhaps like:

#include <iostream>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <clew.h>   // assumed: clewInit() below comes from the clew OpenCL loader
#include <clBLAS.h>
#include <stdlib.h>
using namespace std;

  cl_int err;
  cl_platform_id platform = 0;
  cl_device_id device = 0;
  cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
  cl_context ctx = 0;
  cl_command_queue queue = 0;
  cl_mem bufA, bufB, bufC;
  cl_event event = NULL;
  int ret = 0;

void clgemm(int colmaj, char transAchar, char transBchar, int M, int N, int K, float alpha, float *A, int lda,
     float *B, int ldb, float beta, float *C, int ldc, float *result) {
clblasTranspose transA = transAchar == 'n' ? clblasNoTrans : clblasTrans;
clblasTranspose transB = transBchar == 'n' ? clblasNoTrans : clblasTrans;

size_t off = 0;
size_t offA = 0;
size_t offB = 0;
size_t offC = 0;

clblasOrder order;
if(colmaj == 1 ) {
  order = clblasColumnMajor;
} else {
  order = clblasRowMajor;
}

  bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                        NULL, &err);
  bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                        NULL, &err);
  bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                        NULL, &err);

  err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
      M * K * sizeof(*A), A, 0, NULL, NULL);
  err = clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0,
      K * N * sizeof(*B), B, 0, NULL, NULL);
  err = clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0,
      M * N * sizeof(*C), C, 0, NULL, NULL);

  err = clblasSgemm(order, transA, transB, M - off, N - off, K - off,
                       alpha, bufA, offA, lda,
                       bufB, offB, ldb, beta,
                       bufC, offC, ldc,
                       1, &queue, 0, NULL, &event);
  if (err != CL_SUCCESS) {
      printf("clblasSgemmEx() failed with %d\n", err);
      ret = 1;
      exit(1);
  }
  else {
      err = clWaitForEvents(1, &event);
      err = clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0,
                                M * N * sizeof(*result),
                                result, 0, NULL, NULL);
  }

  clReleaseMemObject(bufC);
  clReleaseMemObject(bufB);
  clReleaseMemObject(bufA);

}

void copy(float *target, float *source, int numels ) {
  for(int i = 0; i < numels; i++) {
    target[i] = source[i];
  }
}

// assumes row major
void transpose(float *A, int rows, int cols) {
  float *A_ = new float[rows * cols];
  for(int i=0; i < rows; i++ ) {
    for(int j = 0; j< cols; j++) {
      A_[j * rows + i] = A[i * cols + j];
    }
  }
  copy(A, A_, rows * cols);
  delete[] A_;
}

// assumes row major
void mult(float *C, float *A, float *B, int M, int K, int N) {
  for(int m = 0; m < M; m++ ) {
    for(int n = 0; n < N; n++ ) {
      float sum = 0;
      for(int k = 0; k < K; k++ ) {
        sum = sum + A[m * K + k] * B[k * N + n];
      }
      C[m * N + n] = sum;
    }
  }
}

bool test1(int colmaj, int M, int N, int K, int transAint, int transBint) {
  char transa = transAint == 1 ? 't' : 'n';
  char transb = transBint == 1 ? 't' : 'n';
//  cout << "colmaj=" << colmaj << " " << transa << " " << transb << " M=" << M << " K=" << K << " N=" << N << endl;

  float alpha = 1;
  int lda, ldb, ldc;

  if(colmaj == 1) {
    if(transAint == 1) {
       lda = K;
    } else {
       lda = M;
    }
    if(transBint == 1) {
       ldb = N;
    } else {
       ldb = K;
    }
  } else {
    if(transAint == 1) {
       lda = M;    
    } else {
       lda = K;
    }
    if(transBint == 1) {
       ldb = K;    
    } else {
       ldb = N;
    }
  }

  if(colmaj == 1) {
    ldc = M;
  } else {
    ldc = N;
  }

  float beta = 0;
  // assume these are row major, untransposed
  float *A = new float[M * K];
  float *B = new float[K * N];
  float *C = new float[M * N];
  for(int i = 0; i < M * K; i++) {
     A[i] = rand() / (float)RAND_MAX - 0.5f;
  }
  for(int i = 0; i < N * K; i++) {
    B[i] = rand() / (float)RAND_MAX - 0.5f;
  }
  for(int i = 0; i < M * N; i++) {
    C[i] = 0.0f;
  }

//  cout << "op(A):" << endl;
//    for(int m=0; m < M; m++) {
//      for(int k = 0; k < K; k++) {
//        cout << A[m * K + k] << " ";
//      }
//      cout << endl;
//    }

//  cout << "op(B):" << endl;
//    for(int k = 0; k < K; k++) {
//      for(int n=0; n < N; n++) {
//        cout << B[k * N + n] << " ";
//      }
//      cout << endl;
//    }

   float *Aforblas = new float[M*K];
   float *Bforblas = new float[K * N];
   float *Cforblas = new float[M*N];
   copy(Aforblas, A, M*K);
   copy(Bforblas, B, N*K);

  float *Cours = new float[M * N];

  float *Aour = new float[M * K];
  float *Bour = new float[K * N];
  copy(Aour, A, M * K);
  copy(Bour, B, N * K);
  bool flipAforblas = !(colmaj == 1) != !(transAint == 1);
  bool flipBforblas = !(colmaj == 1) != !(transBint == 1);
  if(flipAforblas) {
    transpose(Aforblas, M, K);
   }
  if(flipBforblas) {
    transpose(Bforblas, K, N);
   }
  mult(Cours, A, B, M, K, N);

//  cout<< "result from CPU: " << endl;
//  for(int m = 0; m < M; m++) {
//    for(int n = 0; n < N; n++) {
//      int i = m + n * M;
//      cout << Cours[i] << " ";
//    }
//    cout << endl;
//  }

  float *clout = new float[M * N];
  clgemm(colmaj, transa, transb, M, N, K, alpha, Aforblas, lda,
     Bforblas, ldb, beta, C, ldc, clout);
  if(colmaj == 1 ) {
    transpose(clout, N, M);
  }
  bool ok = true;
  for(int m = 0; m < M; m++) {
    for(int n = 0; n < N; n++) {
      int i = m + n * M;
//      cout << "  " << i << " " << Cours[i] << " " << clout[i] << endl;
      float diff = clout[i] - Cours[i];
      diff = diff < 0 ? - diff : diff;
      if(diff > 0.0001) {
//         cout << "ERROR " << M << " " << N << " " << K << " " << transa << " " << transb << endl;
         ok = false;
//         exit(1);
      }
    }
  }
  if(!ok) {
   cout << "ERROR colmaj=" << colmaj << " M=" << M << " N=" << N << " K=" << K << " transa=" << transa << " transb=" << transb << endl;
  }
  // also release the scratch copies to avoid leaking per-test allocations
  delete[] A;
  delete[] B;
  delete[] C;
  delete[] Aforblas;
  delete[] Bforblas;
  delete[] Cforblas;
  delete[] Aour;
  delete[] Bour;
  delete[] Cours;
  delete[] clout;
  return ok;
}

int main(int argc, char *argv[]) {
  clewInit();

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {
      printf( "clGetPlatformIDs() failed with %d\n", err );
      return 1;
  }
  cout << "got platforms" << endl;

  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  if (err != CL_SUCCESS) {
      printf( "clGetDeviceIDs() failed with %d\n", err );
      return 1;
  }

  props[1] = (cl_context_properties)platform;
  ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
  if (err != CL_SUCCESS) {
      printf( "clCreateContext() failed with %d\n", err );
      return 1;
  }

  queue = clCreateCommandQueue(ctx, device, 0, &err);
  if (err != CL_SUCCESS) {
      printf( "clCreateCommandQueue() failed with %d\n", err );
      clReleaseContext(ctx);
      return 1;
  }

  err = clblasSetup();
  if (err != CL_SUCCESS) {
      printf("clblasSetup() failed with %d\n", err);
      clReleaseCommandQueue(queue);
      clReleaseContext(ctx);
      return 1;
  }

  test1(1, 1, 7, 1, 0, 0);
  test1(1, 1, 7, 1, 1, 1);
//  for(int colmaj = 0; colmaj <= 1; colmaj++ ) {
  int colmaj = 1; {
    for(int m=1; m <= 16; m++) {
      for(int n=1; n <= 16; n++) {
        for(int k=1; k <= 16; k++) {
          for(int transA =0; transA <= 1; transA++) {
  //        int transA = 1; {
            for(int transB =0; transB <= 1; transB++) {
//            int transB = 0; {
  //            test1();
              test1(colmaj, m, n, k, transA, transB);
            }
          }
        }
      }
    }
  }
  // check:
  // colmaj transa transb res
  // 0 0 0 ok
  // 0 1 0 ok
  // 0 0 1 FAIL 7 1 1 n t
  // 0 1 1 ok
  // 1 0 0 ok
  // 1 1 0 FAIL 1 7 1 t n
  // 1 0 1 ok
  // 1 1 1 ok

  /* Finalize work with clblas. */
  clblasTeardown();

  /* Release OpenCL working objects. */
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);

  return 0;
}

(I wrote this for clBLAS, and it shows that clBLAS is unfortunately broken in version 2.4, which is why I'm sort of hunting for a new BLAS solution for OpenCL :-P )

Missing add_test() in CMakeLists.txt

Hello,

I'm writing a Gentoo package for CLBlast:
sci-libs/clblast

The Gentoo package manager (Portage) can perform unit tests automatically after build and before installation if they are declared properly in CMakeLists.txt.

You have done this for CLTune using add_test().

I'll create a pull request if I get this working.

Performance parity with clBLAS (for small sizes)

Although complete performance parity with clBLAS is likely out of scope for this project, because clBLAS contains very specific kernels to work around limitations of AMD devices, we'd preferably still be as close as possible to its performance.

Of particular interest are situations where clBLAS outperforms CLBlast on NVIDIA or Intel hardware.

As described in issue #73, small or irregular configurations like (128,361,1152) are up to a factor of 3x slower on AMD, and 2x on NVIDIA, compared to clBLAS.

Error -2048 if context is released and acquired again

This is the actual cause of previously reported Issue #43

I am using CLBlast (development branch) through JOCLBlast, but with enough experience behind me, I pretty much suspect that this issue is due to CLBlast caching (although I didn't try this directly in C++, since my C++-fu is too rusty).

What's happening:

  1. I create device, context and queue, and use them to call some (JO)CLBlast function - works perfectly
  2. I release the queue, create it again, and then call (JO)CLBlast - works perfectly
  3. I release the queue and the context, create both again, and call (JO)CLBlast - I get a -2048 error

It seems to me that somehow the old, cached stuff gets mixed up with the newly provided queue.
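For clarity, a hedged C sketch of those three steps (sgemm_call() is a hypothetical stand-in for any CLBlast invocation; device/props setup as usual):

/* 1. fresh context + queue: works */
queue = clCreateCommandQueue(ctx, device, 0, &err);
sgemm_call(queue);

/* 2. recreate only the queue: still works (same context, cache still valid) */
clReleaseCommandQueue(queue);
queue = clCreateCommandQueue(ctx, device, 0, &err);
sgemm_call(queue);

/* 3. recreate context AND queue: fails with -2048, presumably because the
      cached program still belongs to the released context */
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
queue = clCreateCommandQueue(ctx, device, 0, &err);
sgemm_call(queue);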

Visual Studio runtime libraries are linked dynamically

This may be considered a non-issue for you, but the Visual Studio runtime libraries are currently linked dynamically. This basically means that someone who wants to use the compiled CLBlast DLL may encounter an error message saying that "MSVCP140.DLL" cannot be found.

I'm also not so familiar with the details, but I encountered the same problem in JCuda. A bit of web searching led to https://cmake.org/Wiki/CMake_FAQ#How_can_I_build_my_MSVC_application_with_a_static_runtime.3F . It boils down to some odd linker flags in the end, and I eventually fixed it by defining this file https://github.com/jcuda/jcuda-common/blob/master/JCudaCommon_CMake_flags.txt (which refers to https://github.com/jcuda/jcuda-common/blob/master/CMake_c_flag_overrides.cmake and https://github.com/jcuda/jcuda-common/blob/master/CMake_cxx_flag_overrides.cmake ) and including it before any project definition (this is important). For example, as in https://github.com/jcuda/jcufft/blob/master/JCufftJNI/CMakeLists.txt

Sorry, I know that things like this are somewhat annoying (i.e. distressingly non-academic), but I wanted to point it out.

Error during tuning for AMD R9 290X

My card 0 (default) is AMD R9 270X. I tuned that without problems.

Card 1 is an AMD R9 290X. I set DEFAULT_DEVICE to 1 and tune the card using the same steps as for the 270X. Everything goes fine until the Xgemm tuners. They fail without much information (I include what I see in the console below). An interesting thing is that in multiple runs they never fail at the same step: here it is at step 20, but other times it was 76 or some other number. It never reaches the end. I also tried with device 2 (my third card, also a 290X) and I get an error there too. Another curious fact is that the 290X is OpenCL 2.0 capable while the 270X is only 1.2, so the card with fewer hardware features passes while the better card fails. Also, sgemm WORKS when compiled and used through JOCLBlast and Neanderthal...

* Options given/available:
    -platform 0 [=default]
    -device 1 
    -precision 32 (single) [=default]
    -m 1024 [=default]
    -n 1024 [=default]
    -k 1024 [=default]
    -alpha 2.000000 [=default]
    -beta 2.000000 [=default]
    -fraction 2048.000000 [=default]


[==========] Initializing on platform 0 device 1
[==========] Device name: 'Hawaii' (OpenCL 2.0 AMD-APP (1912.5))

[----------] Testing reference Xgemm
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (25 ms) - 1 out of 1

[----------] Testing kernel Xgemm
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 1 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 2 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 3 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 4 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 5 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 6 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 7 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 8 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 9 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 10 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 11 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 12 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 13 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 14 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 15 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (102 ms) - 16 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 17 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (3 ms) - 18 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (2 ms) - 19 out of 110
[ RUN      ] Running Xgemm
[       OK ] Completed Xgemm (1 ms) - 20 out of 110
CMakeFiles/alltuners.dir/build.make:57: recipe for target 'CMakeFiles/alltuners' failed
make[3]: *** [CMakeFiles/alltuners] Segmentation fault (core dumped)
CMakeFiles/Makefile2:223: recipe for target 'CMakeFiles/alltuners.dir/all' failed
make[2]: *** [CMakeFiles/alltuners.dir/all] Error 2
CMakeFiles/Makefile2:230: recipe for target 'CMakeFiles/alltuners.dir/rule' failed
make[1]: *** [CMakeFiles/alltuners.dir/rule] Error 2
Makefile:214: recipe for target 'alltuners' failed
make: *** [alltuners] Error 2

Implement batched BLAS routines

Batched operations involve performing many small linear-algebra operations, such as GEMV or GEMM. In particular, batched GEMM has become increasingly popular due to deep learning. More parallelism can be exploited when making a single batched BLAS call compared to multiple regular BLAS calls on small matrices. NVIDIA's cuBLAS, for example, has a batched GEMM interface. A loop-based sketch of what a batched call would replace is shown below.
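A hedged illustration (clblast::Gemm as used elsewhere in this document; the batched variant and its name are hypothetical here):

#include <clblast.h>
#include <vector>

void GemmLoop(const std::vector<cl_mem>& as, const std::vector<cl_mem>& bs,
              std::vector<cl_mem>& cs, size_t m, size_t n, size_t k,
              cl_command_queue queue) {
  for (size_t i = 0; i < as.size(); ++i) {  // one kernel launch per small matrix
    clblast::Gemm(clblast::Layout::kRowMajor,
                  clblast::Transpose::kNo, clblast::Transpose::kNo,
                  m, n, k, 1.0f, as[i], 0, k, bs[i], 0, n,
                  0.0f, cs[i], 0, n, &queue);
  }
  // A batched routine would take all buffers/offsets at once and launch a
  // single kernel over the whole batch, exposing the extra parallelism
  // mentioned above.
}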

Two potentially related papers are:

Compilation stuck with optimization options

Hi,

just for information: the compilation of the file CMakeFiles/clblast.dir/src/database/database.cpp.o gets stuck on armv7hl using 'gcc (SUSE Linux) 4.8.3 20140627'.

GCC tries to compile that file for 15 minutes and then kills itself.

The workaround we have applied is to change the optimization options from '-O3' to '-O0'.
We have tested different (older) GCC versions but we still have the issue; it is not present on x64.

Adding installable `find_package` file (for dependent projects) and bumping CMake dependency

Hello,

I'm trying to make it easier to link to CLBlast from dependent projects (say, Caffe). The typical way of writing a module script for find_package() seems like too much boilerplate for my idealistic mind, so I'm trying to use the "config" mode of find_package() (the second signature in the docs), that is, the install(EXPORT ... DESTINATION ...) command.

(It installs a declarative target-description file with absolute paths to the installed files, which can then be found by the dependent project's CMakeLists.txt and imported to create an "imported target". This is very similar to pkg-config.)

The problem is that exporting targets needs at least CMake version 3.2. Can we bump the requirement from 2.8 to 3.2?

(The current Ubuntu 16.04 LTS has CMake 3.2 in base image and 3.5 in updates.)

'to_string' is not a member of 'std' when cross-compiling with android-ndk-11c

When I cross-compile with android-ndk-11c, I get the error:
CLBlast/include/internal/clpp11.h:129:70: error: 'to_string' is not a member of 'std'
Some people say to use -std=c++11, but I already find it in the CMakeLists.txt:

if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC")
set(FLAGS "/Ox")
set(FLAGS "${FLAGS} /wd4715")
else ()
set(FLAGS "-O3 -std=c++11")
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")

I am not familiar with cross-compiling and the NDK. My cmake configuration is:
cmake -DCMAKE_AR=$TOOLCHAIN_DIR/arm-linux-androideabi-ar \
  -DCMAKE_C_COMPILER=$TOOLCHAIN_DIR/arm-linux-androideabi-gcc \
  -DCMAKE_CXX_COMPILER=$TOOLCHAIN_DIR/arm-linux-androideabi-g++ \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_LINKER=$TOOLCHAIN_DIR/arm-linux-androideabi-ld \
  -DCMAKE_NM=$TOOLCHAIN_DIR/arm-linux-androideabi-nm \
  -DCMAKE_OBJCOPY=$TOOLCHAIN_DIR/arm-linux-androideabi-objcopy \
  -DCMAKE_OBJDUMP=$TOOLCHAIN_DIR/arm-linux-androideabi-objdump \
  -DCMAKE_RANLIB=$TOOLCHAIN_DIR/arm-linux-androideabi-ranlib \
  -DCMAKE_STRIP=$TOOLCHAIN_DIR/arm-linux-androideabi-strip \
  -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR/CLBlast \
  -DOPENCL_INCLUDE_DIRS=$INSTALL_DIR/MaliOpenCL/include \
  -DOPENCL_LIBRARIES=$INSTALL_DIR/MaliOpenCL/lib/libGLES_mali.so $CLBLAST_ROOT

Is something wrong above, or could you tell me how to solve the problem?

Thank you!
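For reference, a common workaround (a hedged sketch, not part of CLBlast): the NDK's gnustl standard library lacks std::to_string even with -std=c++11, so one can either build against a toolchain whose standard library provides it (e.g. libc++), or patch clpp11.h to use a local shim such as:

#include <sstream>
#include <string>

// Hypothetical helper, not in CLBlast: string conversion without std::to_string,
// for standard libraries (like older NDK gnustl) that do not provide it.
template <typename T>
std::string to_string_shim(const T& value) {
  std::ostringstream stream;
  stream << value;  // relies only on operator<<, which is available everywhere
  return stream.str();
}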

Unit tests and tuners segfault on Linux/Beignet with a Haswell GT2 GPU

I tried to run the tests and they all fail the same way: they perform all sub-tests without errors and report a 100% pass rate, then segfault:

CLBlast/build $ ./clblast_test_xswap -verbose true

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [true]
    -clblas 1 [=default]
    -cblas 0 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile'.
* Starting tests for the 'SSWAP' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   . -> Test not completed: Reference CBLAS doesn't output error codes
* Testing 'regular behaviour' for 'default':
   Config: n=7 incx=1 incy=1 offx=0 offy=0 -> :
   Config: n=7 incx=1 incy=2 offx=0 offy=0 -> :
   Config: n=7 incx=1 incy=7 offx=0 offy=0 -> :
   Config: n=7 incx=2 incy=1 offx=0 offy=0 -> :
   Config: n=7 incx=2 incy=2 offx=0 offy=0 -> :
   Config: n=7 incx=2 incy=7 offx=0 offy=0 -> :
   Config: n=7 incx=7 incy=1 offx=0 offy=0 -> :
   Config: n=7 incx=7 incy=2 offx=0 offy=0 -> :
   Config: n=7 incx=7 incy=7 offx=0 offy=0 -> :
   Config: n=93 incx=1 incy=1 offx=0 offy=0 -> :
   Config: n=93 incx=1 incy=2 offx=0 offy=0 -> :
   Config: n=93 incx=1 incy=7 offx=0 offy=0 -> :
   Config: n=93 incx=2 incy=1 offx=0 offy=0 -> :
   Config: n=93 incx=2 incy=2 offx=0 offy=0 -> :
   Config: n=93 incx=2 incy=7 offx=0 offy=0 -> :
   Config: n=93 incx=7 incy=1 offx=0 offy=0 -> :
   Config: n=93 incx=7 incy=2 offx=0 offy=0 -> :
   Config: n=93 incx=7 incy=7 offx=0 offy=0 -> :
   Config: n=4096 incx=1 incy=1 offx=0 offy=0 -> :
   Config: n=4096 incx=1 incy=2 offx=0 offy=0 -> :
   Config: n=4096 incx=1 incy=7 offx=0 offy=0 -> :
   Config: n=4096 incx=2 incy=1 offx=0 offy=0 -> :
   Config: n=4096 incx=2 incy=2 offx=0 offy=0 -> :
   Config: n=4096 incx=2 incy=7 offx=0 offy=0 -> :
   Config: n=4096 incx=7 incy=1 offx=0 offy=0 -> :
   Config: n=4096 incx=7 incy=2 offx=0 offy=0 -> :
   Config: n=4096 incx=7 incy=7 offx=0 offy=0 -> :

   Pass rate 100.0%: 27 passed / 0 skipped / 0 failed
* Testing 'invalid buffer sizes' for 'default':
Segmentation fault

My build settings: -DTESTS=ON -DTUNERS=ON

Error when running CLBlast on ARM (Android)

I can run CLBlast on PC (Linux), but the same code fails on ARM (Android). I cross-compiled CLBlast and pushed it to Android. However, the program fails in gemm in clblast.cc, at about line 1495:
auto queue_cpp = Queue(queue);
But when I print the value of queue (0xb79ee300), it seems that the cl_command_queue queue is not a nullptr.

So, would you help me find the reason?

Thank you!

Dgemm bug for certain matrix dimensions where Sgemm works OK

My matrix tests revealed a very, very subtle issue with Dgemm:
Most of my tests work, but this one passes for floats and fails for doubles:
A: 2x3 column matrix ((1 2) (3 4) (5 6))
B: 3x2 column ((1 2 3) (4 5 6))
C: 2x2 column ((1 2) (3 4))

xgemm 2.0 * AB + 3.0 * C should update C to: ((47 62) (107 140))
and Sgemm does!
but Dgemm gives ((47 62) (9 12))

I would have sent you a test case in code, but I am doing this in Clojure, and the code is not more complex than the example I've given here.

I guess that the problem is that you do some pre-transformations and padding for some matrices, so in some cases, such as this one, some operations are done in the "wrong" places in memory.

Or maybe I am not using the API well, but I doubt that, given that all other tests work well and, what especially puzzles me, that Sgemm works fine on the same example!
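For reference, a hedged C++ sketch of an equivalent reproduction (the boilerplate is my own assumption, not an attached test case; it uses clblast::Gemm with the exact matrices above):

#include <CL/cl.h>
#include <clblast.h>
#include <cstdio>

int main() {
  // Column-major data for the matrices above
  double a[] = {1, 2, 3, 4, 5, 6};  // A: 2x3, columns (1 2) (3 4) (5 6)
  double b[] = {1, 2, 3, 4, 5, 6};  // B: 3x2, columns (1 2 3) (4 5 6)
  double c[] = {1, 2, 3, 4};        // C: 2x2, columns (1 2) (3 4)

  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, &err);
  cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, &err);
  cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(c), c, &err);

  // C := 2.0*A*B + 3.0*C (m=2, n=2, k=3, column-major, no transposes)
  clblast::Gemm(clblast::Layout::kColMajor,
                clblast::Transpose::kNo, clblast::Transpose::kNo,
                2, 2, 3, 2.0, buf_a, 0, 2, buf_b, 0, 3, 3.0, buf_c, 0, 2, &queue);
  clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, sizeof(c), c, 0, nullptr, nullptr);
  std::printf("C = ((%g %g) (%g %g)), expected ((47 62) (107 140))\n",
              c[0], c[1], c[2], c[3]);
  return 0;  // buffer/context cleanup omitted for brevity
}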

drop-in blas replacement?

It looks like nvblas can actually directly replace a standard non-GPU BLAS, i.e. install in place of OpenBLAS etc., and it will then automatically use the GPU to accelerate level-3 BLAS operations: http://docs.nvidia.com/cuda/nvblas/index.html#axzz4IfVfPuKV

I don't think such a thing exists for OpenCL, but one could presumably make one relatively easily by combining OpenBLAS and CLBlast, using CLBlast for anything that is n-cubey or so, i.e. level-3 operations, and OpenBLAS for everything else (which would otherwise be dominated by time spent copying to the GPU and back). Thoughts on:

  • whether such a thing already exists? (not necessarily using CLBlast, e.g. ViennaCL? clBLAS? etc.)
  • the usefulness of such a thing?
  • interest in writing it? (I'm pondering writing it; it seems like just a few wrapper methods, I guess - see the sketch after this list)
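A hedged sketch of the level-3 part of such a shim (my assumptions: a Fortran-ABI sgemm_ symbol, a context g_ctx and queue g_queue created once at load time, the no-transpose case only, and enum names as in the current clblast_c.h; a real shim would handle the transpose flags and fall back to the CPU BLAS for small sizes):

#include <CL/cl.h>
#include <clblast_c.h>

// Assumed to be created once at library load time (hypothetical names):
extern cl_context g_ctx;
extern cl_command_queue g_queue;

// Fortran-ABI sgemm_: column-major, all arguments passed by pointer.
extern "C" void sgemm_(const char* /*transa*/, const char* /*transb*/,
                       const int* m, const int* n, const int* k,
                       const float* alpha, const float* a, const int* lda,
                       const float* b, const int* ldb,
                       const float* beta, float* c, const int* ldc) {
  cl_int err;
  const size_t sa = (size_t)(*lda) * (*k) * sizeof(float);  // A: lda x k
  const size_t sb = (size_t)(*ldb) * (*n) * sizeof(float);  // B: ldb x n
  const size_t sc = (size_t)(*ldc) * (*n) * sizeof(float);  // C: ldc x n
  cl_mem ba = clCreateBuffer(g_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sa, (void*)a, &err);
  cl_mem bb = clCreateBuffer(g_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sb, (void*)b, &err);
  cl_mem bc = clCreateBuffer(g_ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sc, (void*)c, &err);
  CLBlastSgemm(CLBlastLayoutColMajor, CLBlastTransposeNo, CLBlastTransposeNo,
               *m, *n, *k, *alpha, ba, 0, *lda, bb, 0, *ldb,
               *beta, bc, 0, *ldc, &g_queue, NULL);
  clEnqueueReadBuffer(g_queue, bc, CL_TRUE, 0, sc, c, 0, NULL, NULL);  // copy result back
  clReleaseMemObject(ba); clReleaseMemObject(bb); clReleaseMemObject(bc);
}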

i(a)max buffer element size dependent on buffer (float/double) type size

iXamax requires an imax_buffer argument for receiving the result. The result is an (u)int?

https://github.com/CNugteren/CLBlast/blob/development/src/routines/level1/xamax.cc#L53
shows that the buffer is tested by TestVectorDot, with an element size of either 4 bytes (float) or 8 bytes (double).
However, the iamax buffer should not depend on whether the X buffer is float or double, since the index is always an int.
This is not a major issue, since simply providing 8 bytes instead of 4 in iDamax works OK, but it might be a source of confusion if not well documented.

Accept NULL for event or document otherwise

Similar to clBLAS, CLBlast accepts two pointer arguments:

cl_command_queue* queue, cl_event* event)

However, clBLAS accepts NULL for the event argument (e.g. see this code in the OpenCL branch of Caffe):

 &queue, 0, NULL, NULL));

This should either be accepted for compatibility with clBLAS, or documented otherwise.

Failing with CL_INVALID_EVENT_WAIT_LIST

I'm getting a "kKernelLaunchError" status code when launching my kernels. Adding a printf in the CheckError function to see the real underlying error, I'm getting a -57, i.e. CL_INVALID_EVENT_WAIT_LIST.

const size_t M = outputs;      // == 96
const size_t N = spatial_out;  // == 361
const size_t K = filter_dim;   // == 550

auto queue_plain = queue();
auto status = clblast::Gemm(
    clblast::Layout::kRowMajor,
    clblast::Transpose::kNo, clblast::Transpose::kNo,
    M, N, K,
    1.0f,
    weights[0](), 0, filter_dim,
    bufferCol(), 0, spatial_out,
    0.0f,
    bufferOutput(), 0, spatial_out,
    &queue_plain);

Looks like an issue with pad/transpose handling?

Visual Studio 2008/2010 support, and python 2.7/3.4 compatibility

Hi CNugteren,

Per your email just now, a huge chunk of my potential user-base are Python users on Windows. The current version of Visual Studio for Python 2.7 is Visual Studio 2008. For Python 3.4, it is Visual Studio 2010.

Please can you confirm that your C++11 version of CLBlast will compile and run correctly using Visual Studio 2008 and Visual Studio 2010.

By the way, which functionalities in C++11 do you see as 'killer functionalities' that are worth throwing away a huge chunk of one's potential user-base for?

Hugh

clBlast without sudo

Hello,

I am trying to use CLBlast on an external PC where I don't have sudo rights. The normal make build worked perfectly, to 100%. Now I can't do "sudo make install" because I have no sudo rights.

I tried to manipulate cmake_install.cmake so that it won't copy the files to /usr/... but instead into a local folder. When I now try to compile the sgemm.c sample, manually linking with "-L .../libclblast.so" and with the include path "-I .../include" (the "..." is a placeholder for the original path to each directory), it can't find CLBlastSgemm.

The error message is:

/tmp/tmpxft_000016cb_00000000-16_Test.o: In function `clBlast(float const*, float const*, float*)': tmpxft_000016cb_00000000-3_Test.cudafe1.cpp:(.text+0x347): undefined reference to `CLBlastSgemm'
collect2: error: ld returned 1 exit status

As you can see, it is a linker error which I can't get rid of. Can you help me, or is it not possible to install CLBlast without sudo rights?

greetings,

Jan
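For reference, a hedged sketch of the usual no-sudo route (paths here are placeholders): set the install prefix at configure time instead of editing cmake_install.cmake, then point the compiler, linker, and runtime loader at it. Note that -L expects a directory, not the libclblast.so file itself, which may be why CLBlastSgemm was not found:

cmake -DCMAKE_INSTALL_PREFIX=$HOME/clblast ..
make install
gcc samples/sgemm.c -I$HOME/clblast/include -L$HOME/clblast/lib -lclblast -lOpenCL -o sgemm
LD_LIBRARY_PATH=$HOME/clblast/lib ./sgemm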

database.py downloads external database, will fail on version mismatch

python ../scripts/database/database.py . ..

Downloading database from 'http://www.cedricnugteren.nl/tuning/clblast.db'...

Loading the database from disk...

Traceback (most recent call last):
  File "../scripts/database/database.py", line 283, in <module>
    database = LoadDatabase(file_db)
  File "../scripts/database/database.py", line 66, in LoadDatabase
    return pd.read_pickle(filename)
  File "/usr/lib/python2.7/dist-packages/pandas/io/pickle.py", line 60, in read_pickle
    return try_read(path)
  File "/usr/lib/python2.7/dist-packages/pandas/io/pickle.py", line 57, in try_read
    return pc.load(fh, encoding=encoding, compat=True)
  File "/usr/lib/python2.7/dist-packages/pandas/compat/pickle_compat.py", line 116, in load
    return up.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named indexes.base

From a bit of digging, the database format isn't compatible between pandas versions.

CLBlast 0.6.0 fails for xDOT with "CL_INVALID_WORK_ITEM_SIZE" on Mali

I'm testing my CLBlast integration with the OpenCL branch of Caffe. I am observing multiple failures when running Caffe tests involving xDOT on at least one Mali-T628 platform (Odroid XU3 with driver v4.0). The error suggests attempting to run too many threads in a particular dimension:

 -55, // CL_INVALID_WORK_ITEM_SIZE: ... or for a specific dimension

I suspect that my integration is incorrect, but I can't quite tell due to insufficient documentation. Using CLBlast for xDOT is disabled in this change, which encloses my code with #ifdef 0 ... #endif. Right above this block you can see the code using clBLAS, which I was trying to mimic. The clBLAS interface differs from the CLBlast interface in that the former requires passing two temporary buffers, while the latter requires passing only one.

@CNugteren
To reproduce, please log into dividiti's Odroid XU3 and run e.g.:

$ LD_LIBRARY_PATH=/data/install/lib-clblast-0.6.0/lib:/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-dvdt-clblast#56/build/test/test_all.testbin \
--gtest_filter=NetTest/3.TestBackwardWithAccuracyLayer

Setting to use device 0
Note: Google Test filter = NetTest/3.TestBackwardWithAccuracyLayer
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from NetTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NetTest/3.TestBackwardWithAccuracyLayer
F0505 23:44:21.248780 20262 greentea_math_functions.cpp:821] Check failed: static_cast<int>(status) == static_cast<int>(clblast::StatusCode::kSuccess) (-55 vs. 0) GREENTEA ERROR: CLBlast error
*** Check failure stack trace: ***
    @ 0xb6f4e060  (unknown)
    @ 0xb6f4df5c  (unknown)
    @ 0xb6f4db78  (unknown)
    @ 0xb6f4ff98  (unknown)
    @ 0xb5641dd6  caffe::greentea_gpu_dot<>()
    @ 0xb5686a02  caffe::SoftmaxWithLossLayer<>::Forward_gpu()
    @ 0xb556d694  caffe::Net<>::ForwardFromTo()
    @ 0xb556d8d8  caffe::Net<>::Forward()
    @   0x302c34  caffe::NetTest_TestBackwardWithAccuracyLayer_Test<>::TestBody_Impl()
    @   0x3e98dc  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @   0x3e4cfa  testing::Test::Run()
    @   0x3e4d8a  testing::TestInfo::Run()
    @   0x3e4e62  testing::TestCase::Run()
    @   0x3e5060  testing::internal::UnitTestImpl::RunAllTests()
    @   0x3e525c  testing::UnitTest::Run()
    @    0xe2f5c  main
    @ 0xb52cb632  (unknown)
Aborted

(Caffe built under the /data/caffe-dvdt-clblast#56 directory uses CLBlast for performing xDOT, which you can check with git diff in that directory.)

Unfortunately, I cannot test with the development branch due to #55.

Finally, as reported in #54, some DDOT tests fail but all the SDOT tests pass:

$ /data/build/lib-clblast-0.6.0/build-debug/clblast_test_xdot

Performance gemv vs gemm

Hello,

I encountered a weird runtime difference between the gemv and the gemm routines.
When I run both with the input M=4096, N=1, K=4096 on my GTX 480, the runtime of the gemm routine is 3.04 ms and the runtime of the gemv routine is 5.51 ms. I would have expected gemv to be faster than gemm, because it is made for exactly such an input. Could it be that gemv isn't yet optimized for the GTX 480, or is it normal that it is slower? In cuBLAS, by contrast, Sgemv is almost 2 times faster than Sgemm.

I call the gemv routine like this:
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up true -runs 100

Greetings,
Jan

Non-absolute level-1 routines

I know these are not part of the BLAS standard, but they are often needed in practice. I am talking about xSUM, IxMAX, IxMIN.

Having in mind that implementing these routines is a matter of one abs() call, what do you think of adding them to CLBlast, either through an additional option on an already existing routine, or as independent, matching non-abs routines?

Test failure, xgemm, NVIDIA 940M

./clblast_test_xgemm
...
* Testing 'regular behaviour' for '1 (col-major) 0 (regular) 0 (regular)':
   ::::::::..::..::::::::::..::..::.....:.:.......:.....:.:.......:
   Pass rate 100.0%: 64 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '1 (col-major) 0 (regular) 1 (transposed)':
   ::::::::::::::::..::..::..::..::.....:.:.....:.:.......:.......X
   Error rate 0.0%: m=64 n=64 k=64 layout=1 transA=0 transB=1 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 
   Pass rate  98.4%: 63 passed / 0 skipped / 1 failed
* Testing 'regular behaviour' for '1 (col-major) 0 (regular) 2 (conjugate)':

Repeated SGEMM calls leak memory

Steps to reproduce:

Take sgemm.cpp from the samples. Put an infinite for() loop around the clblast::Gemm() + clWaitForEvents() calls. Launch the sample and watch it eventually use up all memory.
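A hedged sketch of that reproduction (m, n, k, the buffers, and the queue are assumed to be set up as in sgemm.cpp; clblast::Gemm as in clblast.h):

// Sketch only: names come from the sample, not from this issue.
for (;;) {
  cl_event event = nullptr;
  clblast::Gemm(clblast::Layout::kRowMajor,
                clblast::Transpose::kNo, clblast::Transpose::kNo,
                m, n, k, 1.0f, buf_a, 0, k, buf_b, 0, n,
                0.0f, buf_c, 0, n, &queue, &event);
  clWaitForEvents(1, &event);
  clReleaseEvent(event);  // if memory still grows with this release in place,
                          // the leak must be internal to the library
}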

Context, programs, etc. initialization and cleanup

I am building a Clojure (JVM) library that uses CLBlast through JOCLBlast. The calling of CLBlast functions works without the need for setup and initialization.

That is great in many cases, especially during development. But in other cases, particularly in a long-lived server application, many context and queue pointers get created and released.

Now, CLBlast maintains its own cache under the hood, which holds pointers to all resources for all command queues that have been used since the start of the application. In applications that create and destroy queues often, CLBlast would collect a lot of garbage, and thus leak video memory.

Is there a way to handle this in the current CLBlast, and if not, is there hope for the addition of initialization/cleanup functions to CLBlast (I use it through plain C)? Unfortunately, my C-fu, and especially my C++-fu, is not strong enough to implement it myself and contribute in that way.

New tuning results

(See the README for details)

This is the place to post new tuning results. If you compiled with -DTUNERS=ON, ran one of the tuners on your device (or all perhaps?), and feel that these results should be included in the next release of CLBlast, please post them here.

You can do this by attaching the JSON files to this issue (archived in a .ZIP file).

database.py default selection is suboptimal

9683b50#diff-f60ddfcd75f983a02970d3e99b6b81f4R89
9683b50#diff-f60ddfcd75f983a02970d3e99b6b81f4R260

If I read the source correctly, the default selection for a vendor currently takes the lowest value over all parameters in the best per-device kernels. This can generate defaults that are potentially bad on all devices, especially for "flag" parameters, or for sets of parameters that trade off against each other. It will get worse as more tuning is added.

I think you want to select the kernel that had the lowest average (or median) runtime over all devices, and didn't fail on any.

Some constants are different than in cblas

This might not surface in CLBlast directly, since Layout, Transpose, etc. are enums, but it will bite users who call CLBlast from some other language/environment.

In cblas, row-major is 101 and column-major is 102.
In CLBlast these are enums, but as numbers I believe kRowMajor is 0 and kColMajor is 1 (at least they are when CLBlast is used from Java as an external library).

Is this intentional, and if not, would the right solution be to define Layout, Transpose, etc. with the same constants as the cblas standard?
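
For comparison, these are the values fixed by the CBLAS standard, next to a sketch of what value-matched CLBlast enums could look like (the second half is illustrative, not the current CLBlast definition):

// Values from the Netlib cblas.h standard:
enum CBLAS_ORDER     { CblasRowMajor = 101, CblasColMajor = 102 };
enum CBLAS_TRANSPOSE { CblasNoTrans = 111, CblasTrans = 112, CblasConjTrans = 113 };

// Value-matched CLBlast-style enums (sketch only):
enum class Layout    { kRowMajor = 101, kColMajor = 102 };
enum class Transpose { kNo = 111, kYes = 112, kConjugate = 113 };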

Severe SGEMM performance regression 0.7.1 -> 0.8.x/master

I'm seeing my entire application drop about 45% in performance when updating from 0.7.1 to the master branch. It does more than just SGEMM, so the SGEMM regression itself must be disastrous. This is on an R9 390.

On an Intel HD 530, the drop is even larger: more than 50%.

Typical SGEMM problem sizes:

M = 96,  N = 361, K = 550
M = 128, N = 361, K = 1152
M = 128, N = 361, K = 864
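
The sizes above can be compared head-to-head between 0.7.1 and master with the bundled client (flags as in the GEMV report earlier in this tracker), e.g.:

./clblast_client_xgemm -m 96 -n 361 -k 550 -alpha 1 -beta 0 -warm_up true -runs 100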

Inline function not found with Vivante OpenCL

When running the 'clblast_test_xher2k' test, the OpenCL compiler doesn't find the inline-defined OpenCL functions; we get this error:

Testing: n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 [CLBlast]OpenCL compiler error/warning:
(1245:0) : error : function: 'InitAccRegisters' hasn't the corresponding declaration
(1290:0) : error : function: 'MultiplyAccumulate' hasn't the corresponding declaration
(1339:0) : error : function: 'XgemmBody' hasn't the corresponding declaration
(1343:0) : error : function: 'StoreResults' hasn't the corresponding declaration
(1379:0) : error : function: 'XgemmBody' hasn't the corresponding declaration
(1383:0) : error : function: 'StoreResults' hasn't the corresponding declaration

We have tried to find where this comes from, but all the functions seem to be present in the OpenCL file. We dumped the OpenCL source that is sent to the OpenCL compiler and those definitions are present, so we don't understand how to deal with it. Is there a way to avoid this issue?

How to cite clblast?

Hi,

I'm writing a paper on cltorch, which I'll most likely just put on arXiv. Question: how should I cite CLBlast?

Compiler segfaults when compiling the clients

Hello,

I want to test the performance of CLBlastSgemm. For that I cloned the repo and created the build dir. Then I ran "cmake -DCMAKE_INSTALL_PREFIX=path/to//build/ -DCLIENTS=ON .." in the build dir to enable the performance clients. When I then run make install, I get the following output:

cmake -DCMAKE_INSTALL_PREFIX=/CLBlast/build/ -DCLIENTS=ON ..
-- Could NOT find 'clBLAS.h', install clBLAS or set CLBLAS_ROOT
-- Could NOT find clBLAS library, install it or set CLBLAS_ROOT
-- Could NOT find clBLAS (missing: CLBLAS_INCLUDE_DIRS CLBLAS_LIBRARIES)
-- Could NOT find 'cblas.h', install a CPU Netlib BLAS or set CBLAS_ROOT
-- Could NOT find a CPU Netlib BLAS library, install it or set CBLAS_ROOT
-- Could NOT find CBLAS (missing: CBLAS_INCLUDE_DIRS CBLAS_LIBRARIES)
-- Could NOT find clBLAS nor a CPU BLAS, head-to-head performance comparison not supported in the clients
-- Configuring done
-- Generating done
-- Build files have been written to:/CLBlast/build
make install
Scanning dependencies of target clblast
[ 1%] Building CXX object CMakeFiles/clblast.dir/src/database/database.cpp.o
[ 2%] Building CXX object CMakeFiles/clblast.dir/src/routines/common.cpp.o
[ 3%] Building CXX object CMakeFiles/clblast.dir/src/cache.cpp.o
[ 4%] Building CXX object CMakeFiles/clblast.dir/src/clblast.cpp.o
[ 5%] Building CXX object CMakeFiles/clblast.dir/src/clblast_c.cpp.o
[ 6%] Building CXX object CMakeFiles/clblast.dir/src/routine.cpp.o
[ 7%] Building CXX object CMakeFiles/clblast.dir/src/utilities.cpp.o
[ 8%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xswap.cpp.o
[ 10%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xscal.cpp.o
[ 11%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xcopy.cpp.o
[ 12%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xaxpy.cpp.o
[ 13%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xdot.cpp.o
[ 14%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xdotu.cpp.o
[ 15%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xdotc.cpp.o
[ 16%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xnrm2.cpp.o
[ 17%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xasum.cpp.o
[ 18%] Building CXX object CMakeFiles/clblast.dir/src/routines/level1/xamax.cpp.o
[ 20%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xgemv.cpp.o
[ 21%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xgbmv.cpp.o
[ 22%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xhemv.cpp.o
[ 23%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xhbmv.cpp.o
[ 24%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xhpmv.cpp.o
[ 25%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xsymv.cpp.o
[ 26%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xsbmv.cpp.o
[ 27%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xspmv.cpp.o
[ 28%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xtrmv.cpp.o
[ 30%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xtbmv.cpp.o
[ 31%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xtpmv.cpp.o
[ 32%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xger.cpp.o
[ 33%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xgeru.cpp.o
[ 34%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xgerc.cpp.o
[ 35%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xher.cpp.o
[ 36%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xhpr.cpp.o
[ 37%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xher2.cpp.o
[ 38%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xhpr2.cpp.o
[ 40%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xsyr.cpp.o
[ 41%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xspr.cpp.o
[ 42%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xsyr2.cpp.o
[ 43%] Building CXX object CMakeFiles/clblast.dir/src/routines/level2/xspr2.cpp.o
[ 44%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xgemm.cpp.o
[ 45%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xsymm.cpp.o
[ 46%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xhemm.cpp.o
[ 47%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xsyrk.cpp.o
[ 48%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xherk.cpp.o
[ 50%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xsyr2k.cpp.o
[ 51%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xher2k.cpp.o
[ 52%] Building CXX object CMakeFiles/clblast.dir/src/routines/level3/xtrmm.cpp.o
[ 53%] Building CXX object CMakeFiles/clblast.dir/src/routines/levelx/xomatcopy.cpp.o
Linking CXX shared library libclblast.so
[ 53%] Built target clblast
Scanning dependencies of target test_performance_common
[ 54%] Building CXX object CMakeFiles/test_performance_common.dir/test/performance/client.cpp.o
In file included from /CLBlast/test/performance/client.cpp:14:0:
/CLBlast/test/performance/client.hpp: In function ‘void clblast::RunClient(int, char*)’:
/CLBlast/test/performance/client.hpp:104:23: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make[2]: *** [CMakeFiles/test_performance_common.dir/test/performance/client.cpp.o] Error 1
make[1]: *** [CMakeFiles/test_performance_common.dir/all] Error 2
make: *** [all] Error 2

Is there a way to fix this problem? Or is there a way to get the GEMM kernel runtime of CLBlast without building CLBlast's performance clients?
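
As a workaround for the second question: a single GEMM's device time can be measured without the performance clients by creating the queue with CL_QUEUE_PROFILING_ENABLE and querying the event that clblast::Gemm returns. A minimal sketch (buffer and queue setup omitted); note that CLBlast may enqueue helper kernels before the main GEMM, so the event covers only the final kernel, and wall-clock timing around clWaitForEvents is the safer end-to-end measure:

cl_event event = nullptr;
clblast::Gemm(clblast::Layout::kColMajor,
              clblast::Transpose::kNo, clblast::Transpose::kNo,
              m, n, k, 1.0f,
              a_buf, 0, m, b_buf, 0, k, 0.0f, c_buf, 0, m,
              &queue, &event);
clWaitForEvents(1, &event);

cl_ulong start = 0, end = 0;  // profiling timestamps are in nanoseconds
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, nullptr);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, nullptr);
clReleaseEvent(event);
const double gemm_ms = (end - start) * 1e-6;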
