GithubHelp home page GithubHelp logo

Comments (12)

MoZeWei avatar MoZeWei commented on July 20, 2024

Also, is there any software I need to install? I had rocm-4.3.0 before, but I didn't have rocblas-4.3.0. So I compiled this code after I installed rocblas-4.3.0 and rocsolver-4.3.0, are there any softwares I didn't install which cause such a problem?

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

Here is a simple reproducation(headers included are the same as those in code.zip):

#define HIP_ASSERT(x) (assert((x)==hipSuccess))



#define HIP_CHECK(x)                        \
    if(x != hipSuccess)                     \
    {                                       \
        std::cout<<"Wrong HIP"<<std::endl;  \
        exit(1);                            \
    }

#define HIP_BLAS_CHECK(x)                           \
    if(x != HIPBLAS_STATUS_SUCCESS)                 \
    {                                               \
        std::cout<<"Wrong HIP BLAS: ";              \
        exit(1);                                    \
    }

#define HIP_LASTERROR()                     \
    {                                       \
        std::cout<<"ERROR: "<<hipGetErrorString(hipGetLastError())<<std::endl;  \
    }

char *ReadFile(const std::string &filePath, size_t &fileSize, void *buffer, size_t bufferSize)
{
    struct stat sBuf;
    int fileStatus = stat(filePath.data(), &sBuf);
    if (fileStatus == -1) {
        std::cout<<"failed to get file"<<std::endl;
        return nullptr;
    }
    if (S_ISREG(sBuf.st_mode) == 0) {
        std::cout<<filePath.c_str()<<" is not a file, please enter a file"<<std::endl;
        return nullptr;
    }

    std::ifstream file;
    file.open(filePath, std::ios::binary);
    if (!file.is_open()) {
        std::cout<<"Open file failed. path "<<filePath.c_str()<<std::endl;
        return nullptr;
    }

    std::filebuf *buf = file.rdbuf();
    size_t size = buf->pubseekoff(0, std::ios::end, std::ios::in);
    if (size == 0) {
        std::cout<<"file size is 0"<<std::endl;
        file.close();
        return nullptr;
    }
    if (size > bufferSize) {
        std::cout<<"file size = "<<size<<" is larger than buffer size = "<<bufferSize<<std::endl;
        file.close();
        return nullptr;
    }
    buf->pubseekpos(0, std::ios::in);
    buf->sgetn(static_cast<char *>(buffer), size);
    fileSize = size;
    file.close();
    return static_cast<char *>(buffer);
}

bool WriteFile(const std::string &filePath, const void *buffer, size_t size)
{
    if (buffer == nullptr) {
        //ERROR_LOG("Write file failed. buffer is nullptr");
        return false;
    }

    FILE *outputFile = fopen(filePath.c_str(), "wb");
    if (outputFile == nullptr) {
        //ERROR_LOG("Open file failed. path = %s", filePath.c_str());
        return false;
    }

    fwrite(buffer, size, sizeof(char), outputFile);
    fclose(outputFile);

    return true;
}


int main(int argc, char** argv) {


    //size_t M = 1024;
    //size_t K = 1024;
    //size_t N = 1024;
    size_t M = 8192;
    size_t K = 8192;
    size_t N = 8192;

    //host data
    size_t fp16_bytes = sizeof(hipblasHalf) * M * N;
    size_t fp32_bytes = sizeof(float) * M * N;
    unsigned short * fp16_h_A = (unsigned short *)malloc(fp16_bytes);
    unsigned short * fp16_h_B = (unsigned short *)malloc(fp16_bytes);
    unsigned short * fp16_h_C = (unsigned short *)malloc(fp16_bytes);
    float * fp32_h_C = (float*)malloc(fp32_bytes);

    //device data
    //fp32 
    float * fp32_d_C;

    //fp16
    hipblasHalf * fp16_d_A;
    hipblasHalf * fp16_d_B;
    hipblasHalf * fp16_d_C;



    std::cout<<"Begin to malloc"<<std::endl;


    HIP_ASSERT(hipMalloc((void**)&fp16_d_A, fp16_bytes));
    HIP_ASSERT(hipMalloc((void**)&fp16_d_B, fp16_bytes));
    HIP_ASSERT(hipMalloc((void**)&fp16_d_C, fp16_bytes));
    HIP_ASSERT(hipMalloc((void**)&fp32_d_C, fp32_bytes));
    
    HIP_LASTERROR();

    double msecPerMatrixMul[3] = {0, 0, 0};
    double gigaFlops[3] = {0, 0, 0};
    double flopsPerMatrixMul = 2.0 * M * N * K;

    size_t fileSize;
    char * fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_a.bin",fileSize,fp16_h_A,fp16_bytes);
    if(fileData == nullptr)
    {
        std::cout<<"matrix A read file fail"<<std::endl;
        exit(1);
    }
    fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_b.bin",fileSize,fp16_h_B,fp16_bytes);
    if(fileData == nullptr)
    {
        std::cout<<"matrix B read file fail"<<std::endl;
        exit(1);
    }
    fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_c.bin",fileSize,fp16_h_C,fp16_bytes);
    if(fileData == nullptr)
    {
        std::cout<<"matrix C read file fail"<<std::endl;
        exit(1);
    }
    fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp32matrix_c.bin",fileSize,fp32_h_C,fp32_bytes);
    if(fileData == nullptr)
    {
        std::cout<<"matrix B read file fail"<<std::endl;
        exit(1);
    }

    HIP_ASSERT(hipMemcpy(fp16_d_A,fp16_h_A,fp16_bytes,hipMemcpyHostToDevice));
    HIP_ASSERT(hipMemcpy(fp16_d_B,fp16_h_B,fp16_bytes,hipMemcpyHostToDevice));
    //HIP_ASSERT(hipMemcpy(fp16_d_C,fp16_h_C,fp16_bytes,hipMemcpyHostToDevice));
    HIP_ASSERT(hipMemcpy(fp32_d_C,fp32_h_C,fp32_bytes,hipMemcpyHostToDevice));

    std::cout<<"original host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;
    HIP_ASSERT(hipMemcpy(fp16_h_A,fp16_d_A,fp16_bytes,hipMemcpyDeviceToHost));
    std::cout<<"translated host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;

    HIP_LASTERROR();

    hipEvent_t start,stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    float msecTotal = 0;
    int nIter = 1;
    std::cout<<"begin GEMM"<<std::endl;
	
    // FP32 GEMM
    hipblasHandle_t blas_handle;  
    hipblasCreate(&blas_handle);

    //FP16
    float  fp32_alpha = 1; 
    float fp32_beta = 1;
    hipblasHandle_t fp16blas_handle;  
    hipblasCreate(&fp16blas_handle);
    hipblasHalf fp16_alpha = static_cast<hipblasHalf>(1);
    hipblasHalf fp16_beta = static_cast<hipblasHalf>(1);
    hipblasHalf * device_fp16_alpha = nullptr;
    hipblasHalf * device_fp16_beta = nullptr;
    HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
    HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
    HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
    HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));

    HIP_LASTERROR();
    
    HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
					            M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
					            fp16_d_C,M));
    hipDeviceSynchronize();
    
    HIP_LASTERROR();

    hipEventRecord(start);
    for (int run = 0 ; run < nIter; run ++ ) {
        /*
        //fp16->fp16
		HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
					                M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
					                fp16_d_C,M));
        */
        HIP_BLAS_CHECK(hipblasGemmEx(blas_handle, HIPBLAS_OP_T, HIPBLAS_OP_T,
                                    M,N,K, &fp32_alpha,
                                    fp16_d_A, HIPBLAS_R_16F, M,
                                    fp16_d_B, HIPBLAS_R_16F, K,
                                    &fp32_beta, fp32_d_C, HIPBLAS_R_32F, M,
                                    HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT));
		
	}
    hipEventRecord(stop);
    hipEventSynchronize(stop);
    hipEventElapsedTime(&msecTotal, start, stop);
    
    HIP_LASTERROR();
    
    msecPerMatrixMul[1] = msecTotal / nIter;
    gigaFlops[1] = (flopsPerMatrixMul * 1.0e-9f) / (msecPerMatrixMul[1] / 1000.0f);
    printf( "CuBlas FP16 Performance= %.2f GFlop/s, Time= %.3f msec, Size= %.0f Ops,\n",
        gigaFlops[1],
        msecPerMatrixMul[1],
        flopsPerMatrixMul);



    hipblasDestroy(blas_handle); 
    hipblasDestroy(fp16blas_handle); 

    // Free Memory
    hipFree(fp16_d_A);
    hipFree(fp16_d_B);
    hipFree(fp16_d_C);

    
    free(fp16_h_A);
    free(fp16_h_B);
    free(fp16_h_C);
}

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

I implemented a cpu-version function to check the error, but I found the error between GemmEx of fp16 and self-implemented fp16gemm is too big. I cannot figure out why, please provide some suggestions, thank you!
10_4 checking.zip

from hipblas.

daineAMD avatar daineAMD commented on July 20, 2024

Hi there, thanks for raising this issue.

It seems to me that the problems lies with alpha/beta initialization. When running the sample with ROCBLAS_LAYER=2 it appears that both alpha and beta are not equal to 1 as expected, but rather 0.

I changed alpha and beta in the zipped code to be of type _Float16, casting to const hipblasHalf* when passing into hipblasGemmEx, and it seemed to work as expected. Please let me know if this solution works for you as well.

Thanks,
Daine

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

Hi there, thanks for raising this issue.

It seems to me that the problems lies with alpha/beta initialization. When running the sample with ROCBLAS_LAYER=2 it appears that both alpha and beta are not equal to 1 as expected, but rather 0.

I changed alpha and beta in the zipped code to be of type _Float16, casting to const hipblasHalf* when passing into hipblasGemmEx, and it seemed to work as expected. Please let me know if this solution works for you as well.

Thanks, Daine

Thanks for your reply. In fact, I found out that if I called GemmEx of fp16/fp16 with float alpha and beta as well as both HIPBLAS_OP_N, the result is right and performance is like 130 TFLOPS under problem size of m=n=k=16384. But if without satisfying above condition, the performance varies a lot.
image

An experiment calculating average per-time performance of running GEMM 100 times under different settings.

And also, like your solution, I tried the code below a long time ago:

    
    hipblasHalf * device_fp16_alpha = nullptr;
    hipblasHalf * device_fp16_beta = nullptr;
    HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
    HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
    HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
    HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));
    
    HIP_LASTERROR();
    float Peak_ms = 1000000;
    for (int run = 0 ; run < nIter; run ++ ) {
        
        //fp16->fp16
		HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
					                M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
					                fp16_d_C,M));
     }

It doesn't produce right result too. In my opinion, this code should work like your solution. I used to transport the alpha/beta value on device to host and check their value, they are 1.0, but when passing to hipblasGemm, they become 0. Or maybe can you provide a small code example to reproduce what you said? Thank you.

So I will really appreciate that AMD can provide a detailed guide to use hipblasGemm, because as the way I see, these problems are caused by the non-full software support in rocBlas.

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

I will make a tool library today to run all these test in a more convenient way, which combines running Gemm and result checking function. Maybe in this way we can produce a table telling people how we should use hipblasGEMM in a right way.

from hipblas.

TorreZuk avatar TorreZuk commented on July 20, 2024

Just some comments, your code pasted above seems buggy:

It passes device pointers as arguments for alpha/beta (addresses on GPU) without setting:
status = hipblasSetPointerMode(fp16blas_handle, HIPBLAS_POINTER_MODE_DEVICE);
as the default would be as if you set:
status = hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST);
You seem to initialize two handles differently and not clear if you are using the one you intended, simpler to use
one handle and reconfigure it if required for the secondary call as you are in a single thread.

Your numeric anlaysis code for f32 notes row and col major but divides by a different indexing scheme, maybe you intended column major and made a mistake. Nor is this stable and useful where test data approaches zero.
float cur_err = cm_matrix[jm+i] - rm_matrix[in+j];
if(abs(cur_err) > abs(max_err)) max_err = abs(cur_err);
cur_err /= rm_matrix[i*m+j]; << here the indexing is not row nor col major

Thus please try using a single handle configured for how you will pass arguments (host or device mode) and compare results against CPU reference code of either LAPACK, BLIS, or OpenBLAS which are ones we test against. This can rule out bugs in your kernel implementation. You made a good point we should have more examples, note some additional examples for the underlying rocBLAS library are at: https://github.com/ROCmSoftwarePlatform/rocBLAS-examples also the explanation of handles from rocBLAS is at https://rocblas.readthedocs.io/en/rocm-4.3.1/

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

Just some comments, your code pasted above seems buggy:

It passes device pointers as arguments for alpha/beta (addresses on GPU) without setting: status = hipblasSetPointerMode(fp16blas_handle, HIPBLAS_POINTER_MODE_DEVICE); as the default would be as if you set: status = hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST); You seem to initialize two handles differently and not clear if you are using the one you intended, simpler to use one handle and reconfigure it if required for the secondary call as you are in a single thread.

Your numeric anlaysis code for f32 notes row and col major but divides by a different indexing scheme, maybe you intended column major and made a mistake. Nor is this stable and useful where test data approaches zero. float cur_err = cm_matrix[j_m+i] - rm_matrix[i_n+j]; if(abs(cur_err) > abs(max_err)) max_err = abs(cur_err); cur_err /= rm_matrix[i*m+j]; << here the indexing is not row nor col major

Thus please try using a single handle configured for how you will pass arguments (host or device mode) and compare results against CPU reference code of either LAPACK, BLIS, or OpenBLAS which are ones we test against. This can rule out bugs in your kernel implementation. You made a good point we should have more examples, note some additional examples for the underlying rocBLAS library are at: https://github.com/ROCmSoftwarePlatform/rocBLAS-examples also the explanation of handles from rocBLAS is at https://rocblas.readthedocs.io/en/rocm-4.3.1/

Thank you for your reply. Right, your suggestion about pointer setting is something I missed before, thank you for that. As for the cur_err /= rm_matrix[i*m+j]; << here the indexing is not row nor col major, considering in my cases M = N = K is always true, so this does not matter.

As for this

Nor is this stable and useful where test data approaches zero.

I totally agree with you, and I implemented the version without using abs(), it works fine.

I need some time to work on the whole new test. Thank you.

from hipblas.

daineAMD avatar daineAMD commented on July 20, 2024

I've added an hgemm example at example_hgemm.cpp which can be compiled with hipcc; I hope that helps as well.

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

I've added an hgemm example at example_hgemm.cpp which can be compiled with hipcc; I hope that helps as well.

Thank you so much for that. It did help me.

from hipblas.

daineAMD avatar daineAMD commented on July 20, 2024

Hi @MoZeWei, just wondering if your problem is resolved so I can go ahead with closing this issue, thanks.

from hipblas.

MoZeWei avatar MoZeWei commented on July 20, 2024

@daineAMD Thanks for your reply. Due to the change of env of my server, this problem should not be discussed anymore.

from hipblas.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.