GithubHelp home page GithubHelp logo

cuda-winograd's People

Contributors

xuqiantong avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

cuda-winograd's Issues

Seems Kernel128_winograd.cu is using inner-product, not outer-product

At first, the weight data that been generated is by GWGT, and loaded into a variable that named "weights". The feature map is preprocessed by function "kernel_128_winograd_BtdB()". Both of the two matrices are not transposed in the end.

Within the function "kernel_128_OuterProduct_128()", the MM operation is completed by below sentences:

		for (int j = 0; j < 32; j++) {
			sum += input[y_tmp + j] * kernel[tX + B_stride[j]];
		}
		out[tY*128 + tX] += sum;
		__syncthreads();

The iterator "k" divides the weights matrix to (32, 128), and at each iteration, the input matrix has shape (8*128).
The B_stride[j] is constructed with respect to 128*j, pointing to the first element of each row.
The "y_tmp" is defined by "tY*128 + k*32", which points to the element of each row and further into each 32 elements, because k divides weights into 4*32*128.

Hence we can see that, this is an inner-product. For a specific thread with index of (tY, tX), it generates one elements in result matrix with index of (tY*128+tX). And this result is generated by summing up all the multiplication of each corresponding elements in a row from input, and corresponding elements in a column from weights.

Thanks,

Trying to fix the bug in Kernel128_one.cu.

 65   float *input = get_parameter(inputName128one, 14*14*512);
 66   float *weight = get_parameter(weightName128one, 128*512);

 99   cudaMemcpy(input_, input, nInput<<2, cudaMemcpyHostToDevice);
100   cudaMemcpy(weight_, weight_, nWeights<<2, cudaMemcpyHostToDevice);
101   cudaMemcpy(bnBias_, bnBias_myKernel, 128<<2, cudaMemcpyHostToDevice);
102   cudaMemcpy(bnScale_, bnScale_myKernel, 128<<2, cudaMemcpyHostToDevice);

Generated weight data is loaded at line 66 with ptr name weight, but at line 100, you assign an initialized position to device pointer weight_, which will be all zero.

cudnn part also uses weight_ to compute, so I changed line 100 to:
100 cudaMemcpy(weight_, weight, nWeights<<2, cudaMemcpyHostToDevice);
Then I got this result:

---- Iter: 9 ----
TotalTime = 58 us
cudaSuccess
cuDNN TotalTime = 113 us
cudaSuccess
[max_error: 65141.292969][error_cnt: 18947]
Average Total Time: [Mine: 59 us], [cuDNN: 111 us]

Seems your kernel is not doing the correct calculation.

Please help address the issue, thanks.

about pOutputs index questions in Kernel128_winograd.cu

Hi @xuqiantong . I'm reading your kernel function code kernel_128_winograd_AtIA in the file Kernel128_winograd.cu.

I'm confused with some lines of code with my comment below. The question is labeled Q1,Q2,Q3,Q4.

// Q1: what the Tilex == 3 && Inx > 1 responsible for ?
// I know Inx > 3, because the result dimension is 4 *4 
if (Inx > 3 || (Tilex == 3 && Inx > 1)) return;



int x;
float o;
switch(Iny) {
case 0:
       x = Inx*6;
       o = scale*(input[x]+input[x+1]+input[x+2]+input[x+3]+input[x+4])+ bias;
       // Q2: even Tilex = 0, Tiley = 0, Inx = 0, Iny = 0, kz = 0, the result is stored in index 2176, not 0, why ?
       // what the value stored in index [0~2175] ? 
       pOutputs[ ( ( (Tilex<<2) + 1 + Inx ) * 16 + (Tiley<<2) + 1 ) * 128 + kz] = o > 0 ? o : 0;
       break;
case 1:
        x = Inx*6;
	o = scale*(input[x+1] - input[x+2] + 2*input[x+3] - 2*input[x+4]) + bias;
	pOutputs[ ( ( (Tilex<<2) + 1 + Inx ) * 16 + (Tiley<<2) + 2 ) * 128 + kz] = o > 0 ? o : 0;
        break;
case 2:
        // Q3: what the special of Tiley = 3 ?
	if (Tiley == 3) break;
	x = Inx*6;
	o = scale*(input[x+1] + input[x+2] + 4*input[x+3] + 4*input[x+4]) + bias;
	pOutputs[ ( ( (Tilex<<2) + 1 + Inx ) * 16 + (Tiley<<2) + 3 ) * 128 + kz] = o > 0 ? o : 0;
	break;
case 3:
        // Q4: same as Q3
	if (Tiley == 3) break;
	x = Inx*6;
	o = scale*(input[x+1] - input[x+2] + 8*input[x+3] - 8*input[x+4] + input[x+5]) + bias;
	pOutputs[ ( ( (Tilex<<2) + 1 + Inx ) * 16 + (Tiley<<2) + 4 ) *128 + kz] = o > 0 ? o : 0;
	break;
}

Could you explain about those question ?

license

Can you please add a license to your repository ?

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.