xuqiantong / cuda-winograd
Fast CUDA Kernels for ResNet Inference.
First, the weight data is generated by GWGT and loaded into a variable named "weights". The feature map is preprocessed by the function "kernel_128_winograd_BtdB()". Neither matrix is transposed at the end.
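For context, these function names appear to map onto the standard Winograd convolution, where each output tile is computed as

```latex
Y = A^\top \left[ \left( G g G^\top \right) \odot \left( B^\top d B \right) \right] A
```

with g the filter, d the input tile, and the dot denoting elementwise multiplication; GWGT would precompute G g G^T for the weights, and kernel_128_winograd_BtdB() would compute B^T d B for the feature map. This mapping is an assumption based on the function names, not something stated by the author.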
Within the function "kernel_128_OuterProduct_128()", the matrix multiplication is performed by the following lines:
```cuda
for (int j = 0; j < 32; j++) {
    sum += input[y_tmp + j] * kernel[tX + B_stride[j]];
}
out[tY*128 + tX] += sum;
__syncthreads();
```
The iterator "k" splits the weights matrix into (32, 128) chunks, and at each iteration the input matrix has shape (8, 128).
B_stride[j] is constructed as 128*j, pointing to the first element of each row.
"y_tmp" is defined as "tY*128 + k*32", which points into a given row and then to the start of each 32-element chunk, because k divides the weights into 4*32*128.
Hence we can see that this is an inner product. A thread with index (tY, tX) produces one element of the result matrix, at index tY*128 + tX, by summing the products of corresponding elements from a row of the input and a column of the weights.
Thanks,
```cpp
65  float *input = get_parameter(inputName128one, 14*14*512);
66  float *weight = get_parameter(weightName128one, 128*512);
99  cudaMemcpy(input_, input, nInput<<2, cudaMemcpyHostToDevice);
100 cudaMemcpy(weight_, weight_, nWeights<<2, cudaMemcpyHostToDevice);
101 cudaMemcpy(bnBias_, bnBias_myKernel, 128<<2, cudaMemcpyHostToDevice);
102 cudaMemcpy(bnScale_, bnScale_myKernel, 128<<2, cudaMemcpyHostToDevice);
```
The generated weight data is loaded at line 66 into the host pointer weight, but at line 100 the cudaMemcpy copies the device pointer weight_ onto itself, so the device buffer stays uninitialized (all zero).
The cuDNN path also uses weight_ in its computation, so I changed line 100 to:
100 cudaMemcpy(weight_, weight, nWeights<<2, cudaMemcpyHostToDevice);
Then I got this result:
---- Iter: 9 ----
TotalTime = 58 us
cudaSuccess
cuDNN TotalTime = 113 us
cudaSuccess
[max_error: 65141.292969][error_cnt: 18947]
Average Total Time: [Mine: 59 us], [cuDNN: 111 us]
It seems your kernel is not doing the correct calculation.
Please help address the issue, thanks.
Hi @xuqiantong. I'm reading your kernel function kernel_128_winograd_AtIA in the file Kernel128_winograd.cu.
I'm confused by some lines of code; my questions are in the comments below, labeled Q1 through Q4.
```cuda
// Q1: what is the condition Tilex == 3 && Inx > 1 responsible for?
// I understand Inx > 3, because the result dimension is 4*4
if (Inx > 3 || (Tilex == 3 && Inx > 1)) return;

int x;
float o;
switch (Iny) {
case 0:
    x = Inx*6;
    o = scale*(input[x] + input[x+1] + input[x+2] + input[x+3] + input[x+4]) + bias;
    // Q2: even with Tilex = 0, Tiley = 0, Inx = 0, Iny = 0, kz = 0,
    // the result is stored at index 2176, not 0. Why?
    // And what values are stored at indices [0..2175]?
    pOutputs[(((Tilex<<2) + 1 + Inx)*16 + (Tiley<<2) + 1)*128 + kz] = o > 0 ? o : 0;
    break;
case 1:
    x = Inx*6;
    o = scale*(input[x+1] - input[x+2] + 2*input[x+3] - 2*input[x+4]) + bias;
    pOutputs[(((Tilex<<2) + 1 + Inx)*16 + (Tiley<<2) + 2)*128 + kz] = o > 0 ? o : 0;
    break;
case 2:
    // Q3: what is special about Tiley = 3?
    if (Tiley == 3) break;
    x = Inx*6;
    o = scale*(input[x+1] + input[x+2] + 4*input[x+3] + 4*input[x+4]) + bias;
    pOutputs[(((Tilex<<2) + 1 + Inx)*16 + (Tiley<<2) + 3)*128 + kz] = o > 0 ? o : 0;
    break;
case 3:
    // Q4: same as Q3
    if (Tiley == 3) break;
    x = Inx*6;
    o = scale*(input[x+1] - input[x+2] + 8*input[x+3] - 8*input[x+4] + input[x+5]) + bias;
    pOutputs[(((Tilex<<2) + 1 + Inx)*16 + (Tiley<<2) + 4)*128 + kz] = o > 0 ? o : 0;
    break;
}
```
Could you explain those questions?
Can you please add a license to your repository?
Thanks