First let me say that <a href="https://developer.nvidia.com/blog/faster-parallel-reduc

Cannot reproduce the results on parallel reduce with shfl (nvidia-developer-blog/code-samples)

Comments (7)

harrism commented on June 3, 2024

Hi @rwbfd, the post does not use CUB, so I'm not sure how your compilation errors with CUB are relevant to the post?

from code-samples.

lix19937 commented on June 3, 2024

@rwbfd nvcc -O3 main.cu -o reduce -arch=sm_35
The CUB version is cub-93696c4bce447b71c4bd0b25d1e26f1247341c04: https://github.com/NVLabs/cub/tree/93696c4bce447b71c4bd0b25d1e26f1247341c04

harrism commented on June 3, 2024

Note the About notice on that page, which indicates that you are looking at a very old version of CUB. CUB is now part of the CUDA Toolkit, and lives here: https://github.com/NVIDIA/cub

lix19937 commented on June 3, 2024

@harrism, I would like to confirm two things:
Question 1:
If two threads from two different warps access two different addresses in the same shared memory bank, does that cause a bank conflict?

Question 2:
If bank conflicts between different warps do exist, is it correct that they do not cause serious latency and can be ignored?
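
A small host-side model may help frame both questions. It is only a sketch, under the assumption of 32 banks of 4 bytes each, and it relies on the CUDA programming guide's definition of a bank conflict as something that happens between addresses of one memory request, that is, one warp executing one instruction. The names `bank_of` and `passes_for_request` are made up for this illustration and are not part of any CUDA API.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Assumption: 32 banks, each 4 bytes wide.
constexpr int kNumBanks = 32;

// Bank that holds a given 4-byte word index of shared memory.
int bank_of(int word_index) { return word_index % kNumBanks; }

// Number of passes needed to serve ONE warp-level request (hypothetical helper).
// Threads that read the same word are served by a broadcast, so only
// distinct words within a bank are counted.
int passes_for_request(const std::vector<int>& word_indices) {
  std::array<std::set<int>, kNumBanks> words_per_bank;
  for (int w : word_indices) {
    words_per_bank[bank_of(w)].insert(w);
  }
  std::size_t worst = 1;
  for (const auto& words : words_per_bank) {
    worst = std::max(worst, words.size());
  }
  return static_cast<int>(worst);
}
```

In this model, word 0 and word 32 both live in bank 0. If the two accesses come from the same warp, `passes_for_request({0, 32})` gives 2, a 2-way conflict. If they come from two different warps, they belong to two separate requests, `passes_for_request({0})` and `passes_for_request({32})`, and each gives 1. Under that reading there is no conflict between warps to measure; the two requests are simply issued one after the other.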

lix19937 commented on June 3, 2024

For the kernel below, the Memory Workload Analysis section in ncu reports a total of 6 bank conflicts.

const int BLOCK_DIM{32};

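// Launched as kernel2<<<grid, block>>> with grid(2, 1) and block(32, 1).
template <typename T = float>
__global__ void kernel2(const T* in, T* out) {
  __shared__ T shm[BLOCK_DIM * 2 * 4];

  // Global linear thread index (0..63 for this launch), not the index within the block.
  auto tid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
  // printf("blockid %d  tid %d\n", blockIdx.x, tid);
  shm[tid] = in[tid];  // stride-1 store: each thread of the warp hits a different bank

  __syncthreads();
  // Stride-4 load: threads 8 apart within a warp map to the same bank.
  // Only shm[tid] was written by this block, so most of the words read here are uninitialized.
  out[tid] = shm[tid * 4];
}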

template <typename T= float>
int transpose(const T* in, T* out) {
  dim3 grid(2, 1);
  dim3 block(BLOCK_DIM, 1);

  kernel2<T><<<grid, block>>>(in, out);

  CheckCudaErrors(cudaPeekAtLastError());
  CheckCudaErrors(cudaDeviceSynchronize());
  return 0;
}

tid	shm word index (all in bank 0)
0	0
8	32
16	64
24	96

Threads tid0, tid8, tid16 and tid24 land on different memory addresses in the same bank (bank 0), so there is a bank conflict; I think the count is 3.

The same holds for the other groups of threads in the warp:
tid1, tid9, tid17, tid25; tid2, tid10, tid18, tid26; ... tid7, tid15, tid23, tid31

Why does ncu's Memory Workload Analysis report a total of 6 bank conflicts?
How is this number obtained? @harrism
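
One plausible accounting, which nobody in this thread confirms, is that the counter is summed over the whole launch rather than reported per warp. The host-side sketch below models the read `shm[tid * 4]` under the assumption of 32 banks of 4 bytes each and a warp size of 32. The function names are hypothetical and only serve this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Assumptions: 32 banks of 4 bytes each, 32 threads per warp.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Extra passes (beyond the first) for the read shm[tid * stride] issued by
// the warp whose first thread has global index first_tid.
int extra_passes_for_strided_read(int first_tid, int stride) {
  std::array<std::set<int>, kNumBanks> words_per_bank;
  for (int lane = 0; lane < kWarpSize; ++lane) {
    const int word = (first_tid + lane) * stride;  // 4-byte word index in shm
    words_per_bank[word % kNumBanks].insert(word);
  }
  std::size_t worst = 1;
  for (const auto& words : words_per_bank) {
    worst = std::max(worst, words.size());
  }
  return static_cast<int>(worst) - 1;
}

// Sum over a launch of num_blocks blocks with one warp each,
// as in grid(2, 1), block(32, 1).
int total_extra_passes(int num_blocks, int stride) {
  int total = 0;
  for (int b = 0; b < num_blocks; ++b) {
    total += extra_passes_for_strided_read(b * kWarpSize, stride);
  }
  return total;
}
```

With a stride of 4, each warp touches only 8 banks, and each of those banks receives 4 distinct words, so `extra_passes_for_strided_read(0, 4)` is 3, matching the estimate above. The launch has two blocks of one warp each, so `total_extra_passes(2, 4)` is 6. The stride-1 store `shm[tid] = in[tid]` contributes nothing in this model, since `total_extra_passes(2, 1)` is 0. Whether ncu counts conflicts exactly this way is an assumption.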

harrism commented on June 3, 2024

@lix19937 GitHub issues are not a help forum. Please ask your questions on Stack Overflow or https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-programming-and-performance/7

lix19937 commented on June 3, 2024

@harrism Many thanks!
