merinjo / parallel-matrix-multiply
In this parallel strategy, the rows and columns are each partitioned into 4, so matrices A and B are each split into a 4 x 4 grid of blocks. For the 512-wide matrices, the tile width is 512/4 = 128. One block of A and one block of B are brought into shared memory at a time, and all the threads compute on that data. 64 threads work on a block in parallel, each handling 128/64 = 2 columns of the block of B. The block sequence is: first block of first row of C = first block of first row of C + (first block of first row of A * first block of first column of B) + (second block of first row of A * second block of first column of B) + ... + (fourth block of first row of A * fourth block of first column of B). This technique exploits both spatial and temporal locality: spatial because adjacent data within a block is used together, and temporal because the same block is reused while it is in shared memory.
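The block sequence above can be sketched in a few lines. This is a minimal sequential illustration in Python, not the repository's actual implementation: the function name `blocked_matmul` and its parameters `n`, `grid`, and `workers` are hypothetical, and the loop over `w` only stands in for the 64 threads by running each thread's share of columns one after another.

```python
def blocked_matmul(A, B, n=512, grid=4, workers=64):
    """Multiply two n x n matrices (lists of lists) block by block.

    Hypothetical sketch: `grid` is the number of blocks per dimension and
    `workers` is the number of threads assumed to share one block.
    """
    if n % grid != 0:
        raise ValueError("n must be divisible by grid")
    tile = n // grid                    # 512 / 4 = 128 with the defaults
    if tile % workers != 0:
        raise ValueError("tile width must be divisible by workers")
    cols_per_worker = tile // workers   # 128 / 64 = 2 with the defaults

    C = [[0.0] * n for _ in range(n)]
    for bi in range(grid):              # block row of C
        for bj in range(grid):          # block column of C
            for bk in range(grid):      # C[bi][bj] += A[bi][bk] * B[bk][bj]
                # In a parallel version the workers would run concurrently
                # on this pair of blocks; here they run one after another.
                for w in range(workers):
                    j0 = bj * tile + w * cols_per_worker
                    for j in range(j0, j0 + cols_per_worker):
                        for i in range(bi * tile, (bi + 1) * tile):
                            acc = 0.0
                            for k in range(bk * tile, (bk + 1) * tile):
                                acc += A[i][k] * B[k][j]
                            C[i][j] += acc
    return C
```

Calling `blocked_matmul(A, B, n=8, grid=4, workers=2)` on small inputs exercises the same block order with 2 x 2 tiles. The defaults match the sizes described above, but pure Python is slow at that scale.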
License: MIT License