I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has a uni-dimensional representation (a pointer to a float). The serial version of the code is below (it has 2 loops, as expected):

```cpp
// Serial reference: sum the elements of each row of a row-major
// nrow x ncol matrix stored as a flat float array.
void serial_rowSum(float *m, float *output, int nrow, int ncol) {
    for (int i = 0; i < nrow; i++) {
        float sum = 0.f;
        for (int j = 0; j < ncol; j++)
            sum += m[i * ncol + j];
        output[i] = sum;
    }
}
```

I also need the reduction to work with operations other than sum, e.g. logical values like rowAND/rowOR. That said, a generic operation doesn't allow me to exploit the cuBLAS trick of multiplying by a column vector of COL 1's, as suggested by some commentators.

EDIT 2

As suggested by other users and endorsed here:

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use the Thrust library instead and the magic comes.

Answer:

Since you mentioned you need a general reduction algorithm, not sum only:

- The kernel approach may have the highest performance.
- The cuBLAS approach works only with sum, but has good performance.

Here's a very good doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:

1. Multiple thread blocks each reduce one part of the data.
2. One thread block reduces the results of stage 1 down to the final single element.

For your multi-reduction problem (reducing the rows of mat), stage 1 alone is enough. The idea is to reduce one row per thread block. For further considerations, like multiple rows per thread block or one row per multiple thread blocks, you can refer to the paper provided by another user. This may improve the performance further, especially for matrices with a bad shape.

A general multi-reduction can be written with thrust::reduce_by_key in a few minutes. You can find some discussion in Determining the least element and its position in each matrix column with CUDA Thrust. However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between the thrust::reduce_by_key and cuBLAS approaches for summing rows; it may give you a basic understanding of the performance.
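The one-row-per-thread-block idea from stage 1 can be sketched as a CUDA kernel. This is a minimal sketch, not code from the original post; the kernel name, block size, and hard-coded addition are illustrative assumptions (for a generic operation such as rowAND, the `+=` would become the chosen operator):

```cuda
// Stage-1-style reduction: one thread block reduces one row of a
// row-major nrow x ncol matrix. blockIdx.x selects the row.
__global__ void rowSumKernel(const float *m, float *output, int ncol) {
    extern __shared__ float sdata[];
    const float *row = m + (size_t)blockIdx.x * ncol;

    // Each thread accumulates a partial sum over the row elements
    // it owns (stride = blockDim.x).
    float acc = 0.f;
    for (int j = threadIdx.x; j < ncol; j += blockDim.x)
        acc += row[j];
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power
    // of two for this simple halving scheme.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        output[blockIdx.x] = sdata[0];
}

// Launch: one block per row, e.g.
// rowSumKernel<<<nrow, 256, 256 * sizeof(float)>>>(d_m, d_out, ncol);
```

Because each block handles exactly one row, no second stage (and no inter-block synchronization) is needed.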
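The thrust::reduce_by_key route can be sketched as follows; this is an assumed implementation in the spirit of the linked posts, not code taken from them. The key of element k of a row-major matrix is its row index k / ncol, built with a transform iterator over a counting iterator, so each run of equal keys is one row:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Row sums via reduce_by_key: keys are the row indices 0,0,...,1,1,...
void thrust_rowSum(const thrust::device_vector<float> &m,
                   thrust::device_vector<float> &output,  // size nrow
                   int nrow, int ncol) {
    using namespace thrust::placeholders;
    thrust::reduce_by_key(
        thrust::make_transform_iterator(
            thrust::counting_iterator<int>(0), _1 / ncol),
        thrust::make_transform_iterator(
            thrust::counting_iterator<int>(nrow * ncol), _1 / ncol),
        m.begin(),
        thrust::make_discard_iterator(),  // the keys output is not needed
        output.begin());
}
```

To go beyond sum, reduce_by_key also accepts a comparison and a reduction functor as its last two arguments, e.g. `thrust::equal_to<int>()` and `thrust::maximum<float>()` for a row-max; this is what makes it a general multi-reduction.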
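The cuBLAS "multiply by a column vector of 1's" trick mentioned in the question can be sketched with cublasSgemv. Since cuBLAS is column-major, a row-major nrow x ncol matrix is seen by it as an ncol x nrow matrix, so the row sums come out of the transposed op. A minimal sketch, assuming the cuBLAS v2 API and already-allocated device buffers (the function and variable names are illustrative):

```cuda
#include <cublas_v2.h>

// Row sums of a row-major nrow x ncol matrix d_m:
// in cuBLAS's column-major view d_m is ncol x nrow, so
// op(A) = A^T (nrow x ncol) times a length-ncol vector of
// ones yields the nrow row sums.
void cublas_rowSum(cublasHandle_t handle, const float *d_m,
                   const float *d_ones,  // length ncol, filled with 1.0f
                   float *d_out,         // length nrow
                   int nrow, int ncol) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,          // dimensions of the column-major view
                &alpha, d_m, ncol,   // lda = ncol
                d_ones, 1,
                &beta, d_out, 1);
}
```

As noted above, this only computes sums: gemv cannot express rowAND/rowOR-style reductions, which is why the kernel and Thrust approaches are the general ones.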