Squeezing performance out of CUDA

GPUs are capable of extremely fast computation, but that performance only comes with well-crafted code.

The following are the iterations I went through to squeeze performance out of a CUDA kernel for sparse matrix-vector multiplication in CSR format. Testing was done on a Quadro FX 5800 with CUDA 3.2 and involved multiplying a 52000x52000 sparse matrix with about 1% non-zero entries by a dense vector. All timings listed here are imprecise, and are only meant to give an idea of relative performance.
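For reference, here is the serial CPU version of the computation that the kernels below parallelize. The arrays are the standard CSR arrays, matching the kernel parameters: rows has N+1 entries, and rows[r]..rows[r+1] delimit row r's entries in cols and data. (The function name is mine; the logic mirrors the naive kernel.)

```c
#include <assert.h>

typedef float scalar_t;

/* Serial CSR sparse matrix-vector product: dest = A * v.
 * rows has N+1 entries; rows[r] .. rows[r+1] delimit row r in cols/data. */
static void csr_spmv(const int *rows, const int *cols, const scalar_t *data,
                     const scalar_t *v, scalar_t *dest, int N) {
    for (int row = 0; row < N; row++) {
        scalar_t sum = 0.0f;
        for (int i = rows[row]; i < rows[row + 1]; i++)
            sum += data[i] * v[cols[i]];
        dest[row] = sum;
    }
}
```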

Initial (naive) implementation

__global__ void cuda_mat_mul_kernel(const int *rows,
                                    const int *cols,
                                    const scalar_t *data,
                                    const scalar_t *v,
                                    scalar_t *dest,
                                    int N){
  int row = threadIdx.x + blockIdx.x * blockDim.x;
  if(row >= N)
     return;
  int row_start = rows[row];
  int row_end   = rows[row+1];
  scalar_t sum = 0.0f;
  for(int i = row_start; i < row_end; i++)
     sum += data[i] * v[cols[i]];
  dest[row] = sum;
}

This is a pretty straightforward implementation, similar to what one would write as a serial CPU version. It uses one thread per row, each of which computes its row's dot product and writes the result back to global memory. This kernel takes about 100 milliseconds to run, several times slower than the equivalent CPU code.
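Launching this kernel requires rounding the block count up so that all N rows are covered; the bounds check at the top of the kernel then discards the excess threads. A host-side sketch of that arithmetic (the block size of 256 is an assumption; the launch configuration actually used is not shown in this post):

```c
#include <assert.h>

/* Number of blocks needed to give each of n_rows rows its own thread. */
static int blocks_for(int n_rows, int threads_per_block) {
    return (n_rows + threads_per_block - 1) / threads_per_block;
}

/* Hypothetical launch:
 *   cuda_mat_mul_kernel<<<blocks_for(N, 256), 256>>>(rows, cols, data,
 *                                                    v, dest, N);
 */
```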

Memory coalescing

Perhaps the most important factor in optimizing CUDA performance is to allow memory coalescing when accessing global memory. Coalescing is possible when sequential words of memory are accessed by sequential threads in a warp (thread 0 reads word 0, thread 1 reads word 1, etc.). The kernel below achieves this by assigning multiple threads per row, each accessing words at an offset equal to its thread index within the block. Each thread's partial sum is stored in per-block shared memory; the partial sums are then reduced and the final result written out.
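Concretely, with 32 threads per block, thread tid reads element row_start + k*32 + tid on its k-th loop iteration, so within each iteration the warp's 32 threads touch 32 consecutive words of data and cols. A small CPU model of the index pattern (the helper name is mine):

```c
#include <assert.h>

#define THREADS_PER_BLOCK 32

/* Element index touched by thread `tid` on its k-th iteration of the
 * strided loop: consecutive tids hit consecutive words, which the
 * hardware can coalesce into one memory transaction. */
static int access_index(int row_start, int k, int tid) {
    return row_start + k * THREADS_PER_BLOCK + tid;
}
```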

#define THREADS_PER_BLOCK 32
__global__ void cuda_mat_mul_kernel(const int *rows,
                                    const int *cols,
                                    const scalar_t *data,
                                    const scalar_t *v,
                                    scalar_t *dest,
                                    int N){

  /* 2x size keeps the reduction's reads in bounds */
  __shared__ volatile scalar_t sums[THREADS_PER_BLOCK * 2];
  const int row_start = rows[blockIdx.x];
  const int row_end   = rows[blockIdx.x+1];
  
  scalar_t sum = 0.0f;
  for(int i = row_start + threadIdx.x; i < row_end; i += THREADS_PER_BLOCK){
     sum += data[i] * v[cols[i]];
  }
  
  sums[threadIdx.x] = sum;
  __syncthreads();

  /* hardcoded reduction for 32 threads */
  sums[threadIdx.x] += sums[threadIdx.x + 16];
  sums[threadIdx.x] += sums[threadIdx.x +  8];
  sums[threadIdx.x] += sums[threadIdx.x +  4];
  sums[threadIdx.x] += sums[threadIdx.x +  2];
  sums[threadIdx.x] += sums[threadIdx.x +  1];
  
  if(threadIdx.x == 0)
     dest[blockIdx.x] = sums[0];
}

This kernel allows reads from the data and cols vectors to be coalesced. The reduction works without __syncthreads() calls between each addition because the 32 threads of a warp execute each line in lockstep (hence the volatile qualifier on sums). This is similar to the reduction example in the NVIDIA GPU Computing SDK. This kernel improved performance to about 5 milliseconds.
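The lockstep execution means every thread reads its operand before any thread writes its result, and the 2x-sized shared array keeps the upper lanes' reads in bounds; the garbage those lanes accumulate never propagates into sums[0]. A CPU model of that read-all-then-write-all semantics (names are mine), confirming that sums[0] ends up holding the warp's total:

```c
#include <assert.h>

#define T 32  /* warp size */

/* CPU model of the warp-synchronous reduction: each step reads all
 * operands before writing any result, mirroring lockstep execution.
 * Only sums[0] is meaningful afterwards; lanes above the stride read
 * from the upper half of the array (zeroed here, uninitialized on the
 * GPU), but those values never reach sums[0]. */
static void warp_reduce(float sums[2 * T]) {
    for (int stride = T / 2; stride >= 1; stride /= 2) {
        float read[T];
        for (int t = 0; t < T; t++) read[t] = sums[t + stride];
        for (int t = 0; t < T; t++) sums[t] += read[t];
    }
}
```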

Texture Memory

Although reads of data and cols are now coalesced, accesses to the v vector are still effectively random and cannot be coalesced. Caching v in texture memory makes these scattered reads faster.

texture<scalar_t> texRef;
#define THREADS_PER_BLOCK 32
__global__ void cuda_mat_mul_kernel(const int *rows,
                                    const int *cols,
                                    const scalar_t *data,
                                    const scalar_t *v,
                                    scalar_t *dest,
                                    int N){

  /* 2x size keeps the reduction's reads in bounds */
  __shared__ volatile scalar_t sums[THREADS_PER_BLOCK * 2];
  const int row_start = rows[blockIdx.x];
  const int row_end   = rows[blockIdx.x+1];
  
  scalar_t sum = 0.0f;
  for(int i = row_start + threadIdx.x; i < row_end; i += THREADS_PER_BLOCK){
     //sum += data[i] * v[cols[i]];
     sum += data[i] * tex1Dfetch(texRef, cols[i]);
  }
  
  sums[threadIdx.x] = sum;
  __syncthreads();

  /* hardcoded reduction for 32 threads */
  sums[threadIdx.x] += sums[threadIdx.x + 16];
  sums[threadIdx.x] += sums[threadIdx.x +  8];
  sums[threadIdx.x] += sums[threadIdx.x +  4];
  sums[threadIdx.x] += sums[threadIdx.x +  2];
  sums[threadIdx.x] += sums[threadIdx.x +  1];
  
  if(threadIdx.x == 0)
     dest[blockIdx.x] = sums[0];
}

/* on CPU: bind v to the texture reference before launching the kernel */
size_t offset = size_t(-1);
cudaBindTexture(&offset, texRef, v); /* offset is an output parameter;
                                        it comes back 0 for a pointer
                                        obtained from cudaMalloc */

Using texture memory brings the runtime down to about 3 milliseconds.

Other potential improvements

I implemented several further tweaks, each of which gave a small additional gain; your mileage may vary. Together they brought the runtime down to about 2 milliseconds.

A final word

If you are in fact writing something similar, existing libraries such as cuBLAS and cuSPARSE are likely much faster than what one could create in a reasonable amount of time. The NVIDIA GPU-Accelerated Libraries page is a good place to start. The excellent CUSP C++ library also has a CSR kernel almost equivalent to this one, and is a very convenient library for most sparse matrix applications.