Wednesday, August 15, 2012

Optimizing performance when using atomic operations

Here is the problem: we have a list of containers, each of them holding a variable number of floats. What we want to achieve is counting the number of containers for which the sum of the float values is greater than a given limit.
The parallelism is done at the container level, meaning that each thread processes one container. The algorithm per thread is simple: walk through the list of floats and sum them up. Once this is done, there is a concurrency issue when it comes to counting the number of containers whose sum exceeds the given limit. The simple solution is to use an atomic add in each thread when the limit condition is true. But can't we be a bit more efficient? Of course we can!
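For reference, here is a minimal sketch of that naive version (the cuda_kernel_naive name is ours, and it assumes a simple 1D launch with one thread per container; the Container structure is the one defined in the full listing below):

__global__ void cuda_kernel_naive( Container* containers, int* result,
                                   float limit, int nbContainers )
{
  int index = blockDim.x*blockIdx.x + threadIdx.x;
  if( index >= nbContainers )
    return;

  // Sum up the elements of this thread's container
  Container p = containers[index];
  float sum = 0;
  for( int i = 0; i<p.nbElements; ++i )
  {
    sum += p.elements[i];
  }

  // One atomic add per qualifying thread: every thread that passes the
  // test contends for the same global counter
  if( sum > limit )
    atomicAdd( result, 1 );
}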
On GPUs, threads are organized in blocks, and memory can be easily and efficiently shared between threads of the same block. So the idea is to use the atomic add at the block level rather than at the thread level. To do so, a shared 2D array is instantiated for each 2D block so that every thread of the block has its own slot in the array. By doing so, we avoid concurrent writes between threads of the same block.
__shared__ int r[BLOCK_SIZE][BLOCK_SIZE];

Each thread from the block can now perform its algorithm and store its result in the shared array.

// Compute the linear index of this thread within the grid.
// Each block handles blockDim.x * blockDim.y consecutive containers.
int index = blockIdx.x * blockDim.x * blockDim.y
          + threadIdx.y * blockDim.x
          + threadIdx.x;
      
// Here comes the business part: sum up the elements of the container
Container p = containers[index];
float sum = 0;
for( int i = 0; i<p.nbElements; ++i )
{
  sum += p.elements[i];
}
r[threadIdx.x][threadIdx.y] = ( sum > limit ) ? 1 : 0;

Once all threads of the block have stored their result and have been synchronized with the CUDA __syncthreads() function, the elements of the array are summed up and the result is added to the global counter with an atomic add. Note that this final step should only be performed by a single thread of the block.

if( threadIdx.x == 0 && threadIdx.y == 0 )  // Only one thread should do this job.
{
  int s = 0;
  for( int x=0; x<BLOCK_SIZE; ++x )
  {
    for( int y=0; y<BLOCK_SIZE; ++y )
    {
      s += r[x][y];
    }
  }
  atomicAdd(result,s);
}

Putting it all together, the full code looks like this (a bounds check is added so that the number of containers does not have to be an exact multiple of the number of threads per block):

#define BLOCK_SIZE 16                      // Example value: 16x16 = 256 threads per block
#define NB_MAX_ELEMENTS_PER_CONTAINER 128  // Example value

struct Container
{
  int nbElements;
  float elements[NB_MAX_ELEMENTS_PER_CONTAINER];
};

__global__ void cuda_kernel( Container* containers, int* result, float limit, int nbContainers )
{
  __shared__ int r[BLOCK_SIZE][BLOCK_SIZE];

  // Compute the linear index of this thread within the grid.
  // Each block handles blockDim.x * blockDim.y consecutive containers.
  int index = blockIdx.x * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x
            + threadIdx.x;

  // Here comes the business part. Out-of-range threads store 0 so that
  // every thread of the block still reaches the __syncthreads() below.
  int hit = 0;
  if( index < nbContainers )
  {
    Container p = containers[index];
    float sum = 0;
    for( int i = 0; i<p.nbElements; ++i )
    {
      sum += p.elements[i];
    }
    hit = ( sum > limit ) ? 1 : 0;
  }

  r[threadIdx.x][threadIdx.y] = hit;

  __syncthreads();

  if( threadIdx.x == 0 && threadIdx.y == 0 )
  {
    int s = 0;
    for( int x=0; x<BLOCK_SIZE; ++x )
    {
      for( int y=0; y<BLOCK_SIZE; ++y )
      {
        s += r[x][y];
      }
    }
    atomicAdd(result,s);
  }
}

extern "C" void cuda_runKernel( unsigned int nbContainers, float limit )
{
  // Run the Kernel
  dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
  int  grid = nbContainers/(block.x*block.y);
  cuda_kernel<<<grid,block>>>( d_containers, d_result, limit );
}
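
The device pointers d_containers and d_result are assumed to be allocated and filled elsewhere. As a minimal sketch of what that host-side setup could look like (the cuda_setup and cuda_getResult helpers are hypothetical, and error checking is omitted for brevity):

#include <cuda_runtime.h>

Container* d_containers = 0;
int*       d_result     = 0;

// Upload the containers to the device and reset the global counter
void cuda_setup( const Container* h_containers, unsigned int nbContainers )
{
  cudaMalloc( (void**)&d_containers, nbContainers*sizeof(Container) );
  cudaMemcpy( d_containers, h_containers,
              nbContainers*sizeof(Container), cudaMemcpyHostToDevice );
  cudaMalloc( (void**)&d_result, sizeof(int) );
  cudaMemset( d_result, 0, sizeof(int) );
}

// Read the counter back once the kernel has completed
int cuda_getResult()
{
  int h_result = 0;
  cudaMemcpy( &h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost );
  return h_result;
}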