
February 25, 2010

OpenCL Calculation and Reduction

Otherwise known as "calculating a long list of numbers then adding them all up".  For a GPGPU (OpenCL) simulation program at work, I needed to calculate around 160 numbers which would be averaged to produce one result for storage in a 1024x1024 element array.  That's 160 numbers for each of 1024x1024 pixels, which would be a lot to store as an intermediate result for a later step of averaging on the GPU, or (heaven forbid) to be copied back to system memory.

The magic word to search for in tackling this is reduction, and there's plenty of hardcore compsci knowledge about how to make it go as fast as possible in a parallel environment.  But basically, the trick is to have 160x1024x1024 threads operate in groups of 160 (one group for each of the 1024x1024 overall elements).  Threads cooperating like this can share a small block of fast local memory, and each thread writes its individual value to an array in that local memory.  Then, one of the 160 threads adds up all the values and does a single write of the final average value to the global array.  For the kernel to test whether it's running as the "chosen thread" is as simple as something like this:

if ( get_local_id(0) == 0 )

The only bit of "funny business" is that each of the 160 threads has to have finished calculating before the results can be added.  That's done with this statement, which guarantees that all local memory writes made before it have completed for every thread in the work-group:

barrier(CLK_LOCAL_MEM_FENCE);
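Putting those pieces together, the whole kernel looks something like the sketch below.  Names here are hypothetical: compute() stands in for the real per-thread calculation, and the work-group size of 160 is hard-coded as GROUP_SIZE.

```c
#define GROUP_SIZE 160

/* Hypothetical: the real per-thread calculation goes here. */
float compute(size_t element, size_t lane);

__kernel void calc_and_reduce(__global float *out)
{
    /* Scratch space shared by the 160 threads of this work-group. */
    __local float partial[GROUP_SIZE];

    size_t lid = get_local_id(0);   /* 0..159 within this group     */
    size_t gid = get_group_id(0);   /* which of the 1024x1024 elements */

    /* Each thread writes its individual value into local memory. */
    partial[lid] = compute(gid, lid);

    /* Wait until every thread in the group has written its value. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* The "chosen thread" sums the group's values and does the one
       global write. */
    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < GROUP_SIZE; ++i)
            sum += partial[i];
        out[gid] = sum / GROUP_SIZE;
    }
}
```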

This is a really simple example: having one thread do all of the averaging is a waste of resources when the reduction itself could be parallelised.  In that case, one thread would (say) add up values 0-79 while another added up 80-159, then one of those threads would (after another barrier) add the two partial sums together.  It's easy to see how this can be broken down further and further, and there are variations which make better use of the GPU's resources, avoid memory conflicts, and so on.

So, if you'd ever heard of the thread groups and local memory used in OpenCL (also CUDA) and wondered what they were good for, now you know.

NVIDIA's OpenCL Programming Guide has a lot of discussion of this topic, and there's loads more to be found around the web.
