Friday, July 8, 2011

Massively parallel programming and the GPGPU

I don't often delve into the deep technicalities of programming, but I was running a model that took four days to finish and decided there must be a better way. The model was already optimised, running in parallel and so on, so there was nothing for it but to have a go at using a graphics card to do the computation.

A quick Google suggested mixed views. Clearly, in the best cases speedups of 40x or more are possible, but only after some pretty complex optimization (exactly which memory you use on the card and how you access it seems to play a big part in the gains you get). Still, four days was not really satisfactory, so it seemed worth a try.

Being a die-hard .NET user and not really willing to learn anything new, a bit more Googling revealed a number of potential ways of automatically converting .NET code to work with the graphics card. I had a look at three or four different methods, and the combination I chose was CUDA running on an Nvidia card (by far the best supported GPGPU platform as far as I can tell) with Cudafy (free, available from http://www.hybriddsp.com/Products/CUDAfyNET.aspx), as it seemed the most elegant approach.

The effort wasn't too bad. Cudafy automatically translates .NET code into something that can run on the graphics card, so all you need to do is figure out how to get Cudafy to work. It took about a day (I'd like to see a lot more documentation), but it worked pretty much as claimed. Various little issues took a bit of time to uncover, such as integer overflow checking being switched on in the Visual Studio compiler when it needs to be switched off for generating the graphics code, but now they are sorted they will remain sorted. Hats off to Hybrid DSP for producing a very easy to use product.
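For anyone curious what "getting Cudafy to work" actually involves, the host-side setup is only a few lines. This is a minimal sketch based on the library's documented pattern rather than my own model code; the translator call, device lookup and module load are the standard Cudafy calls as far as I can tell, so treat the exact overloads as indicative and check the documentation.

    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

    class Program
    {
        static void Main()
        {
            // Translate the [Cudafy]-tagged methods in this assembly
            // into CUDA code and compile them for the card.
            CudafyModule km = CudafyTranslator.Cudafy();

            // Grab the target device and load the compiled module onto it.
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            gpu.LoadModule(km);

            // ...allocate device memory, launch kernels, copy results back...
        }
    }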

Converting your code is also pretty trivial, provided it's written to run in parallel in the first place, that is. You add a variable that tells the "kernel" (subroutine, in more traditional language) which thread is running it, decide what you want each thread to do, and let rip. It's possible to generate millions of threads, and the graphics card will schedule them all across the multiple processors it has available (240 in my case).
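To make that concrete, here is a toy vector-add written the way Cudafy expects. It isn't my model, just a sketch following the library's standard examples: the class and method names are made up, and the GThread indexing, Launch, Allocate and CopyToDevice/CopyFromDevice calls are the documented Cudafy ones to the best of my knowledge. It assumes the module from the setup above has already been loaded onto the device.

    using Cudafy;
    using Cudafy.Host;

    public class VectorAdd
    {
        // [Cudafy] marks the kernel for translation to GPU code. The GThread
        // argument is the variable that tells each thread which thread it is.
        [Cudafy]
        public static void Add(GThread thread, int n, float[] a, float[] b, float[] c)
        {
            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            if (tid < n)
                c[tid] = a[tid] + b[tid];
        }

        public static float[] Run(GPGPU gpu, float[] a, float[] b)
        {
            int n = a.Length;
            float[] result = new float[n];

            // Copy the inputs onto the card and allocate space for the output.
            float[] devA = gpu.CopyToDevice(a);
            float[] devB = gpu.CopyToDevice(b);
            float[] devC = gpu.Allocate<float>(n);

            // One thread per element; the card schedules the blocks over its processors.
            int threadsPerBlock = 256;
            int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
            gpu.Launch(blocks, threadsPerBlock).Add(n, devA, devB, devC);

            gpu.CopyFromDevice(devC, result);
            gpu.FreeAll();
            return result;
        }
    }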

The results are amazing: my out-of-the-box conversion (no memory optimizations or anything fancy, just a default translation of my code) produced about a 20-fold increase in speed over the .NET version, and I reckon that by being a little bit fancy you could probably double the speed on the graphics card again. A day well spent. It will definitely become my method of choice for running large models and datasets.
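For what it's worth, "a little bit fancy" mostly means making better use of the card's fast shared memory so that the threads in a block don't keep going back to slow global memory. The sketch below is the classic dot-product reduction adapted to Cudafy, assuming the library's AllocateShared and SyncThreads helpers; the names are made up for illustration and I haven't profiled this against my own model.

    using Cudafy;

    public class FancyKernels
    {
        public const int ThreadsPerBlock = 128;

        // Each block accumulates its share of the products in fast shared memory,
        // reduces them to a single value, and writes one partial sum per block.
        [Cudafy]
        public static void DotProduct(GThread thread, int n, float[] a, float[] b, float[] partialSums)
        {
            float[] cache = thread.AllocateShared<float>("cache", ThreadsPerBlock);

            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            int cacheIndex = thread.threadIdx.x;

            // Grid-stride loop: each thread sums the elements it is responsible for.
            float sum = 0f;
            while (tid < n)
            {
                sum += a[tid] * b[tid];
                tid += thread.blockDim.x * thread.gridDim.x;
            }
            cache[cacheIndex] = sum;
            thread.SyncThreads();

            // Tree reduction within the block, entirely in shared memory.
            int half = thread.blockDim.x / 2;
            while (half != 0)
            {
                if (cacheIndex < half)
                    cache[cacheIndex] += cache[cacheIndex + half];
                thread.SyncThreads();
                half /= 2;
            }

            if (cacheIndex == 0)
                partialSums[thread.blockIdx.x] = cache[0];
        }
    }

The host side then just copies the handful of partial sums back and adds them up on the CPU.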

1 comment:

Mark said...

Very cool! Thanks for sharing your experience. I hadn't heard of Cudafy, but will definitely check it out.