REU Summer 2007: Traffic Analysis Improvements Using a GPU
This summer I worked on improving the performance of Rabin fingerprinting by using a GPU to do lookups on the data. This has to be done very differently than normal consecutive execution, because everything is parallelized. I used a new NVidia G80 card to do this, which uses their CUDA (Compute Unified Device Architecture) technology. Basically it is C with some added functionality to parallelize projects.
The way the architecture is set up on CUDA is that GPU functions called kernels are coded, and each kernel contains a grid. Each grid contains a multi-dimensional array of blocks, and each block contains a multi-dimensional array of threads. Each block and thread is usually treated as a single entity, but the thread and block IDs can be referenced if something needs to only be done once per kernel (i.e. if (blockIdx.x
= 0 && threadIdx.x = 0) ... ) or once per block. One of the problems that was run into was that there are basically 3 types of memory. There is device memory (540MB on the 8800 GTS) which is significant on a per-grid basis, shared memory (16 kB) which is significant on a per-block basis, and local memory/registers (8192 registers) which is significant on a per-thread basis. Local and shared memory is on-chip, so it is extremely fast to access. However, device memory can take several hundred clock cycles to access.
Because of the size constraint on shared memory, most data was stored in the device memory and it was copied down to shared or local memory when new values were needed.
First, the idea was to start off with an empty hash table and a list of test values to check. Because there would be nothing to check against, they would be added to the hash table. After that, more values would be checked and if they were found in the table, the count of the hash that was found would be incremented. If it wasn't found in the table, it would be added to the table. Because a lot of lookups are being done, device memory has to be accessed very frequently. It must copy memory down from device to shared memory every time it runs out (I used 8kB to make sure I didn't use too much shared memory). This code can be found
here. This turned out to be slower than the original code running on a CPU.
To help alleviate some of the accesses to device memory, the next idea was to statically generate a table and check against that, so instead of adding to the table if the hash wasn't found, the hash would just be ignored. I wrote a hash generator function (source
here) to generate a hash table and drop it into a file. Some modifications had to be added to the original code to incorporate these changes, but it ended up mostly working. Unfortunately, performance really did not increase. The new source can be found
here.
There could have been several causes for the performance problems. Parallel execution slows down considerably when conditional statements are used, and because of the nature of what I was doing, conditional statements had to be used a lot. I believe this was the cause of most of the performance hit. Also, the aforementioned slow access to global memory could have been an issue as well. The CUDA architecture is really designed upon every thread acting independently, like doing operations on giant matrices or doing graphics work. So I believe that these problems would be alleviated on a project that takes that into consideration.
One of the issues I ran into was the difficulty in debugging. CUDA can use several flags when compiling to turn on debugging, but they don't tend to be very descriptive. Debug (make dbg=1) mode is used to debug when a kernel fails, but it doesn't say why it fails (the most common error is "Kernel execution failed"). I found the most common problem with debug errors was memory access violations, so I had to track down exactly what was being malloc'd and making absolutely sure that nothing is accessing memory outside of that. I believe another problem was that threads could diverge and one thread could be stuck in a loop, crashing the kernel. This is much harder to track down, but I believe this was happening because some loops were dependent upon a global memory value. The kernel can also be ran in emulation mode (make emu=1), which compiles the code to run on the CPU, but this is really used to track down logic problems and not thread/memory access problems.
So even though we didn't get done what we wanted to, I still learned a lot about parallel execution and, more specifically, about coding for CUDA.
--
NWK - 02 Aug 2007