Tuning Tutorial for CUDA applications using CUDAfy

An excellent introduction to CUDA and CUDAfy, using the Travelling Salesman Problem (TSP), has been presented here: cudafy-me-part-1-of-4.html. This tutorial expands on that presentation, showing how the selected algorithm can be sped up almost 3x by better exploiting the strengths of the CUDA architecture as exposed through CUDAfy. As we step through obtaining that improvement, it is hoped that the reader will observe a routine that can later be applied to other applications.

Below is a listing of the results obtained by running the tuning steps for an 11-city TSP problem. The key benchmarks are the Run times: the run time on a CUDA GPU is reduced from 379 ms (GpuTsp0, warm) to 135 ms using Int32 (GpuTsp3c) or to 187 ms using Int64 (GpuTsp4c).

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  11;   Permutations:   39916800:
---------------------------------------
With disk cache empty and only a single class ...
           Total     Load      Run
              ms       ms       ms  distance
CpuTsp     14693 =      0 +  14693; 110.7368: 
MpuTsp      7404 =      0 +   7404; 110.7368: 
MpuTspA     5148 =      0 +   5148; 110.7368: - MpuTsp_Better
GpuTsp0     3012 =   2576 +    436; 110.7368: - cold
GpuTsp0     1832 =   1453 +    379; 110.7368: - warm
/* some detail elided - complete table available under Documentation */
... and now with disk cache populated.
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp1      430 =     88 +    342; 110.7368: - 1_SeparateClass
GpuTsp2      231 =     90 +    141; 110.7368: - 2_StructArray

GpuTsp3      264 =     93 +    171; 110.7368: - 3_Architecture_x64_2_1
GpuTsp3a     263 =     93 +    170; 110.7368: - 3_PathArrayStrided
GpuTsp3b     269 =     98 +    171; 110.7368: - 3_DivisorsCachedGlobal

GpuTsp4      641 =     97 +    544; 110.7368: - 4_Long
GpuTsp4a     640 =    100 +    540; 110.7368: - 4_PathArrayStrided
GpuTsp4b     615 =    106 +    509; 110.7368: - 4_DivisorsCachedGlobal

GpuTsp3c     240 =    105 +    135; 110.7368: - 3_MultiplyInstead
GpuTsp4c     290 =    103 +    187; 110.7368: - 4_MultiplyInstead
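
The headline figures in the listing above can be checked with a few lines of arithmetic. The following is only a sketch in Python, not part of the tutorial's code. The constant names and the speedup helper are illustrative, and the per-thread figure assumes the permutations are spread evenly across threads, which the listing itself does not state.

```python
import math

# Launch configuration and problem size, copied from the listing header.
THREADS_PER_BLOCK = 128
BLOCKS = 768
CITIES = 11

total_threads = THREADS_PER_BLOCK * BLOCKS
permutations = math.factorial(CITIES)

# Ceiling division: the most permutations any one thread would handle,
# ASSUMING an even split of the work (not confirmed by the listing).
perms_per_thread = -(-permutations // total_threads)


def speedup(before_ms, after_ms):
    """Ratio of two run times (hypothetical helper for this sketch)."""
    return before_ms / after_ms


# Run times (ms) taken from the "Run" column of the listing.
BASELINE_RUN_MS = 379   # GpuTsp0, warm
INT32_RUN_MS = 135      # GpuTsp3c
INT64_RUN_MS = 187      # GpuTsp4c

int32_speedup = speedup(BASELINE_RUN_MS, INT32_RUN_MS)
int64_speedup = speedup(BASELINE_RUN_MS, INT64_RUN_MS)

print(f"threads: {total_threads}, permutations: {permutations}")
print(f"permutations per thread (at most): {perms_per_thread}")
print(f"Int32 speed-up: {int32_speedup:.2f}x, Int64 speed-up: {int64_speedup:.2f}x")
```

The thread and permutation counts reproduce the values printed in the listing header, and the Int32 ratio is where the "almost 3x" figure comes from.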

Part 2: The Better MpuTsp - and some odds & ends

Part 3: Structs & Strides - Basic GPU Memory Access

Part 4: 13 Factorial Doesn't Compute!

Part 5: Never Divide When You Can Multiply Instead!

Last edited Dec 9, 2012 at 11:46 AM by pgeerkens, version 15