Tuning Tutorial for CUDA applications using CUDAfy
John Michael Hauck has presented an excellent introduction to CUDA and CUDAfy, built around the Travelling Salesman problem (TSP), here:
cudafy-me-part-1-of-4.html. This tutorial expands on that presentation, showing how the selected algorithm can be sped up almost 3x by better exploiting the strengths of the CUDA architecture as exposed through CUDAfy.
As we step through obtaining that improvement, the reader should see a routine emerge that can later be applied to other problems.
Below is a listing of the results obtained by running the tuning steps for an 11-city TSP problem. The key benchmarks are the Run times: on a CUDA GPU, the run time falls from 379 ms (GpuTsp0, warm) to 135 ms using Int32 (GpuTsp3c) or to 187 ms using Int64 (GpuTsp4c).
x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities 11; Permutations: 39916800:
---------------------------------------
With disk cache empty and only a single class ...
          Total    Load      Run
             ms      ms       ms  distance
CpuTsp    14693 =     0 +  14693; 110.7368:
MpuTsp     7404 =     0 +   7404; 110.7368:
MpuTspA    5148 =     0 +   5148; 110.7368: - MpuTsp_Better
GpuTsp0    3012 =  2576 +    436; 110.7368: - cold
GpuTsp0    1832 =  1453 +    379; 110.7368: - warm
/* some detail elided - complete table available under Documentation */
... and now with disk cache populated.
          Total    Load      Run
             ms      ms       ms  distance
GpuTsp1     430 =    88 +    342; 110.7368: - 1_SeparateClass
GpuTsp2     231 =    90 +    141; 110.7368: - 2_StructArray
GpuTsp3     264 =    93 +    171; 110.7368: - 3_Architecture_x64_2_1
GpuTsp3a    263 =    93 +    170; 110.7368: - 3_PathArrayStrided
GpuTsp3b    269 =    98 +    171; 110.7368: - 3_DivisorsCachedGlobal
GpuTsp4     641 =    97 +    544; 110.7368: - 4_Long
GpuTsp4a    640 =   100 +    540; 110.7368: - 4_PathArrayStrided
GpuTsp4b    615 =   106 +    509; 110.7368: - 4_DivisorsCachedGlobal
GpuTsp3c    240 =   105 +    135; 110.7368: - 3_MultiplyInstead
GpuTsp4c    290 =   103 +    187; 110.7368: - 4_MultiplyInstead
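
The headline figures in the listing can be cross-checked with a few lines of arithmetic. The Python sketch below is only an illustration, not part of the tutorial's CUDAfy code. It recomputes the thread count and permutation count from the listing header, and derives the speedups from the Run column. The even split of permutations across threads is an assumption made here for illustration; the listing does not say how the work is divided.

```python
import math

# Figures copied from the header of the results listing.
THREADS_PER_BLOCK = 128
BLOCKS = 768
CITIES = 11

total_threads = THREADS_PER_BLOCK * BLOCKS   # threads launched on the GPU
permutations = math.factorial(CITIES)        # candidate tours for 11 cities

# Assumption: work is divided evenly, so each thread handles at most this many.
per_thread = math.ceil(permutations / total_threads)


def speedup(baseline_ms, tuned_ms):
    """Return how many times faster the tuned run is than the baseline."""
    return baseline_ms / tuned_ms


# Run times (ms) taken from the "Run" column of the listing.
baseline = 379     # GpuTsp0, warm
int32_best = 135   # GpuTsp3c
int64_best = 187   # GpuTsp4c

print(f"{total_threads} threads, {permutations} permutations, "
      f"up to {per_thread} per thread")
print(f"Int32 speedup: {speedup(baseline, int32_best):.2f}x")
print(f"Int64 speedup: {speedup(baseline, int64_best):.2f}x")
```

The Int32 result works out to roughly 2.8x, which is the "almost 3x" quoted in the introduction; the Int64 variant comes in at about 2x.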
Part 2: The Better MpuTsp - and some odds & ends
Part 3: Structs & Strides - Basic GPU Memory Access
Part 4: 13 Factorial Doesn't Compute!
Part 5: Never Divide When You Can Multiply Instead!
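
The title of Part 4, together with the Int32 and Int64 variants in the results, suggests that the permutation count outgrows a 32-bit integer at 13 cities. That reading is an assumption here, but the arithmetic is easy to check. The Python sketch below (a standalone illustration with a hypothetical helper name, not code from the tutorial) finds the largest n whose factorial still fits in a signed 32-bit and a signed 64-bit integer.

```python
import math

# Largest values of the signed integer types (.NET Int32 and Int64).
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def largest_factorial_that_fits(limit):
    """Return the largest n such that n! <= limit."""
    n = 1
    while math.factorial(n + 1) <= limit:
        n += 1
    return n


n32 = largest_factorial_that_fits(INT32_MAX)
n64 = largest_factorial_that_fits(INT64_MAX)

print(f"Int32 holds up to {n32}! = {math.factorial(n32)}")
print(f"Int64 holds up to {n64}! = {math.factorial(n64)}")
print(f"13! = {math.factorial(13)} exceeds Int32 max {INT32_MAX}")
```

Since 12! still fits in 32 bits, the 11-city benchmark above can run with either integer width, which is why the listing can report both.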