Tests GpuTsp_2 and GpuTsp_3a introduce a key concept in PGU programming: understanding memory access. Understanding how the thread-warp on each GPU access memory on each instruction cycle, particularly how it access the very fast (on-chip) Shared Memory, is essential to getting the best performance.

The best description of Shared Memory structure is to visualize it as a 2D table. This table appears to be in row-major order when conceptualized by a serial algorithm, such as when written to, or read from, by the host CPU; however for optimal performance it must appear in column-major order for each thread of a warp. In the simple 1D case here that means putting the thread-threadidx.x index last. The same analysis holds for the paired-arrays gpuLatitude and gpuLongitude.

The CUDA and CUDAfy infrastructures endeavour to align data suitably, but it is also incumbent of developers to recognize the pattern. Thus a substantial performance gain is seen by converting the paired arrays gpuLatitude and gpuLongitude to a struct array of LatLongStruct, and re-ordering the dimensions of the paths array.

public struct LatLongStruct {
public float Latitude;
public float Longitude;
public float Distance(LatLongStruct other) {
return (float)GMath.Sqrt(Square(Latitude - other.Latitude)
+ Square(Longitude - other.Longitude));
public float Square(float value) { return value * value; }
public static void GpuFindPathDistance(GThread thread, 
   int  permutations, LatLongStruct[] gpuLatLong, AnswerStruct[] answer) {
   var threadsPerGrid	= thread.blockDim.x * thread.gridDim.x;
   var paths		= thread.AllocateShared<int>("path", _cities, _threadsPerBlock);
   var bestDistances	= thread.AllocateShared<float>("dist", _threadsPerBlock);
   var bestPermutations = thread.AllocateShared<int> ("perm",	_threadsPerBlock);

   var permutation	= (int)(thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x);
   var bestDistance	= float.MaxValue;
   var bestPermutation	= 0;
   while (permutation < permutations) {
      var distance = FindPathDistance(thread, permutations, permutation, gpuLatLong, paths);
      if (distance < bestDistance) {
         bestDistance	= distance;
         bestPermutation= permutation;
      permutation	  += threadsPerGrid;

Along the way, in test GpuTsp_3 we also set the Architecture and Compute Capability (2.1 for the GT 630M GPU) before compiling and CUDAfying the code:
internal override Answer GetAnswer() {
   var stopWatchLoad	= Stopwatch.StartNew();
   using (var gpu	= CudafyHost.GetDevice()) {
      var arch		= gpu.GetDeviceProperties().Capability.GetArchitecture();
      LoadTime	= stopWatchLoad.ElapsedMilliseconds;
      // etc.
   // etc.

This latter may not seem beneficial in the short 11-city case, but in the 12-city case it provides a small net gain. If you know that specific features of newer architectures will benefit your performance, the code above demonstrates how to query the available devide and instruct CUDAfy to use it to advanatage.

Next: 13 Factorial Doesn't Compute!

Last edited Dec 9, 2012 at 11:47 AM by pgeerkens, version 9


No comments yet.