This project has moved. For the latest updates, please go here.

Twelve factorial amounts to just under 500,000,000, so multiplying that by 13 is greater than Int32.MaxValue, and even UInt32.MaxValue. To analyze further it is necessary to employ Int64 values for Permutations and Permutation, which is investigated in GpuTsp4*.cs.  As shown here though, we get a performanc shock when we run these cases:

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
... and now with disk cache populated.
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp3     3301 =     93 +   3208; 111.3318: - 3_Architecture_x64_2_1
GpuTsp3a    2308 =     98 +   2210; 111.3318: - 3_PathArrayStrided
GpuTsp3b    2328 =    101 +   2227; 111.3318: - 3_DivisorsCachedGlobal

GpuTsp4     8953 =    100 +   8853; 111.3318: - 4_Long
GpuTsp4a    7841 =    103 +   7738; 111.3318: - 4_PathArrayStrided
GpuTsp4b    6643 =    100 +   6543; 111.3318: - 4_DivisorsCachedGlobal

Even Striding the path array only provides marginal improvemnt for the *3 performance hit from expanding fromInt32 to Int64; both less marginal and less relative improvement thatn it provided for Int32. What has gone wrong?

The answer lies in the extremely trim architecture of the GPU's themselves, focussed on fast single-precision floating-point calculations. After much investigation, it turns out that the offending code is a single line in the implementation of PathFromRoutePermutation, bolded below:

[Cudafy]
public static float PathFromRoutePermutation(GThread thread, 
				long  permutations, long  permutation, int[,] path) {
   for (int city = 0; city < _cities; city++) { path[city, thread.threadIdx.x] = city; }

   // Credit: SpaceRat. 
   // http://www.daniweb.com/software-development/cpp/code/274075/all-permutations-non-recursive
   var divisor = permutations;
   for (int city = _cities; city > 1L; /* decrement in loop body */) {
      divisor	/= city;
      int dest	= (int)((permutation / divisor) % city);

   city--;

      var swap				= path[dest, thread.threadIdx.x];
      path[dest, thread.threadIdx.x]	= path[city, thread.threadIdx.x];
      path[city, thread.threadIdx.x]	= swap;
   }
   return 0;
}

The calculation of divisor is a killer because there is absolutely no hardware support for 64-bit integer division on NVIDIA GPU's at this time.

But after a good night's rest I remembered Rule #1:

Never Divide When You Can Multiply Instead!

Last edited Nov 5, 2012 at 3:42 PM by pgeerkens, version 7