CodePlexProject Hosting for Open Source Software

Twelve factorial amounts to just under 500,000,000, so multiplying that by 13 is greater than Int32.MaxValue, and even UInt32.MaxValue. To analyze further it is necessary to employ Int64 values for Permutations and Permutation, which is investigated in GpuTsp4*.cs. As shown here though, we get a performanc shock when we run these cases:

x64 Release ( 128 threads_per * 768 blocks = 98304 threads) Cities 12; Permutations: 479001600: --------------------------------------- ... and now with disk cache populated. Total Load Run ms ms ms distance GpuTsp3 3301 = 93 + 3208; 111.3318: - 3_Architecture_x64_2_1 GpuTsp3a 2308 = 98 + 2210; 111.3318: - 3_PathArrayStrided GpuTsp3b 2328 = 101 + 2227; 111.3318: - 3_DivisorsCachedGlobal GpuTsp4 8953 = 100 + 8853; 111.3318: - 4_Long GpuTsp4a 7841 = 103 + 7738; 111.3318: - 4_PathArrayStrided GpuTsp4b 6643 = 100 + 6543; 111.3318: - 4_DivisorsCachedGlobal

Even *Striding* the *path *array only provides marginal improvemnt for the *3 performance hit from expanding fromInt32 to Int64; both less marginal and less relative improvement thatn it provided for Int32. What has gone wrong?

The answer lies in the extremely trim architecture of the GPU's themselves, focussed on fast single-precision floating-point calculations. After much investigation, it turns out that the offending code is
**a single line** in the implementation of *PathFromRoutePermutation*, bolded below:

[Cudafy] public static float PathFromRoutePermutation(GThread thread, long permutations, long permutation, int[,] path) { for (int city = 0; city < _cities; city++) { path[city, thread.threadIdx.x] = city; } // Credit: SpaceRat. // http://www.daniweb.com/software-development/cpp/code/274075/all-permutations-non-recursive var divisor = permutations; for (int city = _cities; city > 1L; /* decrement in loop body */) {divisor /= city;int dest = (int)((permutation / divisor) % city); city--; var swap = path[dest, thread.threadIdx.x]; path[dest, thread.threadIdx.x] = path[city, thread.threadIdx.x]; path[city, thread.threadIdx.x] = swap; } return 0; }

The calculation of *divisor *is a killer because *there is absolutely no hardware support for 64-bit integer division on NVIDIA GPU's at this time.*

But after a good night's rest I remembered Rule #1:

Last edited Nov 5, 2012 at 4:42 PM by pgeerkens, version 7