This offending code in PathFromRoutePermutation

var divisor = permutations;
for (int city = _cities; city > 1L; /* decrement in loop body */) {
   divisor	/= city;
   var dest	= (int)((permutation / divisor) % city);
   city--;

has imposed a 200+% performance penalty on converting from Int32 to Int64 . Fortunately, an equally valid generation of paths is available with this code, which multiplies instead of dividing.

var divisor = 1L;
for (int city = _cities; city > 1L; /* decrement in loop body */) {
   var dest	= (int)((permutation / divisor) % city);
   divisor	*= city;
   city--;

And when we run the before and after for 12 cities we obtain spectacular results:

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp4     8953 =    100 +   8853; 111.3318:  - 4_Long
GpuTsp4a    7841 =    103 +   7738; 111.3318:  - 4_PathArrayStrided
GpuTsp4b    6643 =    100 +   6543; 111.3318:  - 4_DivisorsCachedGlobal

GpuTsp4c    2639 =    103 +   2536; 111.3318:  - 4_MultiplyInstead

Changing those two lines around, from a non-hardware-supported Int-64 division to a hardware-supported Int-64 multiplication provided a 65% time reduction. With bang like that for our buck, even implementing it on the Int-32 division was worthwhile, as seen here, proving that even hard-ware supprted integer division can be slow.

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp3b    2328 =    101 +   2227; 111.3318:  - 3_DivisorsCachedGlobal

GpuTsp3c    1935 =    105 +   1830; 111.3318:  - 3_MultiplyInstead

To wrap up for now, one run with 13 cities (with Int-64's of course):

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  13;   Permutations: 6,227,020,800:
---------------------------------------
           Total     Load      Run
              ms       ms       ms             distance
GpuTsp4c   38830 =   3228 +  35602; 111.9742:  - 4_MultiplyInstead

At 14 * the 12-city time, our implementation is still scaling reasonably.

Last edited Nov 5, 2012 at 3:45 PM by pgeerkens, version 7

Comments

pgeerkens Mar 19, 2013 at 9:34 AM 
@WilBennett: Yes, I am running on a second graphics card.

WilBennett Dec 10, 2012 at 8:15 PM 
Question.

Are you running this on a GPU other than the one that is connected to your display? When I run it on my 580 with 13 cities, I get a launch timeout (because Windows limits the time the GPU can be busy).

Thanks,
Wil

WilBennett Dec 10, 2012 at 8:05 PM 
Hello,

You can get rid of the multiply all together. The calculation will be the same regardless of the permutation you are on. You can do the calculation once in the outer function and pass it in as an array of int. it's just a simple matter of indexing then!

Wil