This project has moved and is read-only. For the latest updates, please go here.

This offending code in PathFromRoutePermutation

var divisor = permutations;
for (int city = _cities; city > 1L; /* decrement in loop body */) {
   divisor	/= city;
   var dest	= (int)((permutation / divisor) % city);
   city--;

has imposed a 200+% performance penalty on converting from Int32 to Int64 . Fortunately, an equally valid generation of paths is available with this code, which multiplies instead of dividing.

var divisor = 1L;
for (int city = _cities; city > 1L; /* decrement in loop body */) {
   var dest	= (int)((permutation / divisor) % city);
   divisor	*= city;
   city--;

And when we run the before and after for 12 cities we obtain spectacular results:

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp4     8953 =    100 +   8853; 111.3318:  - 4_Long
GpuTsp4a    7841 =    103 +   7738; 111.3318:  - 4_PathArrayStrided
GpuTsp4b    6643 =    100 +   6543; 111.3318:  - 4_DivisorsCachedGlobal

GpuTsp4c    2639 =    103 +   2536; 111.3318:  - 4_MultiplyInstead

Changing those two lines around, from a non-hardware-supported Int-64 division to a hardware-supported Int-64 multiplication provided a 65% time reduction. With bang like that for our buck, even implementing it on the Int-32 division was worthwhile, as seen here, proving that even hard-ware supprted integer division can be slow.

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  12;   Permutations:  479001600:
---------------------------------------
           Total     Load      Run
              ms       ms       ms  distance
GpuTsp3b    2328 =    101 +   2227; 111.3318:  - 3_DivisorsCachedGlobal

GpuTsp3c    1935 =    105 +   1830; 111.3318:  - 3_MultiplyInstead

To wrap up for now, one run with 13 cities (with Int-64's of course):

x64 Release ( 128 threads_per * 768 blocks = 98304 threads)
Cities  13;   Permutations: 6,227,020,800:
---------------------------------------
           Total     Load      Run
              ms       ms       ms             distance
GpuTsp4c   38830 =   3228 +  35602; 111.9742:  - 4_MultiplyInstead

At 14 * the 12-city time, our implementation is still scaling reasonably.

Last edited Nov 5, 2012 at 4:45 PM by pgeerkens, version 7