First, we will review our code foundation and cover some odds and ends.

  • A careless error in John's distance formula has been corrected - I won't embarass him anymore than that.
  • In the original MPU code a lock is unnecessarily held in the main calculation loop. The MpuTsp_Better class corrects that error with this improved loop:
Parallel.For(0, _permutations, 
   () => new LocalData(float.MaxValue, -1L),
   (permutation, state, localData) => {
      var path		= new int[1, _cities];
      var distance	= FindPathDistance( permutation, path, 0);
      if (distance < localData.BestDistance) {
         localData.BestDistance		= distance;
         localData.BestPermutation	= permutation;
      }
      return localData;
   },
   (localData) => {
      lock (locker) { 
         if (localData.BestDistance < bestDistance) {
            bestDistance	= localData.BestDistance;
            bestPermutation	= localData.BestPermutation;
         }
      }
   }
);

I was surprised at how litle improvement that made in the timing on an 8-core system. This is our new comparison base for the advantages of GPU-enabled algorithms.

  • Notice the times and timing changes for GpuTsp0-cold and -warm; this is the identical code run twice in succession. I left the first in to warm-up the GPU, and JIT-compile the CUDA and CUDAfy, so we can get a better comparison against the time for GpuTsp-warm.
  • The large load times for the first run of each test is for compiling and CUDAfy-ing the tests. In production this time would be eliminated by caching the result, as seen in the second time for each test.

Next: Structs & Strides - Basic GPU Memory Access

 

Last edited Nov 4, 2012 at 11:35 PM by pgeerkens, version 4

Comments

WilBennett Dec 10, 2012 at 8:00 PM 
Hello,

There is another "minor" change that you can make that will let you see the speedup you were expecting.

1. Create a partition - var part = Partitioner.Create(0, _permutations);
2. Instead of doing a Parallel.For, do Parallel.ForEach - Parallel.ForEach(part, range => {
3. Wrap the inner portion in a loop - for (long permutation = range.Item1; permutation < range.Item2; ...
4. You can create the local data as the first statement in the loop

This will be closer to what the GPU version is doing.

Wil