
First, we will review our code foundation and cover some odds and ends.

  • A careless error in John's distance formula has been corrected; I won't embarrass him any more than that.
  • In the original MPU code a lock is unnecessarily held in the main calculation loop. The MpuTsp_Better class corrects that error with this improved loop:
Parallel.For(0, _permutations,
   // Each worker thread starts with its own "no path found yet" record.
   () => new LocalData(float.MaxValue, -1L),
   // Main loop body: touches only thread-local data, so no lock is needed here.
   (permutation, state, localData) => {
      var path     = new int[1, _cities];
      var distance = FindPathDistance(permutation, path, 0);
      if (distance < localData.BestDistance) {
         localData.BestDistance    = distance;
         localData.BestPermutation = permutation;
      }
      return localData;
   },
   // Runs once per worker thread: merge its best result into the shared result.
   (localData) => {
      lock (locker) {
         if (localData.BestDistance < bestDistance) {
            bestDistance    = localData.BestDistance;
            bestPermutation = localData.BestPermutation;
         }
      }
   });

I was surprised at how little improvement that made in the timing on an 8-core system. This is our new comparison base for the advantages of GPU-enabled algorithms.
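To make the two loop shapes easy to compare side by side, here is a minimal self-contained sketch of both: one that takes the lock on every iteration, and one that keeps a thread-local best and locks only in the final merge. It is not the project's code; `LockPlacementSketch`, `Cost`, and both method names are hypothetical, and `Cost` is only a cheap stand-in for `FindPathDistance`.

```csharp
using System;
using System.Threading.Tasks;

public static class LockPlacementSketch
{
    // Hypothetical stand-in for FindPathDistance: a cheap, pure function of the index.
    // For i in [0, 10007) every value is distinct, so the minimum is unique.
    public static float Cost(long i) => (float)((i * 7919L + 1234L) % 10007L);

    // Variant A: the shared lock is taken inside the main loop, on every iteration.
    public static (float Best, long Index) MinWithSharedLock(long count)
    {
        var locker = new object();
        float best = float.MaxValue;
        long bestIndex = -1L;
        Parallel.For(0L, count, i =>
        {
            float c = Cost(i);
            lock (locker)
            {
                if (c < best) { best = c; bestIndex = i; }
            }
        });
        return (best, bestIndex);
    }

    // Variant B: each worker tracks its own best; the lock is taken once per worker.
    public static (float Best, long Index) MinWithThreadLocal(long count)
    {
        var locker = new object();
        float best = float.MaxValue;
        long bestIndex = -1L;
        Parallel.For<(float Best, long Index)>(0L, count,
            () => (float.MaxValue, -1L),              // per-thread starting value
            (i, state, local) =>
            {
                float c = Cost(i);
                if (c < local.Best) local = (c, i);   // no lock: local is private to this worker
                return local;
            },
            local =>
            {
                lock (locker)                         // merge step only
                {
                    if (local.Best < best) { best = local.Best; bestIndex = local.Index; }
                }
            });
        return (best, bestIndex);
    }
}
```

Both variants return the same answer; only the number of lock acquisitions differs. Wrapping each call in a `Stopwatch` is a quick way to see how much the lock placement matters for a given cost function on your own machine.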

  • Notice the times and timing changes for GpuTsp0-cold and -warm; this is identical code run twice in succession. I left the first run in to warm up the GPU and to JIT-compile the CUDA and CUDAfy code, so we can get a better comparison against the time for GpuTsp-warm.
  • The large load times for the first run of each test are for compiling and CUDAfy-ing the tests. In production this time would be eliminated by caching the result, as seen in the second time for each test.
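The cold/warm effect described above can be sketched without a GPU. In the minimal example below, the slow one-time step is simulated with a sleep and cached with `Lazy<T>`, so the first timed load pays for the build and later loads reuse the result. Everything here is hypothetical: `ColdWarmTimingSketch`, `BuildModule`, and `TimedLoad` are illustration-only names, and no CUDAfy call is involved.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ColdWarmTimingSketch
{
    // Counts how many times the expensive step actually ran.
    public static int BuildCount;

    // Hypothetical stand-in for compiling and CUDAfy-ing a kernel.
    private static string BuildModule()
    {
        Interlocked.Increment(ref BuildCount);
        Thread.Sleep(50);               // simulate a slow one-time compile
        return "compiled-module";
    }

    // Lazy<T> runs BuildModule on first access and caches the result afterwards.
    private static readonly Lazy<string> CachedModule = new Lazy<string>(BuildModule);

    // Times one load of the module: slow when cold, near-instant when warm.
    public static TimeSpan TimedLoad()
    {
        var timer = Stopwatch.StartNew();
        string module = CachedModule.Value;
        timer.Stop();
        GC.KeepAlive(module);           // keeps the read from looking unused
        return timer.Elapsed;
    }
}
```

Calling `TimedLoad()` twice and printing both results should show the same pattern as the -cold and -warm rows: a large first value, then a much smaller one, because the build ran only once.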

Next: Structs & Strides - Basic GPU Memory Access


Last edited Nov 5, 2012 at 12:35 AM by pgeerkens, version 4