Re: [Gems-users] Number of instructions executed per core


Date: Thu, 25 Feb 2010 14:38:03 -0600
From: Dan Gibson <degibson@xxxxxxxx>
Subject: Re: [Gems-users] Number of instructions executed per core
7. Every #pragma omp parallel or #pragma omp for has an implicit barrier at the end; on worksharing constructs such as #pragma omp for, that barrier is only elided if an explicit nowait clause is specified.

8. You have added the barrier in the wrong place; with its current placement you are actually introducing artificial load imbalance. Put the barrier /before/ the magic instruction.

9. Our art source code differs: I show that loop as schedule(dynamic), not schedule(static). schedule(dynamic) should do a better job of reducing imbalance if there is variance in iteration duration. We are using SPEC OMP2001, version string 35. (See the sketch after this list.)
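Concretely, points 8 and 9 together would look roughly like the sketch below. This is only a minimal sketch, placed inside your existing #pragma omp parallel region; MAGIC(0x40000) and the loop variables are taken from the code you posted, and the loop body is elided.

   puts("magic1");
#pragma omp barrier   /* every thread arrives here first ...              */
   MAGIC(0x40000);    /* ... so no thread hits the magic breakpoint until
                          all of the Simics processors have caught up     */

   /* schedule(dynamic) hands out iterations on demand, which absorbs
      per-iteration variance; add nowait only if you really want to drop
      the implicit barrier at the end of the work-sharing loop.           */
#pragma omp for private(k, m, n, gPassFlag) schedule(dynamic)
   for (ij = 0; ij < ijmx; ij++)
   {
      /* ... loop body unchanged ... */
   }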

On Thu, Feb 25, 2010 at 2:31 PM, <ubaid001@xxxxxxx> wrote:
Hi,



1. Lack of explicit locking does not imply lack of spinning. There are
implicit barriers at the end of most OpenMP #pragmas, and looking at art's
source code, I see that there are several #pragmas without barrier elision.
Moreover, it is possible to spin in the OS, or in other processes.


   Yes, I agree that there are pragmas, but I have added barriers wherever needed.

I run Simics up to this magic breakpoint, then load Opal and Ruby and simulate the parallel loop:

puts("magic1");
MAGIC(0x40000);
#pragma omp barrier  // the barrier so that all Simics proc can reach the same pt

#pragma omp for private(k, m, n, gPassFlag) schedule(static)
   for (ij = 0; ij < ijmx; ij++)
   {
      j = ((ij/inum) * gStride) + gStartY;
      i = ((ij%inum) * gStride) + gStartX;
      k = 0;
      for (m = j; m < (gLheight+j); m++)
#pragma noswp
        for (n = i; n < (gLwidth+i); n++)
          f1_layer[o][k++].I[0] = cimage[m][n];
      gPassFlag = 0;
      gPassFlag = match(o, i, j, &mat_con[ij], busp);

      if (gPassFlag == 1)
      {
#ifdef DEBUG
        printf(" at X= %d Y = %d\n", i, j);
#endif
        if (set_high[o][0] == TRUE)
        {
          highx[o][0] = i;
          highy[o][0] = j;
          set_high[o][0] = FALSE;
        }
        if (set_high[o][1] == TRUE)
        {
          highx[o][1] = i;
          highy[o][1] = j;
          set_high[o][1] = FALSE;
        }
      }
#ifdef DEBUG
      else if (DB3)
        printf("0.00#%dx%da%2.1fb%2.1f\n", i, j, a, b);
#endif
   }
puts("magic2");

   }


This load imbalance was present on other benchmarks as well. Moreover, these benchmarks do not show a load imbalance when run on an actual multiprocessor system (a 4-processor Itanium 2 Montecito SMP, with each processor being dual-core).




2. It is entirely possible that something other than art is running on your
cores. Look into pset_bind, and processor_bind. With openMP, I'd recommend
pset_create instead of explicit binding.

 This is a good possibility. I will look into this.
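For reference, a rough Solaris sketch of the pset approach Dan mentions is below. The CPU IDs are an assumption for a 4-CPU serengeti target, and error handling is abbreviated; pset_create, pset_assign, and pset_bind are the standard Solaris processor-set calls.

#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>

/* Move CPUs 1-3 into a new processor set and bind this process (and
   therefore all of its OpenMP threads) to it.  CPU 0 is left in the
   default set, since Solaris will not let the default set go empty. */
static void bind_art_to_pset(void)
{
    psetid_t pset;
    processorid_t cpu;

    if (pset_create(&pset) != 0) {
        perror("pset_create");
        return;
    }
    for (cpu = 1; cpu <= 3; cpu++)            /* assumed CPU IDs */
        if (pset_assign(pset, cpu, NULL) != 0)
            perror("pset_assign");

    if (pset_bind(pset, P_PID, P_MYID, NULL) != 0)
        perror("pset_bind");
}

Calling something like this once at the start of main(), before the parallel region spawns threads, keeps other processes off those CPUs; alternatively, processor_bind() can pin individual LWPs to individual CPUs.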

3. I'm sure it's been said on this list before (because I have said it) that
instruction count is a BAD metric for multithreaded code.

 Agreed, but the outlier core differs from the others in instruction count by almost 100%, which seems spurious.



Suhail





On Feb 25 2010, Dan Gibson wrote:

I believe what you are observing is inherent to the simulation, and to real
executions, for the following reasons:

1. Lack of explicit locking does not imply lack of spinning. There are
implicit barriers at the end of most OpenMP #pragmas, and looking at art's
source code, I see that there are several #pragmas without barrier elision.
Moreover, it is possible to spin in the OS, or in other processes.
2. It is entirely possible that something other than art is running on your
cores. Look into pset_bind, and processor_bind. With openMP, I'd recommend
pset_create instead of explicit binding.
3. /Simics/ does not choose what code runs on which core. The operating
system does that. Look for ways to affect the OS, not Simics.
4. I'm sure it's been said on this list before (because I have said it) that
instruction count is a BAD metric for multithreaded code.
5. art on my serengeti target takes a LOT of TLB misses (one almost every
iteration). I'm not sure if individual cores would react differently or not
to TLB misses.
6. art uses a dynamically-scheduled parallel section. Load imbalance in
those iterations would cause one core to lag or complete early.

Regards,
Dan

On Thu, Feb 25, 2010 at 1:51 PM, <ubaid001@xxxxxxx> wrote:


Since there are no mutex locks, no processor is spinning. I have only my
benchmark running on my Simics target machine (Serengeti). Is there a
possibility that the faulty core is running some other program rather than
the art thread?

Also, is there any way in Simics to bind a thread to a particular processor,
so that I know for sure that all of my processors are running the user
threads?



Suhail



On Feb 25 2010, ubaid001@xxxxxxx wrote:

 Hi,

I had brought up this issue earlier. I am running the SPEC OpenMP benchmark
(art) on a 4-core CMP system (Opal + Ruby).

There is a huge difference in the number of instructions executed between
one processor and the rest. I know that there are no mutex locks in the
code; in fact, I load Opal and Ruby only from the parallel section of the
program.

One core either lags behind or leads the other processors, and this happens
on every single simulation.

Can anyone shed more light on this?

Suhail






_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.




--
http://www.cs.wisc.edu/~gibson [esc]:wq!