Re: [Gems-users] Possible Bugs in EETM


Date: Fri, 25 Sep 2009 16:03:51 -0500
From: Polina Dudnik <pdudnik@xxxxxxxxx>
Subject: Re: [Gems-users] Possible Bugs in EETM
Hi Byong-Wu,

Thanks for your input. We will check it out and get back to you on whether those are indeed bugs.

Polina

On Thu, Sep 24, 2009 at 8:56 PM, BYONG WU CHONG <bernard.chong@xxxxxxxx> wrote:

Hello,

 

Here are some critical bugs I found in GEMS EETM. I verified that the fixes below make EETM more stable by running GEMS simulations on the STAMP benchmarks.

I thought HTM researchers might be interested in these.

 

 

Bug #1:

  The thread id is not passed to the tm_trap_handler function call.

Related Files and Code:

  // microbenchmarks/transactional/common/transaction.c

  void tm_trap_handler(int threadID){

    ...

  }

  ...

  void transaction_manager_stub(int dummy){

    ...

        BEGIN_ESCAPE

        asm volatile(   \

              "call %1\n"        \

              "mov %%g2, %%o0\n" \

              "mov %%g3, %0\n"   \

              :"=r"(restart)

              :"r"(&tm_trap_handler)

              :"%o0", "%o7"

              );

    ...

  }

Reason:

  The GCC inline assembly in the transaction_manager_stub() function does not

  correctly pass the thread id (the value GEMS places in register g2) to the tm_trap_handler function.

Solution:

  The second assembly line of "mov %%g2, %%o0\n" should be moved before the "call %1\n" line.

 

 

Bug #2:

  The software abort trap handler unrolls the log over its own stack space.

Reason:

  Eager VM logs the thread's stack space, which becomes a problem when log unrolling touches the abort trap handler's

  own stack space. The handler overwrites its own stack with log data and then behaves almost randomly:

  it may write another log's data or treat an arbitrary memory region as the log.

Steps leading to an error:

  1.  A transaction enters a function and writes some data to heap and stack.

  2.  The data gets written to the stack, heap and the log.

  3.  The transaction exits the function and the stack shrinks.

  4.  The transaction gets aborted due to a conflict in the heap area.

  5.  The software abort trap handler starts log unrolling from the shrunken stack pointer.

  6.  The undo log's target addresses and the trap handler's stack space collide.

  7.  The trap handler unrolls the log and overwrites its own stack.

  8.  Corruption in the trap handler's stack space causes random log unrolling.

  9.  System goes haywire.

Solution:

  1.  Record the stack pointer just before trap handling starts.

  2.  Skip an undo log entry when its target address lies in the trap handler's stack area,

      that is, when the target address is between the current and the recorded (old) stack pointer.

Related Files and Code:

  // ruby/microbenchmarks/transactional/common/transaction.c

  void tm_unroll_log_entry(unsigned int* entry){

    int k;

 

    unsigned int *address = (unsigned int *) (*(entry+16) & ADDRESS_MASK);

 

    for (k = 0; k < 16; k++){

      // NOTE: This should be the place for checking writing to itself.

      unsigned int data = *(entry + k);

      *address = data;

      address++;

    }

  }
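
The check mentioned in the NOTE inside the loop could look roughly like the sketch below. This is only an illustration, not the actual patch: in_handler_stack, unroll_entry_guarded, LOG_WORDS and the cur_sp/old_sp parameters are hypothetical names, and the half-open range [lo, hi) is an assumption about which boundary word belongs to the handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_WORDS 16  /* data words per log entry, matching the loop above */

/* Hypothetical helper: 1 if addr lies between the two stack pointers.
 * The bounds are ordered here so the stack growth direction does not matter. */
static int in_handler_stack(uintptr_t addr, uintptr_t cur_sp, uintptr_t old_sp)
{
    uintptr_t lo = cur_sp < old_sp ? cur_sp : old_sp;
    uintptr_t hi = cur_sp < old_sp ? old_sp : cur_sp;
    return addr >= lo && addr < hi;
}

/* Restore one log entry, skipping words that target the handler's own stack.
 * Returns the number of words actually restored. */
static int unroll_entry_guarded(const unsigned int *data, unsigned int *target,
                                uintptr_t cur_sp, uintptr_t old_sp)
{
    int k, restored = 0;
    assert(data != NULL && target != NULL);
    for (k = 0; k < LOG_WORDS; k++) {
        if (in_handler_stack((uintptr_t)(target + k), cur_sp, old_sp))
            continue;               /* would overwrite the handler itself */
        target[k] = data[k];
        restored++;
    }
    return restored;
}
```

Here cur_sp would be the handler's stack pointer at unroll time and old_sp the value recorded before trap handling began (solution step 1).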

 

 

Bug #3:

  The transaction is not aborted when it conflicts with non-TM code or an escape action.

Reason:

  Because conflicts are detected at cache-line granularity (16 words), a false conflict is possible between a transaction

  and non-transactional code. The current GEMS version simply allows non-TM code to read the transactionally isolated line.
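
To make the granularity point concrete, here is a small sketch of how two different words end up on the same line. It assumes 4-byte words (so a 16-word line is 64 bytes); line_base and same_line are hypothetical helpers, not GEMS functions.

```c
#include <stdint.h>

/* Assumption: 4-byte words, so a 16-word line spans 64 bytes. */
enum { WORD_BYTES = 4, WORDS_PER_LINE = 16,
       LINE_BYTES = WORD_BYTES * WORDS_PER_LINE };

/* Hypothetical helper: line-aligned base address of addr. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_BYTES - 1);
}

/* Two addresses conflict at line granularity when they share a line base,
 * even if they name different words. */
static int same_line(uintptr_t a, uintptr_t b)
{
    return line_base(a) == line_base(b);
}
```

With these assumptions, words at 0x1000 and 0x103c share a line while 0x1040 starts the next one, which is the x/y situation in the steps below.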

Steps leading to an error:

  Step 1: thread 1's transaction trA writes x

  Step 2: thread 2's non-transactional code reads y.

          Unfortunately, word x and word y share the same cache line.

  Step 3: thread 1 allows thread 2's read request.

  Step 4: thread 2 starts a new transaction trB.

  Step 5: trB reads line x, but does not report to L2 because the line is in shared state.

          It gets a simple line hit. Therefore trA is unaware of the conflict.

  Step 6: trA aborts

  Step 7: trB writes z = x + y and commits. Word z is on a different cache line.

Solution:

  When a non-TM read/write request hits the write-set perfect filter of a TM thread,

  the transaction searches the undo log from the front for the old value and sends it to the requestor instead

  of sending the value from the L1 cache.

  After sending the old value, the transaction removes the conflicting address from

  the undo log and the write set, and then aborts.

  Note: Sending the old value and removing the conflicting address from the undo log

        and the write set matters because the abort must not later overwrite what the non-TM

        code might have written at the time of the conflict. Sending the old

        log value is also practiced in STM compilers.
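
In software terms, the lookup and removal could be sketched as below. This is a simplified model and not the Ruby protocol change itself: struct undo_rec, take_old_value and unroll_remaining are hypothetical, the log is kept per word rather than per 16-word line, new records are assumed to be appended at the end, and the write-set filter update is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-granularity undo record (EETM logs whole lines). */
struct undo_rec {
    uintptr_t addr;          /* address that was overwritten */
    unsigned int old_value;  /* value before the transactional write */
    int valid;               /* 0 once removed from the log */
};

/* Search from the front so the first match is the oldest, pre-transaction
 * value. Store it in *out and drop every record for addr so a later abort
 * does not restore it. Returns 1 if addr was in the log, else 0. */
static int take_old_value(struct undo_rec *log, size_t n,
                          uintptr_t addr, unsigned int *out)
{
    size_t i;
    int found = 0;
    assert(log != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (!log[i].valid || log[i].addr != addr)
            continue;
        if (!found) {
            *out = log[i].old_value;
            found = 1;
        }
        log[i].valid = 0;
    }
    return found;
}

/* Abort: restore the remaining valid records, newest first. */
static void unroll_remaining(const struct undo_rec *log, size_t n)
{
    size_t i;
    for (i = n; i > 0; i--)
        if (log[i - 1].valid)
            *(unsigned int *)log[i - 1].addr = log[i - 1].old_value;
}
```

The value returned through out is what would be sent to the non-TM requestor. Because the matching records are invalidated, a non-TM write that lands before the abort survives the unroll.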

Related Files and Code:

  // ruby/protocols/MESI_CMP_filter_directory-L1cache.sm

  transition(M, Fwd_GETX, I)

  {

    d_sendDataToRequestor;

    l_popRequestQueue;

  }

  transition(M, Inv, I)

  {

    f_sendDataToL2;

    l_popRequestQueue;

  }

 

 

- Byong-Wu Chong

 


_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users