Re: [Gems-users] Possible Bugs in EETM


Date: Fri, 25 Sep 2009 18:57:11 -0500
From: Polina Dudnik <pdudnik@xxxxxxxxx>
Subject: Re: [Gems-users] Possible Bugs in EETM
Ha, yes, I thought something must be off given that this worked for us and it's pretty basic. OK, thanks, we will check the rest.

Polina

On Fri, Sep 25, 2009 at 6:55 PM, BYONG WU CHONG <ByongWu.Chong@xxxxxxxx> wrote:

Polina,

 

After talking with an expert, it seems that Bug #1 isn’t a bug.

I overlooked the branch delay slot: the mov placed after the call executes before control reaches tm_trap_handler.

It seems the new Bug #4 below was what led me to think that Bug #1 was the problem.

Please ignore Bug #1.

 

Bug #4:

  The alternate global register is used instead of the normal global register.

  Please note that this bug depends on the Simics target system settings.

Related Files and Code:

  // ruby/log_tm/TransactionInterfaceManager.C
  void TransactionInterfaceManager::trapToHandler(int thread){
    int logical_proc_no = getLogicalProcID(thread);

    // We temporarily flip the LSB of PSTATE so that simics can access the program
    // global registers instead of the alternate globals. Note that we are currently
    // in a system trap.
    int pstate_rn_no = SIMICS_get_register_number(logical_proc_no, "pstate");
    uint64 pstate_val = SIMICS_read_register(logical_proc_no, pstate_rn_no);
    SIMICS_write_register(logical_proc_no, pstate_rn_no, (pstate_val ^ 0x1));  // <-- target line

    int tid_rn_no = SIMICS_get_register_number(logical_proc_no, "g2");
    SIMICS_write_register(logical_proc_no, tid_rn_no, m_tid[thread]);
    SIMICS_write_register(logical_proc_no, pstate_rn_no, pstate_val);

Reason:

  To send the current thread id to the trap handler, GEMS writes it to the g2 register, as in the code above.
  To clear the AG bit (alternate globals, bit 0 of PSTATE), it XORs the original pstate value with 1.
  However, on a system where the AG bit is already 0, the target line does exactly the opposite: it sets the bit,
  so the write goes to the alternate g2 instead of the normal g2. The target may therefore not receive the
  correct thread id when the system is configured with the AG bit at zero. (Mine was configured that way.)

Solution:

  Use (& 0xFFFFFFFE) instead of (^ 0x1), so that the AG bit is cleared regardless of its initial value.
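
The difference between the two operations can be checked in isolation. The following is a minimal sketch in plain C, not GEMS code: the function names and the PSTATE_AG macro are made up for illustration, and it assumes, as described above, that AG is bit 0 of PSTATE. It uses ~0x1ULL rather than 0xFFFFFFFE as the mask, on the assumption that a 64-bit pstate_val should keep its upper bits.

```c
#include <stdint.h>

/* Assumption taken from the report above: AG is bit 0 of PSTATE. */
#define PSTATE_AG 0x1ULL

/* Original behaviour: toggles AG, so the result depends on the old value. */
uint64_t pstate_toggle_ag(uint64_t pstate)
{
    return pstate ^ PSTATE_AG;
}

/* Proposed behaviour: always clears AG and leaves every other bit alone. */
uint64_t pstate_clear_ag(uint64_t pstate)
{
    return pstate & ~PSTATE_AG;
}
```

With AG initially 0, pstate_toggle_ag(0x16) yields 0x17 (AG set, the failing case), whereas pstate_clear_ag yields 0x16 for both 0x16 and 0x17.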

 

Thanks,

 

- Byong-Wu Chong

 

 

From: gems-users-bounces@xxxxxxxxxxx [mailto:gems-users-bounces@xxxxxxxxxxx] On Behalf Of Polina Dudnik
Sent: Friday, September 25, 2009 3:04 PM
To: Gems Users
Subject: Re: [Gems-users] Possible Bugs in EETM

 

Hi Byong-Wu,

Thanks for your input, we will check it out and get back to you with whether or not those were indeed bugs. Thanks.

Polina

On Thu, Sep 24, 2009 at 8:56 PM, BYONG WU CHONG <bernard.chong@xxxxxxxx> wrote:

Hello,

 

Here are some critical bugs I found in GEMS EETM. I verified that the fixes make EETM more stable by running GEMS simulations of the STAMP benchmarks.

I thought HTM researchers might be interested in these.

 

 

Bug #1:

  The thread id is not passed to the tm_trap_handler function call.

Related Files and Code:

  // microbenchmarks/transactional/common/transaction.c
  void tm_trap_handler(int threadID){
    ...
  }
  ...
  void transaction_manager_stub(int dummy){
    ...
        BEGIN_ESCAPE
        asm volatile(   \
              "call %1\n"        \
              "mov %%g2, %%o0\n" \
              "mov %%g3, %0\n"   \
              :"=r"(restart)
              :"r"(&tm_trap_handler)
              :"%o0", "%o7"
              );
    ...
  }

Reason:

  The GCC inline assembly in transaction_manager_stub() does not correctly pass the thread id
  (the value of register g2 set by GEMS) to tm_trap_handler.

Solution:

  The second assembly line, "mov %%g2, %%o0\n", should be moved before the "call %1\n" line.

 

 

Bug #2:

  The software abort trap handler unrolls the log over its own stack space.

Reason:

  Eager VM logs the thread's stack space, which causes trouble when log unrolling touches the abort trap
  handler's own stack space. The handler overwrites its own stack with log data and then behaves almost randomly:
  it may write another log's data or treat an arbitrary memory region as the log.

Steps leading to an error:

  1.  A transaction enters a function and writes some data to the heap and the stack.
  2.  The data gets written to the stack, the heap, and the log.
  3.  The transaction exits the function and the stack shrinks.
  4.  The transaction gets aborted due to a conflict in the heap area.
  5.  The software abort trap handler starts log unrolling from the shrunken stack pointer.
  6.  The undo log space and the trap handler's stack space collide.
  7.  The trap handler unrolls the log and overwrites itself.
  8.  Corruption in the trap handler's stack space causes random log unrolling.
  9.  The system goes haywire.

Solution:

  1.  Record the stack pointer just before trap handling starts.
  2.  Skip an undo log entry when its target address lies in the trap handler's stack area,
      that is, when the target address is between the current and the old stack pointer.

Related Files and Code:

  // ruby/microbenchmarks/transactional/common/transaction.c
  void tm_unroll_log_entry(unsigned int* entry){
    int k;

    unsigned int *address = (unsigned int *) (*(entry+16) & ADDRESS_MASK);

    for (k = 0; k < 16; k++){
      // NOTE: This should be the place for checking writing to itself.
      unsigned int data = *(entry + k);
      *address = data;
      address++;
    }
  }
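
The check marked by the NOTE above could look roughly like the following. This is only a sketch under stated assumptions, not the patch that was actually applied: in_handler_stack and unroll_entry_guarded are hypothetical names, the two stack pointers are passed in explicitly, the stack is assumed to grow downward, and the half-open range [current_sp, old_sp) is a guess at the exact bounds.

```c
#include <stdint.h>

#define WORDS_PER_ENTRY 16

/* Returns nonzero when addr lies in [current_sp, old_sp).
 * Assumes a downward-growing stack, so the handler's frames sit between
 * its current stack pointer and the one recorded at trap entry. */
int in_handler_stack(uintptr_t addr, uintptr_t current_sp, uintptr_t old_sp)
{
    return addr >= current_sp && addr < old_sp;
}

/* Restores the logged words to target, skipping any word that would land
 * in the handler's own stack. Returns the number of words restored. */
int unroll_entry_guarded(const uint32_t *logged, uint32_t *target,
                         uintptr_t current_sp, uintptr_t old_sp)
{
    int restored = 0;
    for (int k = 0; k < WORDS_PER_ENTRY; k++) {
        if (in_handler_stack((uintptr_t)&target[k], current_sp, old_sp))
            continue;   /* would overwrite the handler itself */
        target[k] = logged[k];
        restored++;
    }
    return restored;
}
```

Checking each word rather than each entry keeps the rest of a logged line restorable when only part of it overlaps the handler's stack.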

 

 

Bug #3:

  The transaction is not aborted when it conflicts with non-TM code or an escape action.

Reason:

  Because of the 16-word cache line granularity, a false conflict is possible between a transaction
  and non-transactional code. The current GEMS version simply allows non-TM code to read the
  transactionally isolated line.

Steps leading to an error:

  Step 1: Thread 1's transaction trA writes x.
  Step 2: Thread 2's non-transactional code reads y.
          Unfortunately, word x and word y share the same cache line.
  Step 3: Thread 1 allows thread 2's read request.
  Step 4: Thread 2 starts a new transaction trB.
  Step 5: trB reads line x, but does not report to L2 because the line is in shared state.
          It gets a simple line hit, so trA is unaware of the conflict.
  Step 6: trA aborts.
  Step 7: trB writes z = x + y and commits. Word z is on a different cache line.
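
Whether two words can falsely conflict as in Step 2 comes down to whether their addresses map to the same line. A small sketch of that test follows; the 4-byte word size is my assumption (giving 64-byte lines for the 16-word granularity), and line_of and same_cache_line are names invented for this example.

```c
#include <stdint.h>

#define WORD_BYTES     4u   /* assumption: 32-bit words */
#define WORDS_PER_LINE 16u  /* 16-word granularity, as described above */
#define LINE_BYTES     (WORD_BYTES * WORDS_PER_LINE)

/* Index of the cache line that contains addr. */
uintptr_t line_of(uintptr_t addr)
{
    return addr / LINE_BYTES;
}

/* Nonzero when a and b fall on the same line and can therefore conflict
 * even though they are different words. */
int same_cache_line(uintptr_t a, uintptr_t b)
{
    return line_of(a) == line_of(b);
}
```

Under these assumptions, addresses 0x1000 and 0x103C share a line, while 0x1000 and 0x1040 do not.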

Solution:

  When a non-TM read/write request arrives and hits in a TM thread's write-set perfect filter,
  the transaction searches its undo log from the front for the old value and sends that to the requestor
  instead of the value from the L1 cache.
  After sending out the old value, the transaction removes the conflicting address from
  the undo log and the write set. The transaction then aborts.

  Note: Sending out the old value and removing the conflicting address from the undo log
        and the write set is important because we don't want to later overwrite what non-TM code
        might have written at the time of the conflict. Also note that sending out the old
        log value is a practice used by STM compilers, too.
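
The log lookup and removal described in the solution can be sketched as follows. This is an illustration only: log_entry_t is a hypothetical record layout (GEMS stores its log differently, as the unroll code under Bug #2 shows), and find_oldest and remove_line are names I made up. The write-set update and the abort itself are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 16

/* Hypothetical undo-log record: one cache line of old data plus its address. */
typedef struct {
    uintptr_t line_addr;
    uint32_t  old_data[WORDS_PER_LINE];
} log_entry_t;

/* Scans from the front so that the first (oldest) record for the line wins;
 * that record holds the pre-transaction contents. Returns NULL if absent. */
const log_entry_t *find_oldest(const log_entry_t *log, size_t n,
                               uintptr_t line_addr)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].line_addr == line_addr)
            return &log[i];
    return NULL;
}

/* Drops every record for the line, compacting in place, so that a later
 * unroll cannot overwrite what the non-TM code wrote. Returns the new length. */
size_t remove_line(log_entry_t *log, size_t n, uintptr_t line_addr)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (log[i].line_addr != line_addr)
            log[kept++] = log[i];
    return kept;
}
```

The caller would send find_oldest(...)->old_data to the requestor first and only then call remove_line, since removal invalidates the returned pointer.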

Related Files and Code:

  // ruby/protocols/MESI_CMP_filter_directory-L1cache.sm
  transition(M, Fwd_GETX, I)
  {
    d_sendDataToRequestor;
    l_popRequestQueue;
  }

  transition(M, Inv, I)
  {
    f_sendDataToL2;
    l_popRequestQueue;
  }

 

 

- Byong-Wu Chong

 


_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.

 

