[Gems-users] Possible Bugs in EETM

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date:	Thu, 24 Sep 2009 19:56:40 -0600
From:	BYONG WU CHONG <bernard.chong@xxxxxxxx>
Subject:	[Gems-users] Possible Bugs in EETM

Hello,

Here are some critical bugs I found in GEMS EETM. I verified that the fixes below make EETM more stable by running GEMS simulations on the STAMP benchmarks.

I thought HTM researchers might be interested in these.

Bug #1:

The thread ID is not passed to the tm_trap_handler function call.

Related Files and Code:

// microbenchmarks/transactional/common/transaction.c

void tm_trap_handler(int threadID){

...

}

...

void transaction_manager_stub(int dummy){

...

BEGIN_ESCAPE

asm volatile( \

"call %1\n" \

"mov %%g2, %%O0\n" \

"mov %%g3, %0\n" \

:"=r"(restart)

:"r"(&tm_trap_handler)

:"%o0", "%o7"

);

...

}

Reason:

The GCC inline assembly in transaction_manager_stub() does not correctly pass the thread ID (register g2, set by GEMS) to tm_trap_handler.

Solution:

The second assembly line of "mov %%g2, %%o0\n" should be moved before the "call %1\n" line.

Bug #2:

The software abort trap handler unrolls the log into its own stack space.

Reason:

Eager VM logs writes to the thread's stack space, which causes trouble when log unrolling touches the abort trap handler's own stack space. The handler overwrites its own stack with log data and then behaves almost randomly.

It may write another log's data or treat an arbitrary memory region as the log.

Steps leading to an error:

1. A transaction enters a function and writes some data to the heap and the stack.

2. The data is written to the stack, the heap and the log.

3. The transaction exits the function and the stack shrinks.

4. The transaction is aborted due to a conflict in the heap area.

5. The software abort trap handler starts log unrolling, running on the stack at the shrunken stack pointer.

6. The undo log space and the trap handler's stack space collide.

7. The trap handler unrolls the log and overwrites itself.

8. Corruption in the trap handler's stack space causes random log unrolling.

9. System goes haywire.

Solution:

1. Record the stack pointer just before trap handling starts.

2. Skip an undo log entry when its target address lies in the trap handler's stack area.

This is the case when the target address is between the current and the old stack pointer.

Related Files and Code:

// ruby/microbenchmarks/transactional/common/transaction.c

void tm_unroll_log_entry(unsigned int* entry){

int k;

unsigned int *address = (unsigned int *) (*(entry+16) & ADDRESS_MASK);

for (k = 0; k < 16; k++){

// NOTE: This should be the place for checking writing to itself.

unsigned int data = *(entry + k);

*address = data;

address++;

}

}
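The check described in the solution above could look roughly like the following. This is only a sketch under assumptions, not the actual patch: the names in_handler_stack and unroll_entry_guarded, and the handler_sp / old_sp parameters, are hypothetical, the stack is assumed to grow downward, and the target address is passed in separately instead of being decoded from entry+16.

```c
#include <stdint.h>

#define WORDS_PER_ENTRY 16

/* Hypothetical helper: nonzero if addr lies in [handler_sp, old_sp),
 * i.e. in the stack region the trap handler itself is using.
 * Assumes a downward-growing stack, so handler_sp < old_sp. */
static int in_handler_stack(uintptr_t addr, uintptr_t handler_sp,
                            uintptr_t old_sp)
{
    return addr >= handler_sp && addr < old_sp;
}

/* Sketch of a guarded unroll: restore the 16 logged words to target,
 * but skip every word that would land in the handler's own stack.
 * Returns the number of words actually restored. */
static int unroll_entry_guarded(const unsigned int *data,
                                unsigned int *target,
                                uintptr_t handler_sp, uintptr_t old_sp)
{
    int restored = 0;
    for (int k = 0; k < WORDS_PER_ENTRY; k++) {
        uintptr_t addr = (uintptr_t)(target + k);
        if (in_handler_stack(addr, handler_sp, old_sp))
            continue;               /* would overwrite the handler itself */
        target[k] = data[k];
        restored++;
    }
    return restored;
}
```

Here old_sp would be the stack pointer recorded just before trap handling starts, and handler_sp the stack pointer at the time of unrolling.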

Bug #3:

The transaction is not aborted when it conflicts with non-TM code or an escape action.

Reason:

Because conflicts are detected at a granularity of 16-word cache lines, false conflicts between a transaction and non-transactional code are possible. The current GEMS version simply allows non-TM code to read the transactionally isolated line.
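To illustrate the false conflict, here is a small sketch of how two different words map to the same line. It assumes 4-byte words (so a 16-word line is 64 bytes); the helper names line_of and same_line are made up for this example and are not GEMS code.

```c
#include <stdint.h>

#define WORD_BYTES 4u                    /* assumption: 32-bit words */
#define LINE_WORDS 16u                   /* 16 words per cache line */
#define LINE_BYTES (WORD_BYTES * LINE_WORDS)

/* Index of the cache line that contains the byte address addr. */
static uint32_t line_of(uint32_t addr)
{
    return addr / LINE_BYTES;
}

/* Nonzero if both addresses fall in the same cache line, which is
 * enough to raise a conflict even when the words differ. */
static int same_line(uint32_t a, uint32_t b)
{
    return line_of(a) == line_of(b);
}
```

With these assumptions, words at 0x1000 and 0x103C share a line, while 0x1040 starts the next one.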

Steps leading to an error:

Step 1: thread 1's transaction trA writes x.

Step 2: thread 2's non-transactional code reads y.

Unfortunately, words x and y share the same cache line.

Step 3: thread 1 allows thread 2's read request.

Step 4: thread 2 starts a new transaction trB.

Step 5: trB reads x, but the read is not reported to L2 because the line is in shared state.

It is a simple cache hit, so trA is unaware of the conflict.

Step 6: trA aborts.

Step 7: trB writes z = x + y and commits. Word z is on a different cache line.

Solution:

When a non-TM read/write request arrives and hits in the write-set perfect filter of a TM thread, the transaction searches the undo log from the front for the old value and sends it to the requestor instead of the value in the L1 cache.

After sending the old value, the transaction removes the conflicting address from the undo log and the write set, and then aborts.

Note: Sending the old value and removing the conflicting address from the undo log and the write set is important because the abort must not later overwrite whatever the non-TM code wrote at the time of the conflict. Sending the old logged value is also common practice in STM compilers.
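The lookup and removal could be sketched as follows. The log layout here (the undo_entry struct with a valid flag) and the function names are hypothetical and much simpler than the real GEMS log; removing the address from the write set and aborting are left to the caller.

```c
#include <stddef.h>
#include <string.h>

#define LINE_WORDS 16

/* Hypothetical undo log entry: one cache line of pre-write data. */
typedef struct {
    unsigned int line_addr;             /* line-aligned target address */
    unsigned int old_data[LINE_WORDS];  /* contents before the write */
    int valid;                          /* 0 once removed from the log */
} undo_entry;

/* Search from the front: the first entry for a line holds the value
 * from before the transaction's first write to that line. */
static undo_entry *find_oldest(undo_entry *log, size_t n,
                               unsigned int line_addr)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].valid && log[i].line_addr == line_addr)
            return &log[i];
    return NULL;
}

/* Copy the old line into reply and drop every entry for that line,
 * so a later abort does not overwrite what non-TM code wrote.
 * Returns 1 if the line was found in the log, 0 otherwise. */
static int serve_non_tm_request(undo_entry *log, size_t n,
                                unsigned int line_addr,
                                unsigned int *reply)
{
    undo_entry *e = find_oldest(log, n, line_addr);
    if (e == NULL)
        return 0;
    memcpy(reply, e->old_data, sizeof e->old_data);
    for (size_t i = 0; i < n; i++)
        if (log[i].line_addr == line_addr)
            log[i].valid = 0;
    return 1;
}
```

Searching from the front matters when the same line was logged more than once: only the earliest entry holds the pre-transaction value.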

Related Files and Code:

// ruby/protocols/MESI_CMP_filter_directory-L1cache.sm

transition(M, Fwd_GETX, I)

{

d_sendDataToRequestor;

l_popRequestQueue;

}

transition(M, Inv, I)

{

f_sendDataToL2;

l_popRequestQueue;

}

- Byong-Wu Chong
