[Gems-users] Fwd: Re: LogTM not aborting


Date: Wed, 12 Dec 2007 02:52:19 -0500
From: imukherj@xxxxxxxxxxxxxxxx
Subject: [Gems-users] Fwd: Re: LogTM not aborting


----- Forwarded message from imukherj@xxxxxxxxxxxxxxxx -----
   Date: Tue, 11 Dec 2007 00:39:19 -0500
   From: imukherj@xxxxxxxxxxxxxxxx
Reply-To: imukherj@xxxxxxxxxxxxxxxx
Subject: Re: [Gems-users] LogTM not aborting
     To: Jayaram Bobba <bobba@xxxxxxxxxxx>

Hi Jayaram,
Thanks a lot for your reply. I tried to find out what happens to the first thread, and so extracted some other messages in the log file that were about something other than the second thread being NACKed by the first. Here are the last few debug messages of this kind, in reverse order of time, and not necessarily consecutive (most of the lines have a timestamp at the beginning, so you can roughly make out when things happened). It seems like something has gone wrong, but we are new to simics and couldn't figure out the cause. Any help will be highly appreciated. We were using scripts to automate our simulations.


debugger entered.
idle_other_cpus: cpu id 3 failed to stop: state 8
ERROR: Non-printable char (0x0) on console.
idle_a_cpu: grab_cpu 3 failed
ERROR: Non-printable char (0x0) on console.
idle_other_cpus: cpu id 2 failed to stop: state 1
ERROR: Non-printable char (0x0) on console.
idle_a_cpu: grab_cpu 2 failed
ERROR: Non-printable char (0x0) on console.
idle_other_cpus: cpu id 0 failed to stop: state 8
ERROR: Non-printable char (0x0) on console.
idle_a_cpu: grab_cpu 0 failed
ERROR: Non-printable char (0x0) on console.
Type 'go' to resume
panic: entering debugger (continue to save dump)

 %l4-7: 0000000000000000 0000000000000000 00000000ff3a2400 000000007e76b885
 %l0-3: 0000000000000000 0000000000012000 0000000000000000 0000000000000000
000002a10034fa20 unix:cbe_level1+8 (0, 0, 30002e64000, 2a10034fd98, 400740, 100c274)
 %l4-7: 00000300028ee5b0 00000000011ede94 0000000000000000 00000000000004b6
 %l0-3: 00000300028ee5c0 00000300001d50e8 0000000000000000 0000000000000003
000002a10034f960 genunix:cyclic_softint+a4 (300001d50f0, 3000277e330, 0, 300001d50c0, 300028ee5b8, 3000277e330)
 %l4-7: 00000300028ee5c4 0000000000000007 0000000001884a5c 000002a100347de8
 %l0-3: 0000000000000004 00000300028ee5c0 000003000277e330 0000000000000007
000002a10034f8b0 SUNW,UltraSPARC-III+:cheetah_nudge_buddy+50 (0, 300028be000, 30002e64000, 30002e64000, 0, 2)
 %l4-7: 0000000000000000 0000000000000004 0000000000000002 000002a10034f8b0
 %l0-3: 0000000000000001 0000000001884f24 0000000000000001 000002a10034f868
000002a10034f7b0 unix:xt_one_unchecked+c8 (2, 1039bec, 0, 0, 0, 1)
 %l4-7: 0000000000000000 00000000011ff800 00000000f60dd456 00000000f60dd471
 %l0-3: 0000000000000000 0002000000000000 0000000000000001 0000000000441516
000002a10034f700 SUNW,UltraSPARC-III+:send_one_mondo+160 (2, 2, df03, 2, 18adb88, 1)


panic: failed to stop cpu2
170695021 1 [1,0] TID 0 XACT LOAD L1 OVERFLOW 0 ADDR [0x224401c0, line 0x224401c0] 145223587 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x18a3ff40, line 0x18a3ff40] 145222751 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x18a3ff80, line 0x18a3ff80] 120134775 1 [1,0] TID 0 XACT LOAD L1 OVERFLOW 0 ADDR [0x1a06f740, line 0x1a06f740] 120130606 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x1a06f880, line 0x1a06f880] 120129774 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x1a06f8c0, line 0x1a06f8c0] 120128941 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x1a06f900, line 0x1a06f900] 120128106 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x1a06f940, line 0x1a06f940] 119892717 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x18a3ff80, line 0x18a3ff80] 119679808 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x18a3ff40, line 0x18a3ff40] 119671353 1 [1,0] TID 0 XACT LOAD L1 OVERFLOW 0 ADDR [0x224401c0, line 0x224401c0] 119171873 1 [1,0] TID 0 XACT STORE L1 OVERFLOW 0 ADDR [0x1a06f940, line 0x1a06f940]
25267859 1 [1,0] Begin ESCAPE ACTION - ESCAPE DEPTH: 1
25267859 1 [1,0] TRAP TO HANDLER: TID: 0 TRAP_TYPE 5 TRAP ADDRESS 0x0 NUM_RETRIES 0 LOG_SIZE 476 XACT_LEVEL 1 XACT_LOWEST_CONFLICT_LEVEL 1 Handler Address = [0x12718, line 0x12700] PC = [0x100707c, line 0x1007040] 25267855 1 [1,0] CAN RETIRE [0x1a06f750, line 0x1a06f740] RESULT: 1 PC = [0x11378, line 0x11340] 25267855 1 [1,0] SETTING INTERRUPT TRAP 25267855 1 [1,0] READY TO RETIRE MEM REF[0x1a06f750, line 0x1a06f740] type LD_XACT PC = [0x11378, line 0x11340] 25267855 1 [1,0] COMPLETED MEM REF[0x1a06f750, line 0x1a06f740] type LD_XACT PC = [0x11378, line 0x11340] 25267855 1 [1,0] ISOLATE XACT LOAD VA: [0x23740, line 0x23740] PA: [0x1a06f740, line 0x1a06f740] XACT LEVEL: 1 PC = [0x11378, line 0x11340] 25267855 1 [1,0] TID 0 XACT LOAD 0 [0x1a06f750, line 0x1a06f740] VA [0x23750, line 0x23740] PC [0x11378, line 0x11340] VAL 0x3e8 25267851 1 [1,0] CAN RETIRE [0x18a3ff7c, line 0x18a3ff40] RESULT: 0 PC = [0x11374, line 0x11340] 25267851 1 [1,0] READY TO RETIRE MEM REF[0x18a3ff7c, line 0x18a3ff40] type LD_XACT PC = [0x11374, line 0x11340] 25267851 1 [1,0] COMPLETED MEM REF[0x18a3ff7c, line 0x18a3ff40] type LD_XACT PC = [0x11374, line 0x11340] 25267851 1 [1,0] ISOLATE XACT LOAD VA: [0xff1fbf40, line 0xff1fbf40] PA: [0x18a3ff40, line 0x18a3ff40] XACT LEVEL: 1 PC = [0x11374, line 0x11340] 25267851 1 [1,0] TID 0 XACT LOAD 0 [0x18a3ff7c, line 0x18a3ff40] VA [0xff1fbf7c, line 0xff1fbf40] PC [0x11374, line 0x11340] VAL 0x3e3 25267846 1 [1,0] CAN RETIRE [0x18a3ff7c, line 0x18a3ff40] RESULT: 0 PC = [0x113bc, line 0x11380] 25267846 1 [1,0] LOGGING STORE: [0xff1fbf40, line 0xff1fbf40] 0 PC = [0x113bc, line 0x11380] 25267846 1 [1,0] READY TO RETIRE MEM REF[0x18a3ff7c, line 0x18a3ff40] type ST_XACT PC = [0x113bc, line 0x11380] 25267846 1 [1,0] COMPLETED MEM REF[0x18a3ff7c, line 0x18a3ff40] type ST_XACT PC =


Thanks,
Indraneel

Quoting Jayaram Bobba <bobba@xxxxxxxxxxx>:

Since possibleCycle is 0, it means that there is no possible TM deadlock
(2 is waiting on 1 but 1 is not waiting on 2).
However processor 1 might be doing something weird.
Can you determine what the remote thread is doing?
Could you disassemble the remote PC 0xf000613c and see what it
corresponds to?

Jayaram

Indraneel Mukherjee wrote:
Hi all,
  I recently wrote a stresscode for logTM on simics-2.2.19 and
gems-2.0, on a 4 processor machine. The stresscode runs two threads
generating many conflicting transactions that would deadlock unless
one of the transactions aborts. On running it on logTM with the
XACT_CONFLICT_RES flag set to BASE or TIMESTAMP, however, conflicts
are detected but none of the transactions are aborted. As such the
code runs on indefinitely; a similar code that uses locks to replace
transactions by serialized access to the critical section runs in
about 20 minutes. Does anyone have any idea why logTM is not
aborting?The last few lines of debugging output looks like the
following when the stresscode is run for a long time:

1067372669   2 [2,0] TID 1 XACT NACK 0 by 1 [ 1, 0 ] NID: 0
[0x1a06f898, line 0x1a06f880] VA [0x0, line 0x0]  PC [0x112ec, line
0x112c0]  my_ts 1067372155  nack_ts 23405901 possible_cycle 0
 1067372670 2 [2,0] CONFLICTING REQUEST [0x1a06f898, line 0x1a06f880]
possibleCycle: 0 shouldTrap: 0
 1067373184 2 [2,0] RECEIVED NACK [0x1a06f898, line 0x1a06f880]
remote_id: 1 [1,0] myPC: [0x112ec, line 0x112c0] remotePC:
[0xf000613c, line 0xf0006100] local_timestamp: 1067372670
remote_timestamp: 23405901
1067373184   2 [2,0] TID 1 XACT NACK 0 by 1 [ 1, 0 ] NID: 0
[0x1a06f898, line 0x1a06f880] VA [0x0, line 0x0]  PC [0x112ec, line
0x112c0]  my_ts 1067372670  nack_ts 23405901 possible_cycle 0
 1067373185 2 [2,0] CONFLICTING REQUEST [0x1a06f898, line 0x1a06f880]
possibleCycle: 0 shouldTrap: 0
 1067373699 2 [2,0] RECEIVED NACK [0x1a06f898, line 0x1a06f880]
remote_id: 1 [1,0] myPC: [0x112ec, line 0x112c0] remotePC:
[0xf0031298, line 0xf0031280] local_timestamp: 1067373185
remote_timestamp: 23405901
1067373699   2 [2,0] TID 1 XACT NACK 0 by 1 [ 1, 0 ] NID: 0
[0x1a06f898, line 0x1a06f880] VA [0x0, line 0x0]  PC [0x112ec, line
0x112c0]  my_ts 1067373185  nack_ts 23405901 possible_cycle 0
 1067373700 2 [2,0] CONFLICTING REQUEST [0x1a06f898, line 0x1a06f880]
possibleCycle: 0 shouldTrap: 0
 1067374214 2 [2,0] RECEIVED NACK [0x1a06f898, line 0x1a06f880]
remote_id: 1 [1,0] myPC: [0x112ec, line 0x112c0] remotePC:
[0xf0031dec, line 0xf0031dc0] local_timestamp: 1067373700
remote_timestamp: 23405901
1067374214   2 [2,0] TID 1 XACT NACK 0 by 1 [ 1, 0 ] NID: 0
[0x1a06f898, line 0x1a06f880] VA [0x0, line 0x0]  PC [0x112ec, line
0x112c0]  my_ts 1067373700  nack_ts 23405901 possible_cycle 0
 1067374215 2 [2,0] CONFLICTING REQUEST [0x1a06f898, line 0x1a06f880]
possibleCycle: 0 shouldTrap: 0
 1067374729 2 [2,0] RECEIVED NACK [0x1a06f898, line 0x1a06f880]
remote_id: 1 [1,0] myPC: [0x112ec, line 0x112c0] remotePC:
[0xf00306c4, line 0xf00306c0] local_timestamp: 1067374215
remote_timestamp: 23405901
1067374729   2 [2,0] TID 1 XACT NACK 0 by 1 [ 1, 0 ] NID: 0
[0x1a06f898, line 0x1a06f880] VA [0x0, line 0x0]  PC [0x112ec, line
0x112c0]  my_ts 1067374215  nack_ts 23405901 possible_cycle 0
 1067374730 2 [2,0] CONFLICTING REQUEST [0x1a06f898, line 0x1a06f880]
possibleCycle: 0 shouldTrap: 0
 1067375244 2 [2,0] RECEIVED NACK [0x1a06f898, line 0x1a06f880]
remote_id: 1 [1,0] myPC: [0x112ec, line 0x112c0] remotePC:
[0xf0030eb0, line 0xf0030e80] local_timestamp: 1067374730
remote_timestamp: 23405901

In short, the same thread keeps receiving NACKs within a transaction
and never proceeds.
I'm also including my stresscode, just in case:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <pthread.h>
#include "transaction.h"

#define MAGIC_BREAK do {                                   \
       __asm__ __volatile__ ("sethi 0x40000, %g0");     \
} while (0);

#define ARR_SIZE 1000000

int num_loops;
int arr_size;
int reps;
int array[ARR_SIZE];

void* slaveStart(void *id){

  int myid;
  int tmp;
  int i,j,k;
  int local_sense = 0;
  int dummy;

  myid = *((int*)id);

  tm_bind_to_cabinet(myid+1);
  Barrier_breaking(&local_sense, myid, 2);
  set_transaction_registers(myid);

  /* warm-up caches */
  for (i=0; i<arr_size; ++i)
    dummy = array[i];

  for (i=0; i<num_loops; ++i){
    BEGIN_WORKLOAD_TRANSACTION
    BEGIN_TRANSACTION(0)
      if (myid == 0)
        for (j=0;j<arr_size;++j)
          for (k=0;k<reps;++k)
            ++array[j];
      else
        for (j=arr_size-1;j>=0;--j)
          for (k=0;k<reps;++k)
            --array[j];
    COMMIT_TRANSACTION(0)
    END_WORKLOAD_TRANSACTION
      }

  Barrier_breaking(&local_sense, myid, 2);
}

int main(int argc, char **argv){

  int i;
  pthread_t threads[2];
  pthread_attr_t pthread_custom_attr;
  int  id[2] = {0,1};


  if (argc < 4) {
    fprintf (stderr, "./aborter loop-size array-size reps");
    return (1);
  }

  num_loops = atoi(argv[1]);
  arr_size  = atoi(argv[2]);
  reps = atoi(argv[3]);

  // init array
  for (i=0; i<ARR_SIZE; ++i)
    array[i] = 0;


  init_transaction_state(2);    // #threads = 2
  MAGIC_BREAK ;
  pthread_attr_init(&pthread_custom_attr); // get default attributes
  pthread_create(&threads[0], &pthread_custom_attr, slaveStart,
(void*)&id[0]);
  pthread_create(&threads[1], &pthread_custom_attr, slaveStart,
(void*)&id[1]);

  // Synchronize on termination
  pthread_join(threads[0],NULL);
  pthread_join(threads[1],NULL);
  MAGIC_BREAK ;

  return (0);
}





------------------------------------------------------------------------

_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/"; to your search.


_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/"; to your search.






----- End forwarded message -----


[← Prev in Thread] Current Thread [Next in Thread→]