I have modified opal to print out traces of all memory
instructions. I call a function of the sequencer within the execute stage
of both memop objects. This function prints out the following:
  - m_local_cycles (which I am currently treating as the time)
  - the address of the instruction
  - whether it is a store
  - the address being accessed
Each sequencer has its own file.
When I merge these files and sort them based on processor/sequencer, I
observe that there are long strings in which only one processor accesses
the cache. For instance, I start fmm -p4 (fast multipole method on four
processors from splash2), and do
    c 1500000
to try to jump past some of the OS stuff. I then load ruby and opal and
initialize them and run
    opal0.sim-step 5000000
This produces several very large traces. But sorting them and grouping
all adjacent memory accesses of the same processor as a single "string"
yields only 32 "strings", with an average length of 103,827 memory
accesses. In other words, it appears that two threads are never executing
at the same time. I get similar behavior from fft. Does anyone have any
idea what I am doing wrong?
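
For reference, the merge-and-group step described above can be sketched
roughly as follows. This is only an illustration in Python, not the actual
script. The record layout (one whitespace-separated line per access:
cycle, instruction address, store flag, data address), the names
parse_line and count_runs, and the choice to order the merged records by
the m_local_cycles value before grouping are all assumptions.

```python
from itertools import groupby


def parse_line(cpu, line):
    """Parse one trace line into (cycle, cpu, pc, is_store, addr).

    Assumed (hypothetical) format: "<cycle> <pc-hex> <0|1> <addr-hex>".
    """
    cycle, pc, is_store, addr = line.split()
    return (int(cycle), cpu, int(pc, 16), is_store == "1", int(addr, 16))


def count_runs(traces):
    """Merge per-sequencer traces and group adjacent same-CPU accesses.

    traces maps a CPU/sequencer id to an iterable of trace lines.
    Returns a list of (cpu, run_length) pairs in time order.
    """
    records = [
        parse_line(cpu, line)
        for cpu, lines in traces.items()
        for line in lines
        if line.strip()  # skip blank lines
    ]
    # Assumption: order by timestamp, breaking ties by CPU id.
    records.sort(key=lambda r: (r[0], r[1]))
    # Each maximal run of consecutive records from one CPU is one "string".
    return [
        (cpu, sum(1 for _ in group))
        for cpu, group in groupby(records, key=lambda r: r[1])
    ]
```

Called on the per-sequencer traces, count_runs returns one (cpu, length)
pair per "string", so 32 pairs would correspond to the 32 "strings"
reported above.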
- Sean