[Gems-users] L1 hit latency


Date: Sat, 11 Nov 2006 10:57:21 -0500
From: "Mojtaba Mehrara" <mehrara@xxxxxxxxx>
Subject: [Gems-users] L1 hit latency
Hi,
I am trying to increase the L1 hit latency in the MSI_MOSI_CMP_directory protocol. I did what Mike said in the following post:
"
The L1_RESPONSE_LATENCY, like most of the specified latencies, is specific to an individual protocol. Adjusting the L1 hit latency is unfortunately not at all straightforward. By default, the L1 hit latency is always 1 cycle. This can be changed by turning off "fast path hits", controlled by the REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH flag. A fast-path hit is where the Ruby sequencer (ruby/sequencer.C) directly checks the permissions in the L1 caches before actually issuing a request to Ruby. If you turn this off, the L1 hit latency can be controlled by the SEQUENCER_TO_CONTROLLER_LATENCY parameter.
 
Sorry this is confusing...hopefully we can clean this up in the future.   
 
--Mike
 "
 
However, I notice almost no change in Ruby_cycles when I increase SEQUENCER_TO_CONTROLLER_LATENCY from 2 to 11.
 
The following are my other parameters. (I have unrealistically set some delays to 1 to minimize their effect.)
 


Ruby Configuration
------------------
protocol: MSI_MOSI_CMP_directory
simics_version: Simics 3.0.22
compiled_at: 02:46:55, Nov 11 2006
RUBY_DEBUG: false
g_RANDOM_SEED: 1
g_DEADLOCK_THRESHOLD: 50000
g_FORWARDING_ENABLED: false
RANDOMIZATION: false
g_SYNTHETIC_DRIVER: false
g_SYNTHETIC_GENERATOR: locks
g_DETERMINISTIC_DRIVER: false
g_FILTERING_ENABLED: false
g_DISTRIBUTED_PERSISTENT_ENABLED: true
g_DYNAMIC_TIMEOUT_ENABLED: true
g_RETRY_THRESHOLD: 1
g_FIXED_TIMEOUT_LATENCY: 300
g_trace_warmup_length: 1000000
g_bash_bandwidth_adaptive_threshold: 0.75
g_tester_length: 0
g_synthetic_locks: 2048
g_deterministic_addrs: 1
g_SpecifiedGenerator: DetermInvGenerator
g_callback_counter: 0
g_NUM_COMPLETIONS_BEFORE_PASS: 0
g_think_time: 5
g_hold_time: 5
g_wait_time: 5
PROTOCOL_DEBUG_TRACE: true
DEBUG_FILTER_STRING: none
DEBUG_VERBOSITY_STRING: none
DEBUG_START_TIME: 0
DEBUG_OUTPUT_FILENAME: none
SIMICS_RUBY_MULTIPLIER: 2
OPAL_RUBY_MULTIPLIER: 2
TRANSACTION_TRACE_ENABLED: false
USER_MODE_DATA_ONLY: false
PROFILE_HOT_LINES: false
PROFILE_ALL_INSTRUCTIONS: false
PRINT_INSTRUCTION_TRACE: false
BLOCK_STC: false
PERFECT_MEMORY_SYSTEM: false
PERFECT_MEMORY_SYSTEM_LATENCY: 0
DATA_BLOCK: false
REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH: true
g_SIMICS: true
L1_CACHE_ASSOC: 4
L1_CACHE_NUM_SETS_BITS: 8
L2_CACHE_ASSOC: 8
L2_CACHE_NUM_SETS_BITS: 10
g_MEMORY_SIZE_BYTES: 4294967296
g_DATA_BLOCK_BYTES: 64
g_PAGE_SIZE_BYTES: 4096
g_NUM_PROCESSORS: 8
g_NUM_L2_BANKS: 4
g_NUM_MEMORIES: 4
g_PROCS_PER_CHIP: 8
g_NUM_CHIPS: 1
g_NUM_CHIP_BITS: 0
g_MEMORY_SIZE_BITS: 32
g_DATA_BLOCK_BITS: 6
g_PAGE_SIZE_BITS: 12
g_NUM_PROCESSORS_BITS: 3
g_PROCS_PER_CHIP_BITS: 3
g_NUM_L2_BANKS_BITS: 2
g_NUM_L2_BANKS_PER_CHIP_BITS: 2
g_NUM_L2_BANKS_PER_CHIP: 4
g_NUM_MEMORIES_BITS: 2
g_NUM_MEMORIES_PER_CHIP: 4
g_MEMORY_MODULE_BITS: 24
g_MEMORY_MODULE_BLOCKS: 16777216
MAP_L2BANKS_TO_LOWEST_BITS: true
DIRECTORY_CACHE_LATENCY: 1
NULL_LATENCY: 1
ISSUE_LATENCY: 2
CACHE_RESPONSE_LATENCY: 1
L2_RESPONSE_LATENCY: 22
L1_RESPONSE_LATENCY: 1
COLLECTOR_REQUEST_LATENCY: 1
MEMORY_RESPONSE_LATENCY_MINUS_2: 118
DIRECTORY_LATENCY: 1
NETWORK_LINK_LATENCY: 1
COPY_HEAD_LATENCY: 1
ON_CHIP_LINK_LATENCY: 1
RECYCLE_LATENCY: 1
L2_RECYCLE_LATENCY: 1
TIMER_LATENCY: 10000
TBE_RESPONSE_LATENCY: 1
PERIODIC_TIMER_WAKEUPS: true
LOG_BASE: 4294967296
RETRY_LATENCY: 100
RESTART_DELAY: 1000
PROFILE_EXCEPTIONS: false
PROFILE_XACT: false
XACT_NUM_CURRENT: 0
XACT_LAST_UPDATE: 0
L1_REQUEST_LATENCY: 1
L2_REQUEST_LATENCY: 1
SINGLE_ACCESS_L2_BANKS: true
SEQUENCER_TO_CONTROLLER_LATENCY: 11
L1CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32
L2CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32
DIRECTORY_TRANSITIONS_PER_RUBY_CYCLE: 32
COLLECTOR_TRANSITIONS_PER_RUBY_CYCLE: 32
g_SEQUENCER_OUTSTANDING_REQUESTS: 16
NUMBER_OF_TBES: 128
NUMBER_OF_MATES: 4
NUMBER_OF_L1_TBES: 32
NUMBER_OF_L2_TBES: 32
FINITE_BUFFERING: false
FINITE_BUFFER_SIZE: 3
PROCESSOR_BUFFER_SIZE: 10
PROTOCOL_BUFFER_SIZE: 32
TSO: false
g_MASK_PREDICTOR_CONFIG: AlwaysBroadcast
g_TOKEN_REISSUE_THRESHOLD: 2
g_PERSISTENT_PREDICTOR_CONFIG: None
g_NETWORK_TOPOLOGY: PT_TO_PT
g_CACHE_DESIGN: NUCA
g_endpoint_bandwidth: 1000
g_adaptive_routing: true
NUMBER_OF_VIRTUAL_NETWORKS: 5
FAN_OUT_DEGREE: 4
g_PRINT_TOPOLOGY: true
g_NUM_DNUCA_BANK_SETS: 32
g_NUM_DNUCA_BANK_SET_BITS: 0
g_NUM_BANKS_IN_BANK_SET_BITS: 0
g_NUM_BANKS_IN_BANK_SET: 0
PERFECT_DNUCA_SEARCH: true
g_NUCA_PREDICTOR_CONFIG: NULL
ENABLE_MIGRATION: false
ENABLE_REPLICATION: false
COLLECTOR_HANDLES_OFF_CHIP_REQUESTS: false
XACT_LENGTH: 0
XACT_SIZE: 0

By tracking down a specific trace in tester.exec, I noticed that L1_REQUEST_LATENCY and L1_RESPONSE_LATENCY are the delays between the L1 and the L2 and have nothing to do with the L1 hit latency itself. Is this correct? (I have tried increasing these two anyway, but I still didn't notice much difference in performance.)

Am I missing something here?

One more thing. As one of the previous posts noted, I tried to get the L1 miss rate by commenting out the following line in system/Sequencer.C:

 // if (!REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH) {
      g_system_ptr->getProfiler()->addPrimaryStatSample(msg, m_chip_ptr->getID());
But the reported miss rates are very high on the Splash2 benchmarks (more than 90%!). Is it possible that this is the source of my problem with the L1 hit latency? If so, what should I do, and how should I measure the actual miss rate?

 

Thanks in advance,

Mojtaba

 