
[Condor-users] odd systemic spike in job sys time, could it be NEGOTIATOR_INTERVAL-related?



Greetings,

I would like to understand whether the following phenomenon has something to do with Condor's NEGOTIATOR_INTERVAL parameter:

- I have N nodes used as Condor execute nodes, with P cores on each node. Condor is configured to enforce core affinity, and each node has a 1:1 mapping between Condor slots and cores. I submit J jobs as a single cluster, using expressions of the form 'requirements = TARGET.name == MY.TargetSlot' to control exactly where each job ends up. In general J > N*P, i.e. some jobs must wait before they can be scheduled.

- Not all of the details above may be significant. Apart from them, all physical cores are equivalent in capabilities and all jobs are copies of the same process working on different chunks of fairly uniform data. That is, I expect all jobs to take about the same time on average and to consume about the same amount of system time.

- At the end of each job, the process calls getrusage(RUSAGE_SELF, ...) and logs the user and system times it consumed, among other things.
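
For illustration, the end-of-job accounting described in the last point could look like the minimal sketch below. It uses Python's standard `resource` module (Unix only) as a stand-in, since the language of the actual job and its log format are not stated here. The helper names `cpu_times` and `format_usage` are hypothetical.

```python
import resource


def cpu_times():
    """Return (user_seconds, system_seconds) consumed so far by this process."""
    # RUSAGE_SELF covers the calling process only, not its children.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime, usage.ru_stime


def format_usage(user_s, sys_s):
    """Build a one-line log record (hypothetical format)."""
    return f"user={user_s:.3f}s sys={sys_s:.3f}s"


if __name__ == "__main__":
    # Called once, at the very end of the job.
    user_s, sys_s = cpu_times()
    print(format_usage(user_s, sys_s))
```

Both values are CPU time charged to the process itself, so whatever shows up as "sys time" is time the kernel spent working on behalf of that process.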

I've noticed the following over repeated runs:

- The first N jobs in each group of jobs that can be scheduled (i.e. jobs with $(Process) in the ranges 0...(N-1), N*P...N*P+(N-1), etc.) show a spike in the "sys time" statistic: it is 10-40 seconds higher than the typical value of 3-4 seconds seen for any other job.
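
To make the affected ranges concrete, here is a small sketch that lists the $(Process) values in question. It assumes that jobs are numbered consecutively from 0 and that each group holds exactly N*P jobs. The function name `first_wave_process_ids` is hypothetical.

```python
def first_wave_process_ids(n_nodes, cores_per_node, n_jobs):
    """Return the $(Process) values of the first N jobs in each group of N*P jobs."""
    group_size = n_nodes * cores_per_node
    ids = []
    for start in range(0, n_jobs, group_size):
        # The first N jobs of this group, clipped to the total job count.
        ids.extend(range(start, min(start + n_nodes, n_jobs)))
    return ids


if __name__ == "__main__":
    # Example: N=2 nodes, P=3 cores each, J=14 jobs.
    print(first_wave_process_ids(2, 3, 14))  # prints [0, 1, 6, 7, 12, 13]
```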

Could this have something to do with waiting until the next negotiator round? If so, am I seeing some random fraction of NEGOTIATOR_INTERVAL (currently 60 seconds) accrued by the condor_starter before it clones the real job process? (Sorry if I am getting any details of the Condor architecture wrong here.) I would like to eliminate this delay if I can.

Regards,
Vlad