Re: [Condor-users] odd systemic spike in job sys time, could it be NEGOTIATOR_INTERVAL-related?

On Jun 26, 2012, at 7:01 PM, Vlad wrote:

> I would like to understand whether the following phenomenon has something to do with Condor's NEGOTIATOR_INTERVAL parameter:
> - I have N nodes used as Condor slaves, with P cores at each node. Condor is configured to enforce core affinity and each node has a 1:1 mapping between a Condor slot and a single core. I submit J jobs as a cluster using expressions of the type 'requirements = TARGET.name == MY.TargetSlot' to control exactly where each job ends up. In general J > N*P, i.e. some jobs must wait before they can be scheduled.
> - The previous details may not all be significant, but otherwise all physical cores are equivalent in capabilities and all jobs are copies of the same process working on different chunks of fairly uniform data -- that is, I expect all jobs to take about the same time on average and consume about the same amount of system time.
> - At the end of each job, it calls getrusage(RUSAGE_SELF, ...) and logs its user and system times, among other things.
> I've noticed the following over repeated runs:
> - The first N jobs in each group that can be scheduled (i.e. jobs with $(Process) in the ranges 0...(N-1), N*P...N*P+(N-1), etc.) show a spike in the "sys time" statistic: it is 10-40 seconds higher than the typical value of 3-4 seconds seen for any other job.
> Could this be something to do with waiting until the next negotiator round? And hence am I seeing some random fraction of NEGOTIATOR_INTERVAL (60 sec currently) accrued by the condor_starter before it clones the real job process? (Sorry if I am getting any details of Condor architecture wrong here). I would like to eliminate this delay if I could.
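
For reference, the index pattern described in the quoted message can be written out explicitly. This is only a minimal Python sketch: the function name and the example values of N, P, and J are made up for illustration, and it assumes jobs are placed in waves of N*P in $(Process) order.

```python
def first_n_of_each_wave(n_nodes, cores_per_node, n_jobs):
    """Return the $(Process) values of the first N jobs in each wave of N*P jobs."""
    wave = n_nodes * cores_per_node  # jobs that can run at once
    return [p for p in range(n_jobs) if p % wave < n_nodes]

# Hypothetical example: N=2 nodes, P=3 cores each, J=14 jobs
print(first_n_of_each_wave(2, 3, 14))  # prints [0, 1, 6, 7, 12, 13]
```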

Once a job is matched to a slot, the negotiator has no bearing on the running of the job. The condor_starter and condor_shadow (which handle the actual running of the job) don't talk to the negotiator. The condor_starter doesn't do anything significant between forking a process to start the job and exec'ing the job's executable.

I can't think of anything Condor does that would explain the spike in process system time you're seeing if you're using the vanilla universe. If your jobs are in the standard universe, however, they communicate with the condor_shadow for remote system calls.
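
One way to narrow this down might be to have the job record getrusage() at the very start as well as at the end. On POSIX systems ru_stime counts CPU time spent in the kernel on behalf of the process, not wall-clock time spent waiting, and the counters carry across exec, so whatever is already on them at startup was accrued before the job body ran. A minimal sketch in Python follows; the original job presumably calls the C function directly, and the helper names here are hypothetical.

```python
import resource
import time

def cpu_times():
    """Return (user, sys) CPU seconds consumed so far by this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_utime, ru.ru_stime

def measure(work):
    """Run work() and return the (user, sys) CPU seconds it consumed."""
    u0, s0 = cpu_times()
    work()
    u1, s1 = cpu_times()
    return u1 - u0, s1 - s0

if __name__ == "__main__":
    at_start = cpu_times()  # accrued before the job body runs
    during = measure(lambda: time.sleep(0.2))  # stand-in for the real job work
    print("at start: user=%.3f sys=%.3f" % at_start)
    print("job body: user=%.3f sys=%.3f" % during)
```

In this sketch the sleep should add essentially no system time, which illustrates why idle waiting alone would not be expected to show up in ru_stime. Note that in Python the "at start" figure also includes interpreter startup; a C job that takes its first snapshot at the top of main() would give a cleaner reading.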

