
[HTCondor-users] condor_shadow start rate



We are seeing an odd issue on EC2 where we can't start more than 3
jobs per second. We first ruled out the negotiator being the
bottleneck; the negotiator had no problem matching jobs. It seems that
we can't start more than 3 condor_shadow processes per second on a
single schedd. We discovered that JOB_START_COUNT was set to 3, but
increasing that value and restarting the schedd did not give us more
shadow starts per second. (Lowering the value did reduce our shadow
start rate, as expected.) The schedd runs on a c3.8xlarge instance,
which has a 32-core Intel Ivy Bridge processor, 60 GB of RAM, and an
SSD. Nothing we've found suggests that the host is resource-constrained
in any way.

We initially saw this on a customer's 8000-core production instance,
but we've recreated a toy example (with the same server specs, but a
pool size of 64 and very short-running jobs) that shows the same
behavior. With shadow reuse enabled (the default), we see nine jobs
start per second (which matches the job completion rate). When we
disabled shadow reuse to test a cold start of the cluster, the schedd
could start only three jobs per second.
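
To put the two rates in perspective, here is a rough back-of-the-envelope
sketch of how long a cold start would take at each one. It assumes one
single-core job per core (so 8000 jobs for the 8000-core pool) and a
constant start rate; both are simplifications for illustration, and the
function name is made up for this example.

```python
def cold_start_seconds(total_jobs, starts_per_second):
    """Seconds needed to start total_jobs at a constant shadow start rate."""
    if starts_per_second <= 0:
        raise ValueError("starts_per_second must be positive")
    return total_jobs / starts_per_second


# Assumption: one single-core job per core in the 8000-core pool.
TOTAL_JOBS = 8000

# 3/s is the rate observed without shadow reuse, 9/s with it.
for rate in (3, 9):
    secs = cold_start_seconds(TOTAL_JOBS, rate)
    print(f"{rate} starts/s: about {secs / 60:.0f} minutes to fill the pool")
```

At three starts per second the pool takes roughly three times as long to
fill as it would at nine, which is why the cold-start case matters to us.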

Is this an expected limit? Has anyone else seen any issues with slow
shadow starts?


Thanks,
BC

(P.S. Running HTCondor 8.0.5. CM and scheduler are on CentOS 6
instances and execute nodes are Windows Server 2008.)

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing