
[Condor-users] scaling problems with condor_negotiator



Hi,

We have a pool of ~300 Windows XP execute hosts here, with a separate
central manager and submit host (both Sun blades). Initially, when users
started submitting large clusters of jobs (as many as 20,000), the load on
the submit host went through the roof. To get around this I created a
wrapper for condor_submit that restricts the number of idle jobs in the
queue to around 500 - 1000. I've noticed now, though, that this seems to
have shifted the load onto the central manager, with the condor_negotiator
taking ~80% of the CPU for significant periods (it used to be around 1%).
I'm also seeing large numbers of idle execute hosts despite there being
an excess of queued jobs, and the throughput is suffering as a result.
So I'm wondering ...

How does the load on the negotiator scale with the number of execute hosts
and the number of idle jobs? Presumably having jobs from different users in
the queue places an additional load, since it has to juggle the priorities
(we are using a fair-share setup).

Does the negotiator treat a cluster as a single entity, since all its
component processes have the same requirements, so that a cluster puts a
smaller load on it? (I moved from clusters to individual jobs with the
wrapper script.)
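
To make the throttling idea concrete, here is a minimal sketch of the
decision such a wrapper has to make. It is an illustration only, not the
actual script: the function name and its arguments are made up for the
example, and the current idle count is assumed to come from querying the
queue separately (for instance by counting the idle jobs that condor_q
reports), which is left out here.

```python
def jobs_to_release(idle_now, idle_cap, pending):
    """Return how many held-back jobs may be submitted right now.

    idle_now -- idle jobs currently in the queue (obtained elsewhere)
    idle_cap -- upper limit on idle jobs, e.g. 1000 (hypothetical setting)
    pending  -- jobs the user still wants to submit
    """
    # Room left under the cap; never negative if the queue is already over it.
    headroom = max(0, idle_cap - idle_now)
    # Never release more jobs than are actually waiting.
    return min(headroom, pending)


if __name__ == "__main__":
    # 400 idle jobs, cap of 1000, 20,000 still to go: release 600 now.
    print(jobs_to_release(400, 1000, 20000))
```

Note that a throttle like this submits jobs in many small batches rather
than as one large cluster, which is exactly the change I am asking about
above.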

Would tweaking the negotiator config parameters improve the throughput,
and if so, which ones? The negotiator activity seems pretty bursty, so
would having more frequent negotiation cycles help?

I should point out that the jobs last for around 20 mins each, so that's
~15 starting/stopping every minute, assuming an even distribution.
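
The estimate above can be checked with a few lines of arithmetic. This is
only a back-of-the-envelope sketch: it assumes all ~300 hosts are busy and
that completions are spread evenly, and the 5-minute negotiation cycle in
the second calculation is a hypothetical figure chosen for illustration,
not a measured or configured value.

```python
def turnover_per_minute(hosts, job_minutes):
    """Jobs finishing (and needing replacement) per minute across the pool."""
    return hosts / job_minutes


def matches_needed_per_cycle(hosts, job_minutes, cycle_minutes):
    """Hosts that fall idle, and so need a new match, during one cycle."""
    return turnover_per_minute(hosts, job_minutes) * cycle_minutes


if __name__ == "__main__":
    # ~300 hosts running ~20 minute jobs
    print(turnover_per_minute(300, 20))          # 15.0
    # hypothetical 5-minute gap between negotiation cycles
    print(matches_needed_per_cycle(300, 20, 5))  # 75.0
```

If matches were only handed out once per cycle, a longer gap between
cycles would leave proportionally more hosts waiting idle, which is why I
am asking whether more frequent cycles would help.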

Apologies for the large number of questions, but I'm really trying to
get a handle on what the limiting factors are here.

regards,

-ian.

Dr Ian C. Smith
e-Science Team,
University of Liverpool,
Computing Services Department.