
RE: [Condor-users] Can you call condor_reschedule too frequently?



 
> <snip>
> > after submitting the jobs and before entering the Condor monitor loop. 
> > Should I worry about stress on the master? Can anyone comment on why 
> > jobs are taking a long time to match when submitted from windows?
> 
> I'd be more worried about what the 'monitor loop'  is doing...
> 
> if the submitter machine is overloaded either generally or 
> from too many demands on the schedd's (single threaded) time 
> the other daemons in the farm will timeout their requests 
> (such as those from the negotiator).
> 
> What is your monitor loop doing - using condor_wait? calling 
> condor_q every so often? condor_history? scanning the job log?

This is a slightly modified Condor.pm module -- so it's watching the log
file. That puts no stress on the schedd, but it does take CPU cycles on
the same machine.
 
> the submission of new jobs normally seems to trigger a 
> reschedule (anecdotal evidence) however the *release* of a 
> job doesn't - are you submitting on hold then releasing as 
> part of your scripted solution (I noticed this when I wrote a 
> c# wrapper round the command line)...

These are newly submitted jobs that are sent into the system un-held.
I have, though, been considering a submit-held flow that un-holds jobs
so that no more than X jobs from any cluster are running (or available
to run) at any one time. Thanks for the tip. I'll remember that
condor_reschedule is definitely needed after un-holding a job.
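
A rough sketch of that submit-held flow, under stated assumptions: the
lists of held and un-held jobs are already known as (cluster, proc) pairs
(for example from the log), and the helper names jobs_to_release and
release_and_reschedule are invented for this example. Only condor_release
and condor_reschedule are real commands here.

```python
import subprocess
from collections import defaultdict


def jobs_to_release(held, active, limit):
    """Pick held jobs so each cluster ends up with at most `limit` un-held jobs.

    held and active are iterables of (cluster, proc) pairs.
    """
    count = defaultdict(int)
    for cluster, _proc in active:
        count[cluster] += 1
    chosen = []
    for cluster, proc in sorted(held):
        if count[cluster] < limit:
            chosen.append((cluster, proc))
            count[cluster] += 1
    return chosen


def release_and_reschedule(jobs, run=subprocess.run):
    """Release the given jobs, then ask for a reschedule; return the commands issued."""
    if not jobs:
        return []  # nothing released, so no reason to poke the schedd
    commands = [
        ["condor_release"] + [f"{c}.{p}" for c, p in jobs],
        ["condor_reschedule"],
    ]
    for command in commands:
        run(command, check=True)
    return commands
```

Batching all the releases into one condor_release call followed by a single
condor_reschedule keeps the number of requests to the schedd small.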
 
> Is the negotiation machine overloaded / taking too long going 
> through processes which can't run anyway...

No. The negotiator machine is a dual 866MHz PIII with 2GB of RAM. The
Linux performance monitors show the CPU load to be very low. RAM is
readily available. The machine does not appear to be under undue stress.

> Take a look in your negotiation logs and you should get some 
> clues as to why it takes so long.
> 
> I find that, with a significant number of submitters, the farm 
> is never going to spring into life, since the overhead of 
> the negotiator going round all the schedds will always 
> add a little (and sometimes a lot) of latency, in the order 
> of a few minutes. If this bugs you, you may as well get used 
> to it, radically shrink the number of submitters or use 
> something different. I didn't write the system but given its 
> operational goals/history (big farms, non heterogeneous, cycle 
> stealing roots - hours / days / months worth of jobs) and my 
> perception of the architecture it uses* I see no way for it 
> to avoid this latency on initiating a match/claim...

I think that this is just something I will need to explain to my users,
who, incidentally, are being transitioned off a homegrown and aging
solution that unfortunately did spring to life as soon as jobs entered an
empty system.

> You could layer your own schedulers on top and thus 
> permanently maintain the match and manage your own submission 
> process - I don't think this will gain you much for the 
> (massive) hassles it will cause.

Agreed. It's not something I want to consider. I am, though, going to
play with condor_reschedule calls to see if it does hurt. If I discover
anything at all I'll report it back to this conversation thread.
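
One cautious way to experiment is to cap how often the script is allowed
to call condor_reschedule at all. The sketch below is only an assumption
about how that could look; the class name RescheduleThrottle and the
interval are made up, and nothing here says what interval is actually safe.

```python
import time


class RescheduleThrottle:
    """Allow an action at most once every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock   # injectable so the throttle can be tested
        self.last = None     # time of the last permitted call

    def should_call(self):
        """Return True (and record the time) if enough time has passed."""
        now = self.clock()
        if self.last is None or now - self.last >= self.min_interval:
            self.last = now
            return True
        return False
```

The submit script would then run condor_reschedule only when should_call()
returns True, which makes it easy to try different intervals and compare
match times.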

-Ian