Re: [Condor-users] idle + claimed, Ian Chesel?

On Wed, 30 Mar 2005 00:09:49 -0700, Zachary Stauber <zstauber@xxxxxxxxx> wrote:
> Well, we switched the condor master to a quad processor with gigabit
> Ethernet running Windows 2000 Server, and we're also sending jobs from there
> (and only there), but it doesn't start the startd so it cannot run jobs. 
> Unfortunately we're still having the same problem.  None of the processors
> ever peaks, even during the longest queue of jobs, and the network card
> never registers over 25% or 30% of its total available bandwidth.
> I'm sort of stumped since the negotiator and schedd's on the
> master/submitter never even register 25% on a processor in the performance
> monitor, nothing else is running on that machine, and the network card never
> runs out of bandwidth.  However, I'll change the PREEMPTION_REQUIREMENTS
> setting you mentioned.

Note that a beefy submit machine won't solve the underlying
single-threaded issue, merely delay it (with diminishing returns as
the network/disk I/O starts to outweigh the CPU factor).

Condor performs best when there are a reasonable number of schedds
with a reasonable distribution of jobs. If your jobs respond to the
preemption signal by terminating (or run so fast that a fair number
of them terminate during the gap between the vacate signal and the
kill signal), then they will be 'checkpointed' by the schedd, which
involves transferring the entire contents of the working dir back to
the submitting machine. If this is big it can take a long time, and
while it is happening the schedd will not do anything else...

I suggest you do what Ian suggested and use multiple schedds on one
machine to give yourself at least some tolerance for long-running
operations (even if this will only kick in if you get lucky and the
other schedd's jobs can take up the slack).

The load on your quad cpu box will be as follows:

1) Occasional spikes of cpu activity from the negotiator.
2) Under high loads (many concurrent jobs) a reasonably significant
load on the schedd
3) Not a lot of effort from the collector

The shadows will take up memory and Windows resources (significant
once you go beyond 100 shadows) but very little CPU.

The quad box would be more efficiently loaded if you had 2 (maybe even
3) schedds, with users split between them.

Other things to avoid in a single/limited schedd environment:

* Running condor_q a lot (this massively slows things down and users
may not realise this).
Using condor_status -schedd may give you sufficient info, and it goes
to the collector, not the schedd.

* Transferring more files back than you need - use subdirectories
(which don't get transferred back) or clean up before the job
finishes. Better still, do not transfer any files back except those
you explicitly need.

* Lots of submissions which use copy_to_spool=true (the default) when
not needed.
This one is a guess, but I believe the schedd is responsible for
performing the copy.
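
As a rough sketch of the last two points, a submit description file
along these lines should keep the transfer work on the schedd down.
The executable and file names are made up for illustration, and it is
worth checking the manual for your Condor version before relying on
the exact behaviour of these commands:

```
# Hypothetical submit description file; file names are placeholders.
universe                = vanilla
executable              = myjob.exe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Only bring back what you explicitly need, not the whole working dir.
transfer_output_files   = result.dat
# Don't copy the executable into the spool on every submission.
copy_to_spool           = false
queue
```

With transfer_output_files set, only the listed files should come
back at job exit, so anything else the job writes stays on the
execute machine.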