RE: [Condor-users] idle + claimed, Ian Chesel?
- Date: Wed, 30 Mar 2005 15:01:21 -0700
- From: "Zachary Stauber" <zstauber@xxxxxxxxx>
- Subject: RE: [Condor-users] idle + claimed, Ian Chesel?
This is good advice. I'll see if I can't set up a couple of extra schedds
on the submitting/master machine, and I'll see if I can't set
copy_to_spool to false.
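For reference, copy_to_spool is a per-job command in the submit description file. A minimal sketch (the executable and file names here are made up):

```
# Hypothetical submit description file
universe      = vanilla
executable    = myjob.exe
copy_to_spool = false
queue
```

With copy_to_spool = false, the schedd does not copy the executable into its spool directory at submit time, which saves disk I/O on the submit machine during large submissions.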
I think the shadows themselves are low enough. We only have 140 VMs,
and each condor_shadow takes about 3 MB of memory (so up to roughly
420 MB total), and we have well over 2 GB of RAM on this machine. I
have noticed on weaker machines, though, that when memory maxes out it
starts crashing during submits.
Also, thanks for the advice on a less invasive condor_q-type command. I
do have it transfer back only the files I need (plus .log, .err, and
.out), because by default it tries to transfer back all my intermediate
temporary files, which makes a mess.
If I had a wish list for Condor, though, I'd wish I could tell it where
to PUT the files it transfers back, since I usually have a data
directory that gets backed up, and I don't want all my Condor files and
executables in there.
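Limiting what comes back is done with transfer_output_files in the submit description file. A sketch (file names are hypothetical):

```
# Hypothetical submit description file -- only the named output
# is transferred back; other files left in the job's working
# directory are not returned.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_output_files   = results.dat
queue
```

Without transfer_output_files, Condor's default behavior is to bring back every new or modified file in the job's working directory, which is what produces the mess of intermediate temporaries described above.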
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matt Hope
Sent: Wednesday, March 30, 2005 1:47 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] idle + claimed, Ian Chesel?
On Wed, 30 Mar 2005 00:09:49 -0700, Zachary Stauber <zstauber@xxxxxxxxx>
wrote:
> Well, we switched the condor master to a quad processor with gigabit
> Ethernet running Windows 2000 Server, and we're also sending jobs from
> there (and only there), but it doesn't start the startd so it cannot
> run jobs itself. Unfortunately we're still having the same problem.
> None of the processors ever peaks, even during the longest queue of
> jobs, and the network never registers over 25% or 30% of its total
> available bandwidth.
> I'm sort of stumped, since the negotiator and schedds on the
> master/submitter never even register 25% on a processor in the
> monitor, nothing else is running on that machine, and the network card
> never runs out of bandwidth. However, I'll change the
> setting you mentioned.
Note that just because the submit machine is beefy, that won't solve the
underlying single-threaded issue, merely delay it (with diminishing
returns as the network/disk I/O starts to overwhelm the CPU factor).
Condor performs best when there is a reasonable number of schedds with a
reasonable distribution of jobs across them. If your jobs respond to the
preemption signal by terminating (or run so fast that a fair number of
them will terminate during the gap between the vacate signal and the
kill signal), then they will be 'checkpointed' by the schedd, which
involves transferring the entire contents of the working directory back
to the submitting machine. If that directory is big, this can take a
long time, and while it is happening the schedd will not do anything
else...
I suggest you do what Ian suggested and run multiple schedds on one
machine to give yourself at least some tolerance for long-running
operations (even if this will only help when you get lucky and the
other schedds' jobs can take up the slack).
The load on your quad cpu box will be as follows:
1) Occasional spikes of cpu activity from the negotiator.
2) Under high loads (many concurrent jobs) a reasonably significant
load on the schedd
3) Not a lot of effort from the collector
The shadows will take up memory space and windows resources
(significant when you go beyond 100 shadows) but very little cpu.
The quad box would be more efficiently loaded if you had 2 (maybe even
3) schedds with users split between them.
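As a rough sketch, a second schedd can be declared in condor_config along these lines. The names and knobs below are illustrative guesses, not a tested recipe; the exact mechanism varies between Condor versions, so check the manual's section on running multiple instances of a daemon, and note each schedd needs its own spool directory and a distinct name:

```
## Hypothetical condor_config fragment for a second schedd.
## No STARTD on this machine, since it only submits jobs.
SCHEDD2      = $(SCHEDD)
SCHEDD2_ARGS = -f
SCHEDD2_LOG  = $(LOG)/SchedLog2
DAEMON_LIST  = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SCHEDD2
```

Users would then submit to one schedd or the other, so a long blocking operation (such as a large output transfer) on one schedd does not stall the other's jobs.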
Other things to avoid in a single/limited schedd environment:
* Running condor_q a lot (this massively slows things down, and users
may not realise it). Using condor_status -schedd may give you
sufficient info, and it goes to the collector, not the schedd.
* Transferring back more files than you need - use subdirectories
(which don't get transferred back) or clean up before the job finishes.
Better still, do not transfer any files back except those you
explicitly need. This one is a guess, but I believe the schedd is
responsible for performing the copy.
* Lots of submissions which use copy_to_spool=true (the default) when