Re: [Condor-users] jobs stop running when lots of people submit jobs
- Date: Tue, 20 May 2008 08:14:04 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs stop running when lots of people submit jobs
The problem you describe sounds like a problem that was fixed in 7.0.0.
Here's the entry in the 7.0.0 version history:
Fixed a problem in the condor_negotiator in which machines go
unassigned when user priorities result in the machines getting split
into shares that are rounded down to 0. For example if there are 10
machines and 100 equal priority submitters, then each submitter was
getting 0.1 machines, which got rounded down to 0, so no machines were
assigned to anybody. The message in the condor_negotiator log in this
case was this:
Over submitter resource limit (0) ... only consider startd ranks
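To make the arithmetic in that entry concrete, here is a minimal Python sketch of the rounding effect. It is not the negotiator's actual code: the function name floor_shares and the per-submitter "weight" values are hypothetical, chosen only to illustrate how proportional shares that are rounded down can all end up at 0.

```python
import math

def floor_shares(total_machines, weights):
    """Split machines among submitters in proportion to a weight,
    rounding each share down (illustrative only, not Condor's algorithm)."""
    total_weight = sum(weights.values())
    return {
        name: math.floor(total_machines * w / total_weight)
        for name, w in weights.items()
    }

# 10 machines, 100 submitters with equal weight: each gets floor(0.1) = 0.
many = {f"user{i}": 1.0 for i in range(100)}
shares = floor_shares(10, many)
print(sum(shares.values()))  # total machines handed out

# With only 5 equal submitters the shares are whole numbers, so nothing is lost.
few = {f"user{i}": 1.0 for i in range(5)}
print(floor_shares(10, few))
```

In the first case no submitter is assigned a machine even though all 10 are free, which matches the symptom of jobs stopping only once many people submit at the same time.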
I hope that solves your problem!
Ben Clifford wrote:
I have a condor installation which I use for training.
Sometimes when in use, it stops running jobs, with those jobs appearing
2 match but reject the job for unknown reasons
When I attempt to put load on a fresh installation, both with condor jobs
and with non-condor jobs, both from my own account and from several
accounts at once, I cannot get this problem to reappear; but as soon as
students start using it, the problems start (even to the extent that my
test load scripts will be running in a loop happily for hours and then
stop around the time students start)
So the only mechanism I have for recreating it at the moment is to point a
room full of students at it (which is not an easily repeatable action).
This has happened a few times, but I now have an install that is in this
state and still online rather than being taken down right after a
This is using condor-6.8.4. Condor-G works OK submitting to Globus on
other machines, but local execution through the vanilla universe does not
(using a variety of submission mechanisms - through GRAM2, through
condor_run, condor_submit, dagman).
I don't see anything in the logs that indicates what is causing this
problem - does anyone have any advice about what I can look for?