
Re: [Condor-users] jobs stop running when lots of people submit jobs




The problem you describe sounds like one that was fixed in Condor 7.0.0. Here's the entry in the 7.0.0 version history:

Fixed a problem in the condor_negotiator in which machines go unassigned when user priorities result in the machines getting split into shares that are rounded down to 0. For example if there are 10 machines and 100 equal priority submitters, then each submitter was getting 0.1 machines, which got rounded down to 0, so no machines were assigned to anybody. The message in the condor_negotiator log in this case was this:

Over submitter resource limit (0) ... only consider startd ranks
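
To make the arithmetic in that entry concrete, here is a minimal Python sketch. It is not the negotiator's actual code: the function name floored_shares and the simple even-split model are assumptions made only to illustrate how rounding each share down can leave every submitter with a limit of 0.

```python
import math


def floored_shares(num_machines, num_submitters):
    """Hypothetical model: split machines evenly among equal-priority
    submitters and round each share down to a whole number."""
    fair_share = num_machines / num_submitters   # e.g. 10 / 100 = 0.1
    limit = math.floor(fair_share)               # 0.1 rounds down to 0
    return [limit] * num_submitters


# The example from the version history: 10 machines, 100 submitters.
shares = floored_shares(10, 100)
print(sum(shares))  # total machines handed out: 0
```

In this model the effect only appears once submitters outnumber machines, which would fit a problem that shows up when a room full of students starts submitting at the same time.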


I hope that solves your problem!

--Dan

Ben Clifford wrote:
I have a condor installation which I use for training.

Sometimes when in use, it stops running jobs, with those jobs appearing as:

      2 match but reject the job for unknown reasons

in the output of condor_q -better-analyze.

When I put load on a fresh installation, with both condor and non-condor jobs, from my own account and from several accounts at once, I cannot get this problem to reappear. But as soon as students start using it, the problems start: my test load scripts will run happily in a loop for hours and then stop around the time the students start.

So the only mechanism I have for recreating it at the moment is to point a room full of students at it (which is not an easily repeatable action).

This has happened a few times, but I now have an install that is in this state and still online rather than being taken down right after a tutorial.

This is using condor-6.8.4. Condor-G works OK submitting to Globus on other machines, but local execution through the vanilla universe does not (using a variety of submission mechanisms - through GRAM2, through condor_run, condor_submit, dagman).

I don't see anything in the logs that indicates what is causing this problem - does anyone have any advice about what I can look for?