
[Condor-users] jobs stop running when lots of people submit jobs



I have a Condor installation that I use for training.

Sometimes, while it is in use, it stops running jobs. For those jobs,
condor_q -better-analyze reports:

      2 match but reject the job for unknown reasons
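
For anyone who wants to track when the pool enters this state, here is a rough sketch that pulls count lines out of saved -better-analyze text. It assumes the lines of interest have the shape "<count> <description>", as in the line quoted above; the function name is hypothetical and other output shapes are not handled.

```python
import re

# Assumed line shape: optional leading whitespace, an integer count,
# then a free-text description (as in the line quoted in this post).
_COUNT_LINE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def summarize_analysis(text):
    """Return {description: count} for every '<count> <description>' line."""
    counts = {}
    for line in text.splitlines():
        match = _COUNT_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


sample = "      2 match but reject the job for unknown reasons\n"
print(summarize_analysis(sample))
```

Running this periodically against captured output and logging the result with a timestamp would show roughly when the "reject ... for unknown reasons" count first becomes non-zero.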

When I put load on a fresh installation, whether with Condor jobs or
non-Condor jobs, from my own account or from several accounts at once, I
cannot reproduce the problem. As soon as students start using it, though,
the problems begin: my test load scripts will run happily in a loop for
hours and then stop around the time the students start.
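
A load loop of this kind could look roughly like the sketch below. This is an illustration, not the actual test scripts: the submit file name, the iteration count, the interval, and the function names are all assumptions. It only records when each condor_submit call was made and its exit code; it does not check whether the job later ran.

```python
import subprocess
import time
from datetime import datetime


def submit_once(submit_file, run=subprocess.run):
    """Run condor_submit on submit_file; return (timestamp, exit code)."""
    result = run(["condor_submit", submit_file], capture_output=True, text=True)
    return datetime.now().isoformat(timespec="seconds"), result.returncode


def load_loop(submit_file, iterations, interval_s,
              run=subprocess.run, sleep=time.sleep):
    """Submit the same file repeatedly, keeping a timestamped log."""
    log = []
    for _ in range(iterations):
        log.append(submit_once(submit_file, run))
        sleep(interval_s)
    return log


# Example (hypothetical submit file name), not executed here:
#   for stamp, code in load_loop("sleep.sub", iterations=10, interval_s=60):
#       print(stamp, code)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a Condor installation. Comparing the timestamps with the job logs would help pin down when jobs stop being started.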

So the only mechanism I have for recreating it at the moment is to point a 
room full of students at it (which is not an easily repeatable action).

This has happened a few times, but I now have an install that is in this 
state and still online rather than being taken down right after a 
tutorial.

This is using condor-6.8.4. Condor-G works OK submitting to Globus on 
other machines, but local execution through the vanilla universe does not, 
whichever submission mechanism I use (GRAM2, condor_run, condor_submit, 
DAGMan).

I don't see anything in the logs that indicates what is causing this 
problem - does anyone have any advice about what I can look for?

--