
Re: [Condor-users] DAG File descriptor panic when quota is exceeded



On Thu, 24 Dec 2009, Ian Stokes-Rees wrote:

I did a condor_rm earlier today on a 100k-node DAG, and Condor became intermittently responsive and then stopped responding for 45+ minutes. condor_restart and other attempts to revive it did not work (we only attempted these after about 30 minutes). Is this a possible side effect of the rescue DAG being created for a large DAG?

Well, it wouldn't be the creation of the rescue DAG causing the problems
(that's a pretty low-cost operation, and doesn't involve the schedd at
all).

What would be causing the problems is that when you condor_rm a DAGMan job, it
in turn tries to condor_rm all of its currently-running node jobs.  Right now
DAGMan does this with a single condor_rm command, using a constraint that
the removed jobs must have the right DAGManJobId value.  Maybe the "real"
fix (from the DAGMan end of things, at least) is to replace that with
individual condor_rms of the jobs, with some kind of throttle on them
(see the sketch below).  I don't see any schedd knobs that seem like they
would help with this kind of situation.
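
For illustration only (this is not the actual DAGMan code): what DAGMan does
today amounts to a single constrained remove, roughly
condor_rm -constraint 'DAGManJobId == <id>'.  A throttled, per-job alternative
could look something like the Python sketch below; the DAGMan cluster id and
the removal rate are placeholders, and it assumes only the standard condor_q
and condor_rm command-line tools.

    #!/usr/bin/env python
    # Sketch only: remove a DAG's node jobs one at a time, with a throttle,
    # instead of one big constrained condor_rm.
    import subprocess
    import time

    DAGMAN_JOB_ID = 12345        # cluster id of the DAGMan job (placeholder)
    REMOVES_PER_SECOND = 5       # throttle: condor_rm calls per second (placeholder)

    # List the node jobs belonging to this DAGMan instance.
    out = subprocess.check_output(
        ["condor_q",
         "-constraint", "DAGManJobId == %d" % DAGMAN_JOB_ID,
         "-format", "%d.", "ClusterId",
         "-format", "%d\n", "ProcId"])
    job_ids = out.decode().split()

    # Remove each node job individually, pausing between removes so the
    # schedd isn't handed one enormous removal all at once.
    for job_id in job_ids:
        subprocess.call(["condor_rm", job_id])
        time.sleep(1.0 / REMOVES_PER_SECOND)

DAGMan would of course do this internally rather than by shelling out; the
point is just the per-job removal with a pause between removes, so the schedd
isn't hit with everything in one shot.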

Obviously the schedd should handle this better, but I'm not sure you can
do anything at the user level to fix it.

Kent Wenger
Condor Team