Re: [Condor-users] Condor DAG spinning
- Date: Wed, 28 Oct 2009 16:34:30 -0400
- From: Hoover Sam <shoover@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor DAG spinning
You can also raise the open-file limit (ulimit -n) on your server, or
throttle the number of jobs DAGMan keeps in the queue. I do both on my
system: I increased the ulimit and set
DAGMAN_MAX_JOBS_IDLE = <number of cores in the Condor cluster>
DAGMAN_MAX_JOBS_SUBMITTED = <2 x number of cores in the Condor cluster>
This keeps the queue to a reasonable size and limits the number of
open file descriptors.
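As a rough illustration of the sizing rule above, here is a minimal Python sketch that derives the two throttle values from a core count and reports the current open-file limit. The function names and the example core count of 64 are hypothetical; only the two DAGMAN_* setting names come from this message.

```python
import resource  # Unix-only standard library module


def dagman_throttles(total_cores):
    """Return the two DAGMan throttles sized as described above:
    idle jobs = cores, submitted jobs = 2 x cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be a positive integer")
    return {
        "DAGMAN_MAX_JOBS_IDLE": total_cores,
        "DAGMAN_MAX_JOBS_SUBMITTED": 2 * total_cores,
    }


def render_config(settings):
    """Format the settings as 'NAME = value' configuration lines."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


def open_file_limits():
    """Return the (soft, hard) per-process open-file limits,
    i.e. what 'ulimit -n' and 'ulimit -Hn' report."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # 64 cores is an assumed example value, not a figure from the thread.
    print(render_config(dagman_throttles(64)))
    soft, hard = open_file_limits()
    print(f"open-file limit: soft={soft} hard={hard}")
```

The printed lines would still need to be placed in the Condor configuration by hand; the sketch only does the arithmetic.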
On Oct 28, 2009, at 1:07 PM, R. Kent Wenger wrote:
On Wed, 28 Oct 2009, Steve Shaw wrote:
Thanks for the quick response Kent,
I tried the 7.2.4 release and sent 1000 python jobs in a single no-
dependency dag. Each job just created a file and exited. After
doing a condor_submit_dag on my created dag, I got 509 files back
and then it looks like my dag job got stuck and started idling
(with the 7.0.4 build, I could swear that it remained 'running' but
still had the same behavior). Looking at the lib.err file for the
dag, it had the error:
dprintf() had a fatal error in pid 8620
Can't open "bigjob.dag.dagman.out"
errno: 24 (Too many open files)
Okay, that explains your problems...
I assume that your node jobs are using a lot of different log files. In
all DAGMans prior to 7.3.2, all of the log files are open all of the
time. In 7.3.2, the log file reading code was changed to only have a log
file open when a job that logs to that file is in the queue. However,
7.3.2 has a bug in how the log file code deals with rescue DAGs. This is
fixed in 7.4.0, so a 7.4.0 DAGMan would fix all of your problems.
Unfortunately, 7.4.0 hasn't been released yet.
So the workaround would be to change your node jobs to use a smaller
number of log files. (In fact, performance-wise the best thing is for
all jobs to use the same log file.) If that's really hard on your end, I
guess we could send 7.4.0 pre-release DAGMan binaries, if you tell us
what architecture/OS you need.
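To see how many distinct log files a set of node submit files names, a small check along these lines may help. It is only a sketch: the function name is hypothetical, it takes the submit file contents as strings, and it compares the log values as written, so macros such as $(Cluster) are not expanded.

```python
import re

# Matches a submit-file line of the form "log = <path>" (case-insensitive).
LOG_LINE = re.compile(
    r"^[ \t]*log[ \t]*=[ \t]*(\S.*?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def distinct_logs(submit_texts):
    """Return the set of log file names mentioned in the given
    submit file contents (one string per submit file)."""
    logs = set()
    for text in submit_texts:
        for match in LOG_LINE.finditer(text):
            logs.add(match.group(1))
    return logs


if __name__ == "__main__":
    # Hypothetical submit files: two share a log, one uses its own.
    examples = [
        "executable = job.py\nlog = shared.log\nqueue\n",
        "executable = job.py\nLog = shared.log\nqueue\n",
        "executable = job.py\nlog = other.log\nqueue\n",
    ]
    print(sorted(distinct_logs(examples)))
```

A result with one entry means all node jobs already share a single log file; a count near the open-file limit would be consistent with the "Too many open files" failure on a pre-7.3.2 DAGMan.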
(One general DAGMan note here -- in 7.3.2 and later versions, you don't
have to specify a log file in your node job submit files. If no log
file is specified, DAGMan will automatically plug in a default log
file. We think this will probably be the preferred way to do things,
since you'll automatically get a single log file, and you won't have to
worry about "interference" if you use the same submit file in more than
one DAG.)
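For reference, a flat no-dependency DAG like the 1000-job test described earlier in this thread could be generated with a sketch like the following, using one shared submit file that has no log line. The node names, the submit file name node.sub, the executable make_file.py, and the index variable are all hypothetical.

```python
# Shared submit file with no "log" line; with DAGMan 7.3.2 or later a
# default log file is plugged in automatically, per the note above.
SUBMIT_TEMPLATE = """\
universe   = vanilla
executable = make_file.py
arguments  = $(index)
queue
"""


def build_flat_dag(num_nodes, submit_file="node.sub"):
    """Return DAG file text with num_nodes independent nodes that all
    use the same submit file, each passed its own index via VARS."""
    lines = []
    for i in range(num_nodes):
        name = f"node{i:04d}"
        lines.append(f"JOB {name} {submit_file}")
        lines.append(f'VARS {name} index="{i}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("node.sub", "w") as f:
        f.write(SUBMIT_TEMPLATE)
    with open("bigjob.dag", "w") as f:
        f.write(build_flat_dag(1000))
```

The resulting file would then be handed to condor_submit_dag as usual.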