
Re: [Condor-users] Condor DAG spinning



On Wed, 28 Oct 2009, Steve Shaw wrote:

> Thanks for the quick response Kent,
>
> I tried the 7.2.4 release and sent 1000 python jobs in a single no-dependency dag.  Each job just created a file and exited.  After doing a condor_submit_dag on my created dag, I got 509 files back and then it looks like my dag job got stuck and started idling (with the 7.0.4 build, I could swear that it remained 'running' but still had the same behavior).  Looking at the lib.err file for the dag, it had the error:
>
> dprintf() had a fatal error in pid 8620
> Can't open "bigjob.dag.dagman.out"
> errno: 24 (Too many open files)

Okay, that explains your problems...

I assume that your node jobs are using a lot of different log files.  In
all DAGMans prior to 7.3.2, all of the log files are open all of the time.
In 7.3.2, the log file reading code was changed to only have a log file open when a job that logs to that file is in the queue. However, 7.3.2 has a bug in how the log file code deals with rescue DAGs. This is fixed in 7.4.0, so a 7.4.0 DAGMan would fix all of your problems. Unfortunately, 7.4.0 hasn't been released yet.
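
As a rough way to check whether a DAG is likely to hit this on a pre-7.3.2 DAGMan, something like the following Python sketch could count the distinct log files named in the node submit files and compare that count with the per-process open-file limit. The function names are made up for illustration, the parsing only handles plain "log = <file>" lines (no macros), and DAGMan itself needs some descriptors beyond the log files, so treat the result as a hint rather than a guarantee.

```python
import re
import resource  # Unix-only; provides the per-process descriptor limit

# Matches simple "log = <file>" lines; submit-file keywords are case-insensitive.
# Assumption: no macros or line continuations are used in the log value.
LOG_LINE = re.compile(r"^[ \t]*log[ \t]*=[ \t]*(.+?)[ \t]*$",
                      re.IGNORECASE | re.MULTILINE)


def distinct_log_files(submit_texts):
    """Return the set of log file names found in the given submit-file texts."""
    logs = set()
    for text in submit_texts:
        logs.update(LOG_LINE.findall(text))
    return logs


def exceeds_fd_limit(n_logs, soft_limit=None):
    """True if n_logs simultaneously open files would reach the soft limit.

    If soft_limit is not given, the current RLIMIT_NOFILE soft limit is used.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return n_logs >= soft_limit
```

For example, passing the contents of all node submit files to distinct_log_files() and the size of the resulting set to exceeds_fd_limit() gives a quick indication of whether "errno: 24 (Too many open files)" is plausible for that DAG.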

So the workaround would be to change your node jobs to use a smaller "set" of log files. (In fact, performance-wise the best thing is for all node jobs to use the same log file.) If that's really hard on your end, I guess we could send 7.4.0 pre-release DAGMan binaries, if you tell us what
architecture/OS you need.
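
To illustrate the single-log-file workaround, here is a minimal Python sketch that generates the node submit files and the DAG file with every node pointing at the same log. All file names (make_file.py, bigjob.log, node<i>.sub, bigjob.dag) and the vanilla universe are assumptions for the example; adjust them to match your actual jobs.

```python
def build_dag(n_jobs, executable="make_file.py", shared_log="bigjob.log"):
    """Return {filename: contents} for n_jobs node submit files plus a DAG file.

    Every submit file uses the same `log` value, so DAGMan only has one
    user log to watch instead of one per node.
    """
    files = {}
    dag_lines = []
    for i in range(n_jobs):
        submit_name = f"node{i}.sub"
        files[submit_name] = "\n".join([
            "universe = vanilla",
            f"executable = {executable}",
            f"arguments = {i}",
            f"output = node{i}.out",
            f"error = node{i}.err",
            f"log = {shared_log}",   # identical in every node submit file
            "queue",
        ]) + "\n"
        # One JOB line per node; no PARENT/CHILD lines, i.e. no dependencies.
        dag_lines.append(f"JOB node{i} {submit_name}")
    files["bigjob.dag"] = "\n".join(dag_lines) + "\n"
    return files


def write_files(files):
    """Write the generated files into the current directory."""
    for name, contents in files.items():
        with open(name, "w") as fh:
            fh.write(contents)
```

Calling write_files(build_dag(1000)) and then running condor_submit_dag on bigjob.dag would reproduce the 1000-node test with a single shared log file.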

(One general DAGMan note here -- in 7.3.2 and later versions of DAGMan, you don't have to specify a log file in your node job submit files. If no log file is specified, DAGMan will automatically plug in a default log file. We think this will probably be the preferred way to do things, since you'll automatically get a single log file, and you won't have to worry about "interference" if you use the same submit file in more than one DAG.)
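
If you want to rely on the default log file once you are on 7.3.2 or later, one possible approach is to strip the explicit log lines from existing submit files. The Python sketch below is only an illustration: the function name is hypothetical, and it handles only simple single-line "log = ..." entries.

```python
import re

# Matches a whole "log = ..." line, including its trailing newline if present.
# [ \t]* (rather than \s*) keeps the match from swallowing preceding blank lines.
LOG_LINE = re.compile(r"^[ \t]*log[ \t]*=.*(?:\n|$)",
                      re.IGNORECASE | re.MULTILINE)


def strip_log_lines(submit_text):
    """Return submit_text with any explicit `log = ...` lines removed."""
    return LOG_LINE.sub("", submit_text)
```

Other keywords that merely start with "log" (for example log_xml) are left alone, since the pattern requires "=" directly after "log" and optional whitespace.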

Kent Wenger
Condor Team