
Re: [Condor-users] DAG questions



On Fri, 18 Dec 2009, Ian Stokes-Rees wrote:

> We're on 7.2.4 right now. We don't do upgrades on Fridays, but will upgrade to 7.4 on Monday.
>
> Thanks for the -output_dir and -debug pointers -- I read the DAG documentation, but not the command details (man page) and missed seeing them. I was expecting they'd be config file params.
>
> Now I'm more confused about my situation. I've set up a much smaller run with only 3500 nodes in the DAG, however I'm still getting PANIC messages due to lack of file descriptors. An identical submission with only 40 nodes works fine, so I feel that rules out my general configuration, and points to either an OS issue or a Condor config issue. I've completely stopped all condor processes and restarted them.

Unless your DAG is really "wide" (most of the 3500 nodes in the queue at one time), upgrading to 7.4 should fix your file descriptor issue. The DAGMan log file reading code underwent some pretty major changes between 7.2 and 7.4: now DAGMan keeps a file descriptor open only for log files that correspond to jobs actually in the queue; before, DAGMan opened all of the log files at the start, and kept them open.

(So for anyone else who runs into this problem and can't upgrade to 7.4, the answer is to make your node jobs use a smaller set of log files, as opposed to having a separate log file for each job.)
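To see how many separate log files a DAG's node jobs actually use, something like the following rough Python sketch may help. It only assumes that each submit file names its log with a "log = ..." line; the function names and the example submit text are hypothetical, not anything shipped with Condor. The resulting count can then be compared by hand against the per-process open-file limit on the submit machine.

```python
import re

# Matches a submit-file line such as "log = mydir/job.log".
# The key is matched case-insensitively.
LOG_LINE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE | re.MULTILINE)


def log_file_of(submit_text):
    """Return the log path named in one submit description, or None if absent."""
    match = LOG_LINE.search(submit_text)
    return match.group(1) if match else None


def distinct_log_files(submit_texts):
    """Collect the distinct log paths used across many submit descriptions."""
    return {path for path in map(log_file_of, submit_texts) if path is not None}


# Hypothetical example: three node jobs, two of which share a log file.
submits = [
    "executable = a.sh\nlog = shared.log\nqueue\n",
    "executable = b.sh\nLog = shared.log\nqueue\n",
    "executable = c.sh\nlog = c.log\nqueue\n",
]
logs = distinct_log_files(submits)
print(len(logs), sorted(logs))
```

In a real run the strings would come from reading the submit files listed in the DAG; the fewer distinct paths this reports, the fewer descriptors a 7.2 DAGMan has to hold open.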

If you're running a 7.4 DAGMan, a new feature is that you don't have to specify a log file at all in your submit file -- if you don't, DAGMan will assign a default log file for you. In fact, this may be the preferred way to do things, especially if you want to re-use your submit files in more than one DAG. The default log files are per-DAG, so with the default log file feature you can use the same submit file in two different DAGs without worrying about log file collisions.
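If you have a pile of existing submit files and want to try the default log file feature, a small script can drop the explicit log line from each one. This is only a sketch: the function name and the sample submit text are made up for illustration, and it assumes the log is set by a single "log = ..." line with nothing else depending on it.

```python
def without_log_line(submit_text):
    """Return the submit description with any 'log = ...' line removed."""
    kept = []
    for line in submit_text.splitlines(keepends=True):
        key, sep, _value = line.partition("=")
        if sep and key.strip().lower() == "log":
            continue  # drop the explicit log file; everything else is kept verbatim
        kept.append(line)
    return "".join(kept)


# Hypothetical submit description for one node job.
original = "executable = a.sh\nlog = a.log\noutput = a.out\nqueue\n"
print(without_log_line(original))
```

Only a line whose key is exactly "log" is removed, so something like "arguments = log=1" is left alone. Keep a copy of the originals if you still need to run them under 7.2.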

Kent Wenger
Condor Team