
Re: [Condor-users] DAG File descriptor panic when quota is exceeded



On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:

I ran out of quota on a disk running a large DAG last night (around 2am). I bumped up the quota at 9am (10GB to 30GB), but the running DAG is still reporting file descriptor panics and DAGMan keeps crashing. Is this expected? I suppose some of the temporary/recovery files it is trying to use for the restart may be corrupted. Is there any way to test this? We had finished 30k of 100k nodes in the DAG. It would be nice not to have to restart the entire DAG (although I could write a script to re-generate the DAG with only the nodes that did not complete).

Suggestions on the best way to recover this situation would be greatly appreciated.

TIA.

Ian

Summary of dagman.out log file follows.

At 2:30am it looks like DAGMan fell over.  The last entry to this point is:

12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
12/22 02:27:12 Node 2vv
...

The snippet above shows that DAGMan is currently trying to monitor 1145 log files, which means that it needs to have at least 1145 file descriptors. Do you know what the limit of file descriptors for a process is on your machine? I think 1024 is a common default on Linux, at least, which will obviously cause problems for monitoring 1145 log files.
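A quick way to check the limit is from a process started in the same environment as the schedd. The following is only a rough sketch: the helper name fd_headroom and the allowance of 16 descriptors for stdin/stdout/stderr, sockets and so on are assumptions made for illustration, not anything DAGMan itself defines.

```python
import resource


def fd_headroom(files_to_monitor, soft_limit, reserved=16):
    """Return how many descriptors are left after opening the given files.

    A negative result means the process cannot open them all.
    'reserved' is an assumed allowance for descriptors the process
    needs for other purposes (stdio, sockets, etc.).
    """
    return soft_limit - reserved - files_to_monitor


if __name__ == "__main__":
    # RLIMIT_NOFILE is the per-process limit on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    if soft == resource.RLIM_INFINITY:
        print("no soft limit on open files")
    else:
        # 1145 is the log file count reported in the dagman.out excerpt.
        print("headroom for 1145 log files:", fd_headroom(1145, soft))
```

With a soft limit of 1024 the headroom comes out negative, which matches the failures described above. Note that this reports the limit of the shell you run it from, which may differ from the limit the schedd (and therefore DAGMan) actually inherited.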

If you are able to increase the file descriptor limit, things should work.
The best thing to do is condor_hold the DAGMan job, increase the file descriptor limit, and then condor_release the DAGMan job. This will put DAGMan into recovery mode, which will automatically read the logs to figure out what jobs had already been run, so you don't have to try to re-create a subset of your original DAG.

I think you may also have to restart the condor_schedd after changing the file descriptor limit (before running condor_release on the DAGMan job), so that the schedd gets the new limit, and therefore the new DAGMan process is forked with the new limit.
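The sequence above could be scripted roughly as follows. This is a sketch under assumptions: the job ID 1234.0 is a made-up placeholder for the cluster ID of your DAGMan job, the function names are hypothetical, and raising the limit and restarting the schedd are left as manual steps because how you do them depends on your system.

```python
import subprocess


def recovery_commands(dagman_job_id):
    """Build the hold and release commands for the DAGMan job.

    Raising the file descriptor limit (and restarting the schedd, if
    needed) has to happen between the two commands and is not done here.
    """
    return [
        ["condor_hold", dagman_job_id],
        ["condor_release", dagman_job_id],
    ]


def run_step(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    hold, release = recovery_commands("1234.0")  # hypothetical job ID
    run_step(hold)
    # Manual step: raise the descriptor limit and restart the schedd here.
    run_step(release)
```

By default this only prints the commands, so you can check them before running anything against the queue.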

Kent Wenger
Condor Team