Re: [Condor-users] DAG File descriptor panic when quota is exceeded



On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:

> Things were working fine until I ran out of quota. For example, at 4:30pm yesterday I hit the high water mark for monitoring log files for this DAG:
>
> 12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
>
> The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22, respectively (or thereabouts). I have the file and proc limits set pretty high now.
>
> /etc/security/limits.conf:
> *               hard     nofile           40000
> *               soft     nofile           40000
> *               hard     nproc            20000
> *               soft     nproc            20000
>
> $ ulimit -H -a:
> open files                      (-n) 40000
> max user processes              (-u) 20000
>
> The machine has been restarted since these changes were made, to be sure all daemon processes inherited the setting.
>
>> The best thing to do is condor_hold the DAGMan job, increase the file descriptor limit, and then condor_release the DAGMan job. This will put DAGMan into recovery mode, which will automatically read the logs to figure out what jobs had already been run, so you don't have to try to re-create a subset of your original DAG.
>
> I'll try that next time. I've already condor_rm'ed the DAG since it was just looping on crash and restart without actually submitting any jobs. So condor_hold will create the rescue DAG? What happens to running jobs? Are they suspended/aborted? This is all in a Condor-G context.

There are two separate situations: recovery mode and rescue DAG (this always gets complicated to explain). When you condor_rm a DAGMan job, it tries to condor_rm all of the node jobs, and creates a rescue DAG. The rescue DAG has nodes marked DONE to record the progress of the DAG. When you re-run the DAG, it automatically runs the rescue DAG (for fairly recent versions of DAGMan -- for older versions you have to specify the rescue DAG file on the condor_submit_dag command line). When the rescue DAG is run, nodes are marked DONE as the DAG file is parsed, and then execution continues.
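
To picture what "marked DONE" means: in the DAG file syntax, a completed node carries the DONE keyword on its JOB line. A minimal sketch for a hypothetical two-node DAG (the node and submit file names here are invented):

  JOB  NodeA  nodeA.sub  DONE
  JOB  NodeB  nodeB.sub
  PARENT NodeA CHILD NodeB

When the rescue DAG is parsed, NodeA is treated as already complete and only NodeB is actually submitted.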

In recovery mode, there is no rescue DAG -- DAGMan re-reads the individual node job log files to "catch up" to the state of the jobs. (DAGMan goes into recovery mode after a condor_hold followed by a condor_release. The condor_hold suspends the DAGMan process itself, but it doesn't stop the currently-running node jobs.)
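
As a sketch of that hold/increase/release sequence (1234 is a placeholder for the cluster id of your condor_dagman job, which condor_q will show):

  condor_hold 1234       # suspends the DAGMan process; node jobs keep running
  (raise the fd limit -- e.g. nofile in /etc/security/limits.conf -- and make sure the daemons have picked it up, as discussed above)
  condor_release 1234    # DAGMan comes back in recovery mode and re-reads the node job logs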

Okay, one thing to do is run condor_check_userlogs on the log files of the node jobs. That should tell you if the log files themselves are corrupted. (Depending on how many log files you have, you may not be able to run condor_check_userlogs on all of them at once; but it's fine to run condor_check_userlogs a number of times on different sets of log files.)
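
For example, assuming the node job log files all live under one directory and end in .log (adjust the path and pattern for your setup), you could check them a few hundred at a time:

  find /path/to/logs -name '*.log' | xargs -n 200 condor_check_userlogs

Each xargs invocation runs condor_check_userlogs on a different set of 200 log files, which matches the "different sets of log files" approach above.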

I'm kind of surprised you're getting the 'out of file descriptors' problem after changing the limits. It wouldn't surprise me that much that you got errors reading events, but you shouldn't be running out of file descriptors. There are at least two things to check:
1) The results of running condor_check_userlogs on the log files.
2) How many log files DAGMan says it's monitoring before it runs out of fds.
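
For item 2, the count shows up in the dagman.out file, as in the "Currently monitoring 1769 Condor log file(s)" line quoted above. Assuming the default name (<your DAG file>.dagman.out; mydag.dag is a placeholder here), something like this shows the last few counts before the failure:

  grep 'Currently monitoring' mydag.dag.dagman.out | tail -5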

You should probably also set DAGMAN_DEBUG to D_FDS -- that will let you see which fd DAGMan is up to if it runs out again. You can do this either by setting DAGMAN_DEBUG in your config file or by setting _CONDOR_DAGMAN_DEBUG in your environment before you run condor_submit_dag.
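
A sketch of both options (mydag.dag is a placeholder for your DAG file; the export line is sh/bash syntax):

  # in your Condor config file:
  DAGMAN_DEBUG = D_FDS

  # or, for one submission only, via the environment:
  export _CONDOR_DAGMAN_DEBUG=D_FDS
  condor_submit_dag mydag.dag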

Kent Wenger
Condor Team