
Re: [Condor-users] DAG File descriptor panic when quota is exceeded



Kent,

R. Kent Wenger wrote:
12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
12/22 02:27:12 Node 2vv
...
If you are able to increase the file descriptor limit, things should work.

Things were working fine until I ran out of quota. For example, at 4:30pm yesterday I hit the high water mark for monitoring log files for this DAG:

12/21 16:29:38 Currently monitoring 1769 Condor log file(s)

The first DAG restart and the first PANIC happened at about 2:29am and 2:35am on 12/22, respectively. I have the file and proc limits set pretty high now.

/etc/security/limits.conf:
*               hard     nofile           40000
*               soft     nofile           40000
*               hard     nproc            20000
*               soft     nproc            20000

$ ulimit -H -a:
open files                      (-n) 40000
max user processes              (-u) 20000

The machine has been restarted since these changes were made, to be sure all daemon processes inherited the setting.
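As a rough cross-check of the numbers above, here is a minimal Python sketch that compares a descriptor limit with the number of log files being monitored. It assumes one open descriptor per monitored log file, which is an illustrative simplification and not a statement about how DAGMan actually manages its logs. The helper name fd_headroom is hypothetical.

```python
import resource


def fd_headroom(files_needed, soft_limit=None):
    """Return how many descriptors would remain after opening files_needed files.

    Assumes one descriptor per file (an illustrative simplification).
    If soft_limit is None, the current process's soft RLIMIT_NOFILE is used.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return float("inf")
    return soft_limit - files_needed


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft nofile limit:", soft, "hard nofile limit:", hard)
    # 1769 is the peak number of monitored log files reported above.
    print("headroom for 1769 log files:", fd_headroom(1769))
```

A negative result means the limit seen by the process is too low for that many files. Note that this reports the limit of the Python process itself, which is not necessarily the limit inherited by a daemon started at boot.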

The best thing to do is condor_hold the DAGMan job, increase the file descriptor limit, and then condor_release the DAGMan job. This will put DAGMan into recovery mode, which will automatically read the logs to figure out what jobs had already been run, so you don't have to try to re-create a subset of your original DAG.

I'll try that next time. I've already condor_rm'ed the DAG since it was just looping on crash and restart without actually submitting any jobs. So condor_hold will create the rescue DAG? What happens to running jobs? Are they suspended/aborted? This is all in a Condor-G context.

I think you may also have to restart the condor_schedd after changing the file descriptor limit (before condor_releasing the DAGMan job), so that the schedd gets the new limit, and therefore the new DAGMan process is forked with the new limit.

Machine was rebooted.
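
One way to confirm that an already-running daemon picked up the new limit is to read its /proc/<pid>/limits file. The following is a minimal sketch, assuming Linux and Python 3. The function names are hypothetical, and the PID of the process to inspect (for example the schedd) has to be supplied by the caller.

```python
def parse_max_open_files(limits_text):
    """Extract (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Expected layout: Max open files <soft> <hard> files
            fields = line.split()
            soft, hard = fields[3], fields[4]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def open_file_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_max_open_files(handle.read())
```

For example, open_file_limits(pid) returning (40000, 40000) for the daemon's PID would indicate that it inherited the values set in limits.conf.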

--
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
SBGrid, Harvard Medical School             F: +1 617 432-5600