
[Condor-users] DAG File descriptor panic when quota is exceeded



I ran out of quota on a disk last night (around 2am) while running a large DAG. I raised the quota at 9am (from 10GB to 30GB), but the running DAG is still reporting file descriptor panics and DAGMan keeps crashing. Is this expected? I suspect some of the temporary/recovery files it uses for the restart may be corrupted. Is there any way to test this? We had finished 30k of the 100k nodes in the DAG, so it would be nice not to have to restart the entire DAG (although I could write a script to regenerate the DAG with only the nodes that did not complete).
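
A minimal sketch of what such a fallback script might look like, in Python. It assumes a set of completed node names is already available (extracting that from dagman.out or the job logs is not shown), and instead of deleting nodes it appends the DONE keyword that DAGMan accepts at the end of a JOB line, so the PARENT/CHILD lines can stay untouched. The function name and the example node and submit file names are hypothetical.

```python
def mark_completed_nodes(dag_lines, completed):
    """Return DAG lines with DONE appended to the JOB lines of completed nodes.

    dag_lines: iterable of lines from the original .dag file
    completed: set of node names known to have finished (assumed input)
    """
    out = []
    for line in dag_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split()
        # A JOB line has at least: JOB <node name> <submit file>
        is_job = len(fields) >= 3 and fields[0].upper() == "JOB"
        # Skip lines that already end in DONE so the script can be re-run.
        if is_job and fields[1] in completed and fields[-1].upper() != "DONE":
            stripped += " DONE"
        out.append(stripped)
    return out


if __name__ == "__main__":
    # Hypothetical three-node DAG; node and file names are illustrative only.
    dag = [
        "JOB 2vv 2vv.sub",
        "JOB 3ab 3ab.sub",
        "JOB final final.sub",
        "PARENT 2vv 3ab CHILD final",
    ]
    for line in mark_completed_nodes(dag, {"2vv"}):
        print(line)
```

Marking nodes DONE rather than removing them avoids having to rewrite the dependency lines, but whether the list of completed nodes can be trusted after the quota failure is a separate question.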

Suggestions on the best way to recover from this situation would be greatly appreciated.

TIA.

Ian

Summary of dagman.out log file follows.

At around 2:30am it looks like DAGMan fell over. The last entries before that point are:

12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
12/22 02:27:12 Node 2vv

Following 12 minutes of silence, DAGMan restarted:

12/22 02:39:40 ******************************************************
12/22 02:39:40 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP
12/22 02:39:40 ** /opt/osg-shared/se/app/site/condor-7.4.0/bin/condor_dagman
12/22 02:39:40 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
12/22 02:39:40 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
12/22 02:39:40 ** $CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $
12/22 02:39:40 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/22 02:39:40 ** PID = 2199
12/22 02:39:40 ** Log last touched 12/22 02:39:34
12/22 02:39:40 ******************************************************

It then outputs some more DAGMan configuration details, and a minute later gives a PANIC message and appears to crash again. It is then restarted 13 minutes later.

12/22 02:40:35 Duplicate DAGMan PID 5305 is no longer alive; this DAGMan should continue.
12/22 02:40:35 Sleeping for 12 seconds to ensure ProcessId uniqueness
12/22 02:40:47 Running in RECOVERY mode...
**** PANIC -- OUT OF FILE DESCRIPTORS at line 846 in dprintf.c
12/22 02:53:50 ******************************************************
12/22 02:53:50 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP
12/22 02:53:50 ** /opt/osg-shared/se/app/site/condor-7.4.0/bin/condor_dagman
12/22 02:53:50 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
12/22 02:53:50 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
12/22 02:53:50 ** $CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $
12/22 02:53:50 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/22 02:53:50 ** PID = 7127
12/22 02:53:50 ** Log last touched 12/22 02:50:20
12/22 02:53:50 ****************************************12/22 02:53:50 Using config source: /opt/osg-shared/se/app/site/condor/etc/condor_config
12/22 02:53:50 Using local config sources:
12/22 02:53:50    /opt/osg-local/condor/condor_config.local
12/22 02:53:50 DaemonCore: Command Socket at <10.0.10.39:37797>
12/22 02:53:50 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
12/22 02:53:50 DAGMAN_DEBUG_CACHE_ENABLE setting: False
12/22 0212/22 03:21:43 ******************************************************
12/22 03:21:43 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP
12/22 03:21:43 ** /opt/osg-shared/se/app/site/condor-7.4.0/bin/condor_dagman
12/22 03:21:43 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
12/22 03:21:43 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
12/22 03:21:43 ** $CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $
12/22 03:21:43 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/22 03:21:43 ** PID = 18043
12/22 03:21:43 ** Log last touched 12/22 03:21:37
12/22 03:21:43 ********************************************************************


This instance of DAGMan doesn't make it very far before (apparently) crashing again, this time taking almost 30 minutes to restart. It then goes into a loop, crashing almost immediately (within a minute) and restarting 5-20 minutes later.
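
To put numbers on the restart loop, something like the following Python sketch could tabulate the intervals between successive "STARTING UP" banners in dagman.out. It relies only on the "MM/DD HH:MM:SS" timestamp prefix and the "STARTING UP" text visible in the excerpts above. The log lines carry no year, so the `year` parameter is an assumption, and the function names are hypothetical. Note that it measures startup-to-startup time, not crash-to-restart time.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpts, e.g. "12/22 02:39:40 "
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def restart_times(log_lines, year=2009):
    """Return a datetime for every line containing a 'STARTING UP' banner.

    The year is not in the log, so it is supplied by the caller (assumed).
    """
    times = []
    for line in log_lines:
        m = STAMP.match(line)
        if m and "STARTING UP" in line:
            stamp = f"{year}/{m.group(1)}"
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return times


def restart_gaps_minutes(times):
    """Return the minutes elapsed between each pair of consecutive startups."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    # The three startup banners quoted in the summary above.
    lines = [
        "12/22 02:39:40 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP",
        "12/22 02:53:50 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP",
        "12/22 03:21:43 ** condor_scheduniv_exec.87001.0 (CONDOR_DAGMAN) STARTING UP",
    ]
    gaps = restart_gaps_minutes(restart_times(lines))
    print([round(g, 1) for g in gaps])  # prints [14.2, 27.9]
```

Run over the full dagman.out, this would show whether the 5-20 minute restart interval is steady or drifting.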

--
Ian Stokes-Rees, Research Associate
SBGrid, Harvard Medical School
http://sbgrid.org