
[Condor-users] condor_dagman.exe in idle after submit jobs completed



I am running a DAG that has taken approximately 6 days. All the submitted 
jobs completed last night, but condor_dagman.exe has not exited and is in 
the idle state. Over the 6 days I noticed that condor_dagman.exe would 
transition in and out of idle (jobs were always running, however). Has 
anyone else had similar problems? Our pool consists of Windows platforms 
only. 

Is there a way to get the DAG to complete, and does anyone have any ideas 
as to what might be causing this?

thanks,
mike

Here is an excerpt from the dagman.out file, but I do not see any 
problems.

06/14/11 07:55:20 ******************************************************
06/14/11 07:55:20 ** condor_scheduniv_exec.55529.0 (CONDOR_DAGMAN) STARTING UP
06/14/11 07:55:20 ** C:\Condor\bin\condor_dagman.exe
06/14/11 07:55:20 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
06/14/11 07:55:20 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
06/14/11 07:55:20 ** $CondorVersion: 7.6.0 Apr 16 2011 BuildID: 327460 $
06/14/11 07:55:20 ** $CondorPlatform: x86_winnt_5.1 $
06/14/11 07:55:20 ** PID = 2140
06/14/11 07:55:20 ** Log last touched 6/14 06:50:31
06/14/11 07:55:20 ******************************************************
06/14/11 07:55:20 Using config source: \\igskbacbfssim\condor$\Secured_Config\Condor_Config\Global\FORTcondor_config
06/14/11 07:55:20 Using local config sources: 
06/14/11 07:55:20 \\igskbacbfssim\condor$\Secured_Config\Condor_Config\Local\condor_config_IGSKBACBWS407.local
06/14/11 07:55:20 LISTEN <IP> fd=612
06/14/11 07:55:20 CONNECT bound to <IP> fd=608 peer=<IP>
06/14/11 07:55:20 ACCEPT bound to <IP> fd=32 peer=<IP>
06/14/11 07:55:20 CLOSE <IP> fd=612
06/14/11 07:55:20 LISTEN <IP> fd=612
06/14/11 07:55:20 DaemonCore: private command socket at <IP>
06/14/11 07:55:20 Setting maximum accepts per cycle 4.
06/14/11 07:55:20 DAGMAN_VERBOSITY setting: 3
06/14/11 07:55:20 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
06/14/11 07:55:20 DAGMAN_DEBUG_CACHE_ENABLE setting: False
06/14/11 07:55:20 DAGMAN_SUBMIT_DELAY setting: 0
06/14/11 07:55:20 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
06/14/11 07:55:20 DAGMAN_STARTUP_CYCLE_DETECT setting: False
06/14/11 07:55:20 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
06/14/11 07:55:20 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
06/14/11 07:55:20 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
06/14/11 07:55:20 DAGMAN_RETRY_SUBMIT_FIRST setting: True
06/14/11 07:55:20 DAGMAN_RETRY_NODE_FIRST setting: False
06/14/11 07:55:20 DAGMAN_MAX_JOBS_IDLE setting: 0
06/14/11 07:55:20 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
06/14/11 07:55:20 DAGMAN_MAX_PRE_SCRIPTS setting: 0
06/14/11 07:55:20 DAGMAN_MAX_POST_SCRIPTS setting: 0
06/14/11 07:55:20 DAGMAN_ALLOW_LOG_ERROR setting: False
06/14/11 07:55:20 DAGMAN_MUNGE_NODE_NAMES setting: True
06/14/11 07:55:20 DAGMAN_PROHIBIT_MULTI_JOBS setting: False
06/14/11 07:55:20 DAGMAN_SUBMIT_DEPTH_FIRST setting: False
06/14/11 07:55:20 DAGMAN_ABORT_DUPLICATES setting: True
06/14/11 07:55:20 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True
06/14/11 07:55:20 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
06/14/11 07:55:20 DAGMAN_AUTO_RESCUE setting: True
06/14/11 07:55:20 DAGMAN_MAX_RESCUE_NUM setting: 100
06/14/11 07:55:20 DAGMAN_DEFAULT_NODE_LOG setting: null
06/14/11 07:55:20 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True
06/14/11 07:55:20 ALL_DEBUG setting: D_COMMAND D_NETWORK
06/14/11 07:55:20 DAGMAN_DEBUG setting: 
06/14/11 07:55:20 argv[0] == "condor_scheduniv_exec.55529.0"
06/14/11 07:55:20 argv[1] == "-Lockfile"
06/14/11 07:55:20 argv[2] == "GSLIB_DAG.dag.lock"
06/14/11 07:55:20 argv[3] == "-AutoRescue"
06/14/11 07:55:20 argv[4] == "1"
06/14/11 07:55:20 argv[5] == "-DoRescueFrom"
06/14/11 07:55:20 argv[6] == "0"
06/14/11 07:55:20 argv[7] == "-Dag"
06/14/11 07:55:20 argv[8] == "GSLIB_DAG.dag"
06/14/11 07:55:20 argv[9] == "-CsdVersion"
06/14/11 07:55:20 argv[10] == "$CondorVersion: 7.6.0 Apr 16 2011 BuildID: 327460 $"
06/14/11 07:55:20 argv[11] == "-Dagman"
06/14/11 07:55:20 argv[12] == "C:\Condor\bin\condor_dagman.exe"
06/14/11 07:55:20 Default node log file is: <\\igskbacbfssim\gissim$\PrjRas\CondorFiles\Submits\Simulations_Step4\GSLIB_DAG.dag.nodes.log>
06/14/11 07:55:20 DAG Lockfile will be written to GSLIB_DAG.dag.lock
06/14/11 07:55:20 DAG Input file is GSLIB_DAG.dag
06/14/11 07:55:20 Parsing 1 dagfiles
06/14/11 07:55:20 Parsing GSLIB_DAG.dag ...
06/14/11 07:55:20 Dag contains 1080 total jobs
06/14/11 07:55:20 Lock file GSLIB_DAG.dag.lock detected, 
06/14/11 07:55:20 Duplicate DAGMan PID 796 is no longer alive; this DAGMan should continue.
06/14/11 07:55:20 Sleeping for 12 seconds to ensure ProcessId uniqueness
06/14/11 07:55:32 WARNING: ProcessId not confirmed unique
06/14/11 07:55:32 Bootstrapping...
06/14/11 07:55:32 Number of pre-completed nodes: 0
06/14/11 07:55:32 Running in RECOVERY mode... 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
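
The excerpt above shows DAGMan starting up, finding a stale lock file from PID 796, and entering RECOVERY mode, which suggests it was restarted at least once. A rough way to see how often that happened over the 6 days is to count the startup banners and RECOVERY lines in the full dagman.out file. The Python sketch below is only an illustration: the function name `summarize_restarts` is made up for this example, and the patterns are based solely on the line formats visible in the excerpt, so they may need adjusting for other Condor versions.

```python
import re

# Timestamp prefix as it appears in the excerpt, e.g. "06/14/11 07:55:20".
_TS = r"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d)"

# Startup banner line, e.g.
# "06/14/11 07:55:20 ** condor_scheduniv_exec.55529.0 (CONDOR_DAGMAN) STARTING UP"
STARTUP = re.compile(r"^" + _TS + r" \*\* .*\(CONDOR_DAGMAN\)")

# Recovery line, e.g. "06/14/11 07:55:32 Running in RECOVERY mode... "
RECOVERY = re.compile(r"^" + _TS + r" Running in RECOVERY mode")


def summarize_restarts(lines):
    """Return (startup_timestamps, recovery_timestamps) found in the log lines.

    `lines` is any iterable of strings, such as an open file object.
    (Hypothetical helper, not part of Condor.)
    """
    startups, recoveries = [], []
    for line in lines:
        match = STARTUP.match(line)
        if match:
            startups.append(match.group(1))
            continue
        match = RECOVERY.match(line)
        if match:
            recoveries.append(match.group(1))
    return startups, recoveries
```

To use it, pass an open log file, for example `summarize_restarts(open(path, errors="replace"))`, where `path` points at the dagman.out file for this DAG. Several startups followed by RECOVERY lines would line up with the in-and-out-of-idle behavior described above, but the counts alone do not explain why DAGMan is not exiting.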