Re: [Condor-users] Condor DAG spinning



On Tue, 27 Oct 2009, Steve Shaw wrote:

I've got an issue where, with a sufficient number of jobs in a DAG, DAGMan keeps crashing but stays running. There are about 1900 jobs in the DAG, and about 500 complete successfully. In the end, the only thing left in my queue is the DAG itself.

10/27 10:17:20 Parsing C:\temp\condor\condor_68353.dag ...
10/27 10:17:21 Dag contains 1903 total jobs
10/27 10:17:21 Lock file C:\temp\condor\condor_68353.dag.lock detected,
10/27 10:17:21 Duplicate DAGMan PID 5708 is no longer alive; this DAGMan should continue.
10/27 10:17:21 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/27 10:17:33 WARNING: ProcessId not confirmed unique
10/27 10:17:33 Bootstrapping...
10/27 10:17:33 Number of pre-completed nodes: 0
10/27 10:17:33 Running in RECOVERY mode...
10/27 10:17:36 ******************************************************
10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP
10/27 10:17:36 ** C:\condor\bin\condor_dagman.exe
10/27 10:17:36 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
10/27 10:17:36 ** $CondorPlatform: INTEL-WINNT50 $
10/27 10:17:36 ** PID = 1948
10/27 10:17:37 ** Log last touched 10/27 09:17:34
10/27 10:17:37 ******************************************************
10/27 10:17:37 Using config source: C:\condor\condor_config
10/27 10:17:37 Using local config sources:
10/27 10:17:37    C:\condor\condor_config.local
10/27 10:17:37 DaemonCore: Command Socket at <10.10.242.54:1795>
10/27 10:17:37 DAGMAN_SUBMIT_DELAY setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/27 10:17:37 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/27 10:17:37 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/27 10:17:37 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/27 10:17:37 DAGMAN_RETRY_NODE_FIRST setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_IDLE setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/27 10:17:37 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/27 10:17:37 DAGMAN_DELETE_OLD_LOGS setting: 1
10/27 10:17:37 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/27 10:17:37 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/27 10:17:37 DAGMAN_ABORT_DUPLICATES setting: 1
10/27 10:17:37 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/27 10:17:37 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/27 10:17:37 argv[0] == "condor_scheduniv_exec.4250.0"
10/27 10:17:37 argv[1] == "-Debug"
10/27 10:17:37 argv[2] == "3"
10/27 10:17:37 argv[3] == "-Lockfile"
10/27 10:17:37 argv[4] == "C:\temp\condor\condor_68353.dag.lock"
10/27 10:17:37 argv[5] == "-Condorlog"
10/27 10:17:37 argv[6] == "C:\temp\condor\condor_49152.log"
10/27 10:17:37 argv[7] == "-Dag"
10/27 10:17:37 argv[8] == "C:\temp\condor\condor_68353.dag"
10/27 10:17:37 argv[9] == "-Rescue"
10/27 10:17:37 argv[10] == "C:\temp\condor\condor_68353.dag.rescue"
10/27 10:17:37 DAG Lockfile will be written to C:\temp\condor\condor_68353.dag.lock
10/27 10:17:37 DAG Input file is C:\temp\condor\condor_68353.dag
10/27 10:17:37 Rescue DAG will be written to C:\temp\condor\condor_68353.dag.rescue

... then it lists all of the log files:
10/27 10:17:38   C:\temp\condor\condor_49152.log (Condor)
10/27 10:17:38   C:\temp\condor\condor_81924.log (Condor)
...

Then all of this repeats seconds later ... this log grew huge! :)
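
To put a number on the restart loop described above, a short script can count how often the startup banner appears in the DAGMan debug log. This is only a rough sketch: the function name `summarize_restarts` is made up for illustration, the log path is taken from the command line (the `.dagman.out` name in the usage line is an assumption), and the patterns are based solely on the line formats shown in the excerpt above.

```python
import re
import sys

# Patterns derived from the log excerpt in this thread (assumed stable):
#   "10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP"
#   "10/27 10:17:36 ** PID = 1948"
TIMESTAMP = r"\d{2}/\d{2} \d{2}:\d{2}:\d{2}"
BANNER = re.compile(r"^(" + TIMESTAMP + r") \*\* .*\(CONDOR_DAGMAN\) STARTING UP")
PID = re.compile(r"^" + TIMESTAMP + r" \*\* PID = (\d+)")


def summarize_restarts(lines):
    """Collect startup timestamps, PIDs, and recovery-mode entries (hypothetical helper)."""
    starts = []      # timestamp of every STARTING UP banner
    pids = []        # PID reported after each banner
    recoveries = 0   # number of "Running in RECOVERY mode" lines
    for line in lines:
        match = BANNER.match(line)
        if match:
            starts.append(match.group(1))
            continue
        match = PID.match(line)
        if match:
            pids.append(int(match.group(1)))
            continue
        if "Running in RECOVERY mode" in line:
            recoveries += 1
    return {"starts": starts, "pids": pids, "recoveries": recoveries}


if __name__ == "__main__":
    # Usage (path is an assumption): python count_restarts.py condor_68353.dag.dagman.out
    with open(sys.argv[1], errors="replace") as log:
        summary = summarize_restarts(log)
    print("startups:", len(summary["starts"]))
    print("recovery-mode entries:", summary["recoveries"])
    print("distinct PIDs:", len(set(summary["pids"])))
```

Many startups with different PIDs close together in time would support the "crash and restart" reading; the timestamps in `starts` show how quickly the cycle repeats.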

Should I increase the maxjobs in the condor dag submission to get this rolling? Sorry to ask such a general question, but I'm hoping somebody can explain to me what's going on in this case or cases like this?

(This is with Condor 7.0.4, so I'm back a few minor releases -- maybe it's time to upgrade).

Hmm -- 7.0.4 *is* pretty old. I'd say the first thing to try is installing newer condor_dagman and condor_submit_dag binaries. You can just upgrade those two binaries without upgrading the rest of your Condor installation if you want to.

I'd recommend going to either 7.2.4 (if you want to stay with a stable release) or 7.3.1. (7.3.2 has a problem with rescue DAGs, which has been fixed for the upcoming 7.4.0.)

If you still get the problem with a newer DAGMan version, please let us know and we'll look into things further.

Kent Wenger
Condor Team