
[Condor-users] Condor DAG spinning



Hi all,

I've got an issue where, with a sufficient number of jobs in a DAG, DAGMan repeatedly crashes and restarts.  There are about 1900 jobs in the DAG, and about 500 complete successfully.  In the end, the only thing left in my queue is the DAG job itself.

10/27 10:17:20 Parsing C:\temp\condor\condor_68353.dag ...
10/27 10:17:21 Dag contains 1903 total jobs
10/27 10:17:21 Lock file C:\temp\condor\condor_68353.dag.lock detected,
10/27 10:17:21 Duplicate DAGMan PID 5708 is no longer alive; this DAGMan should continue.
10/27 10:17:21 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/27 10:17:33 WARNING: ProcessId not confirmed unique
10/27 10:17:33 Bootstrapping...
10/27 10:17:33 Number of pre-completed nodes: 0
10/27 10:17:33 Running in RECOVERY mode...
10/27 10:17:36 ******************************************************
10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP
10/27 10:17:36 ** C:\condor\bin\condor_dagman.exe
10/27 10:17:36 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
10/27 10:17:36 ** $CondorPlatform: INTEL-WINNT50 $
10/27 10:17:36 ** PID = 1948
10/27 10:17:37 ** Log last touched 10/27 09:17:34
10/27 10:17:37 ******************************************************
10/27 10:17:37 Using config source: C:\condor\condor_config
10/27 10:17:37 Using local config sources:
10/27 10:17:37    C:\condor\condor_config.local
10/27 10:17:37 DaemonCore: Command Socket at <10.10.242.54:1795>
10/27 10:17:37 DAGMAN_SUBMIT_DELAY setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/27 10:17:37 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/27 10:17:37 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/27 10:17:37 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/27 10:17:37 DAGMAN_RETRY_NODE_FIRST setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_IDLE setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/27 10:17:37 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/27 10:17:37 DAGMAN_DELETE_OLD_LOGS setting: 1
10/27 10:17:37 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/27 10:17:37 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/27 10:17:37 DAGMAN_ABORT_DUPLICATES setting: 1
10/27 10:17:37 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/27 10:17:37 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/27 10:17:37 argv[0] == "condor_scheduniv_exec.4250.0"
10/27 10:17:37 argv[1] == "-Debug"
10/27 10:17:37 argv[2] == "3"
10/27 10:17:37 argv[3] == "-Lockfile"
10/27 10:17:37 argv[4] == "C:\temp\condor\condor_68353.dag.lock"
10/27 10:17:37 argv[5] == "-Condorlog"
10/27 10:17:37 argv[6] == "C:\temp\condor\condor_49152.log"
10/27 10:17:37 argv[7] == "-Dag"
10/27 10:17:37 argv[8] == "C:\temp\condor\condor_68353.dag"
10/27 10:17:37 argv[9] == "-Rescue"
10/27 10:17:37 argv[10] == "C:\temp\condor\condor_68353.dag.rescue"
10/27 10:17:37 DAG Lockfile will be written to C:\temp\condor\condor_68353.dag.lock
10/27 10:17:37 DAG Input file is C:\temp\condor\condor_68353.dag
10/27 10:17:37 Rescue DAG will be written to C:\temp\condor\condor_68353.dag.rescue
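
For anyone comparing settings across runs, the `DAGMAN_* setting:` lines above can be pulled into a dictionary. This is only a rough sketch that assumes the line format shown in this log; the function name `read_dagman_settings` is made up for illustration, and the `allow_events (...)` line is deliberately skipped because it does not start with `DAGMAN_`.

```python
import re

# Matches lines such as "10/27 10:17:37 DAGMAN_MAX_JOBS_SUBMITTED setting: 0".
# Assumption: every setting of interest is an integer, as in the log above.
SETTING = re.compile(
    r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} (DAGMAN_[A-Z_]+) setting: (-?\d+)\s*$"
)

def read_dagman_settings(lines):
    """Return {setting name: integer value} for each DAGMAN_* setting line."""
    settings = {}
    for line in lines:
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings
```

In the log above this would report `DAGMAN_MAX_JOBS_SUBMITTED` and `DAGMAN_MAX_JOBS_IDLE` as 0, which, as far as I understand the DAGMan documentation, means no limit is being applied.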

... then it lists all of the log files:
10/27 10:17:38   C:\temp\condor\condor_49152.log (Condor)
10/27 10:17:38   C:\temp\condor\condor_81924.log (Condor)
...

Then all of this repeats seconds later ...  this log grew huge! :)
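
To put a number on how often DAGMan restarts, the startup banners in the log can be counted. The sketch below assumes the banner format shown above (`** ... (CONDOR_DAGMAN) STARTING UP`); the function names are hypothetical, and because the log timestamps carry no year, a placeholder year is used purely so that gaps between restarts can be computed.

```python
import re
from datetime import datetime

# Matches the banner DAGMan prints at startup, e.g.
# "10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP"
BANNER = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \*\* .*\(CONDOR_DAGMAN\) STARTING UP"
)

def find_restarts(lines, year=2000):
    """Return a datetime for every startup banner found in the log lines.

    The year is a placeholder (the log does not record one); it only
    matters for computing differences between timestamps.
    """
    stamps = []
    for line in lines:
        match = BANNER.match(line)
        if match:
            stamps.append(
                datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
            )
    return stamps

def restart_gaps(stamps):
    """Return the number of seconds between consecutive restarts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

Passing an open file handle for the DAGMan output file to `find_restarts` gives the restart count as `len(...)`, and `restart_gaps` shows whether the restarts come at a regular interval.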

Should I increase maxjobs in the DAG submission to get this rolling?  Sorry to ask such a general question, but I'm hoping somebody can explain what's going on in this case, or in cases like this.

(This is with Condor 7.0.4, so I'm back a few minor releases -- maybe it's time to upgrade).

Appreciate the help as always :).

Steve

