
Re: [Condor-users] Condor DAG spinning



Thanks for the quick response, Kent.

I tried the 7.2.4 release and submitted 1000 Python jobs in a single no-dependency DAG; each job just created a file and exited. After running condor_submit_dag on the DAG, I got 509 files back, and then my DAG job appeared to get stuck and started idling (with the 7.0.4 build, I could swear it remained 'running' but otherwise behaved the same). Looking at the DAG's lib.err file, it had the error:

dprintf() had a fatal error in pid 8620
Can't open "bigjob.dag.dagman.out"
errno: 24 (Too many open files)

and the bigjob.dag.dagman.out file mentioned there keeps growing with output like the following:

10/27 21:13:44 Parsing 1 dagfiles
10/27 21:13:44 Parsing bigjob.dag ...
10/27 21:13:44 Dag contains 1000 total jobs
10/27 21:13:44 Lock file bigjob.dag.lock detected,
10/27 21:13:44 Duplicate DAGMan PID 10964 is no longer alive; this DAGMan should continue.
10/27 21:13:44 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/27 21:13:56 WARNING: ProcessId not confirmed unique
10/27 21:13:56 Bootstrapping...
10/27 21:13:56 Number of pre-completed nodes: 0
10/27 21:13:56 Running in RECOVERY mode...
10/27 21:18:33 ******************************************************
10/27 21:18:33 ** condor_scheduniv_exec.21.0 (CONDOR_DAGMAN) STARTING UP
10/27 21:18:33 ** C:\condor\bin\condor_dagman.exe
10/27 21:18:33 ** SubsystemInfo: name=DAGMAN type=DAEMON(10) class=DAEMON(1)
10/27 21:18:33 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
10/27 21:18:33 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 21:18:33 ** $CondorPlatform: INTEL-WINNT50 $
10/27 21:18:33 ** PID = 10284
10/27 21:18:33 ** Log last touched 10/27 20:13:56
10/27 21:18:33 ******************************************************
10/27 21:18:33 Using config source: C:\condor\condor_config
10/27 21:18:33 Using local config sources:
10/27 21:18:33    C:\condor\condor_config.local
10/27 21:18:33 DaemonCore: Command Socket at <10.10.242.111:4214>
10/27 21:18:33 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
10/27 21:18:33 DAGMAN_DEBUG_CACHE_ENABLE setting: False
10/27 21:18:33 DAGMAN_SUBMIT_DELAY setting: 0
10/27 21:18:33 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/27 21:18:34 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/27 21:18:34 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/27 21:18:34 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/27 21:18:34 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/27 21:18:34 DAGMAN_RETRY_NODE_FIRST setting: 0
10/27 21:18:34 DAGMAN_MAX_JOBS_IDLE setting: 0
10/27 21:18:34 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/27 21:18:34 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/27 21:18:34 DAGMAN_DELETE_OLD_LOGS setting: 1
10/27 21:18:34 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/27 21:18:34 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/27 21:18:34 DAGMAN_ABORT_DUPLICATES setting: 1
10/27 21:18:34 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/27 21:18:34 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/27 21:18:34 DAGMAN_AUTO_RESCUE setting: 1
10/27 21:18:34 DAGMAN_MAX_RESCUE_NUM setting: 100
10/27 21:18:34 argv[0] == "condor_scheduniv_exec.21.0"
10/27 21:18:34 argv[1] == "-Debug"
10/27 21:18:34 argv[2] == "3"
10/27 21:18:34 argv[3] == "-Lockfile"
10/27 21:18:34 argv[4] == "bigjob.dag.lock"
10/27 21:18:34 argv[5] == "-AutoRescue"
10/27 21:18:34 argv[6] == "1"
10/27 21:18:34 argv[7] == "-DoRescueFrom"
10/27 21:18:34 argv[8] == "0"
10/27 21:18:34 argv[9] == "-Dag"
10/27 21:18:34 argv[10] == "bigjob.dag"
10/27 21:18:34 argv[11] == "-CsdVersion"
10/27 21:18:34 argv[12] == "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
10/27 21:18:34 DAG Lockfile will be written to bigjob.dag.lock
10/27 21:18:34 DAG Input file is bigjob.dag
10/27 21:18:34 All DAG node user log files:
10/27 21:18:34   C:\condor\jobs\bigjob\bigjob1.log (Condor)
10/27 21:18:34   C:\condor\jobs\bigjob\bigjob2.log (Condor)
etc...
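For reference, the DAG was generated along these lines (a paraphrased sketch, not my exact script; the executable name touchfile.py and the job/file names are illustrative):

```python
# Minimal sketch of the test setup: one submit file per node, each with its
# own user log (matching the bigjob1.log, bigjob2.log, ... files listed in
# the dagman.out above), plus a bigjob.dag with 1000 independent JOB lines.
import os

N_JOBS = 1000
os.makedirs("bigjob", exist_ok=True)

with open(os.path.join("bigjob", "bigjob.dag"), "w") as dag:
    for i in range(1, N_JOBS + 1):
        sub_name = f"bigjob{i}.sub"
        with open(os.path.join("bigjob", sub_name), "w") as sub:
            sub.write("universe = vanilla\n")
            sub.write("executable = touchfile.py\n")  # just creates a file and exits
            sub.write(f"arguments = out{i}.txt\n")
            sub.write(f"log = bigjob{i}.log\n")       # a separate user log per node
            sub.write("queue\n")
        dag.write(f"JOB job{i} {sub_name}\n")         # no PARENT/CHILD lines at all
```

Since every node has its own user log, DAGMan ends up tracking 1000 log files for this DAG.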

I figure I must be doing something wrong with a configuration setting on my DAG submission, or that there is some limit on how big a DAG can or should be. Should I just split the DAG into smaller groups of jobs in the future?

Appreciate any suggestions, again, as always :),
Steve

> Date: Tue, 27 Oct 2009 13:46:09 -0500
> From: wenger@xxxxxxxxxxx
> To: condor-users@xxxxxxxxxxx
> Subject: Re: [Condor-users] Condor DAG spinning
>
> On Tue, 27 Oct 2009, Steve Shaw wrote:
>
> > I've got an issue where, with a sufficient number of jobs in a dag, the
> > DAGMan continues to crash yet stays listed as running. There are 1900
> > jobs in the dag and about 500 complete successfully. In the end, the
> > only thing I have in my queue is the dag itself.
> >
> > 10/27 10:17:20 Parsing C:\temp\condor\condor_68353.dag ...
> > 10/27 10:17:21 Dag contains 1903 total jobs
> > 10/27 10:17:21 Lock file C:\temp\condor\condor_68353.dag.lock detected,
> > 10/27 10:17:21 Duplicate DAGMan PID 5708 is no longer alive; this DAGMan should continue.
> > 10/27 10:17:21 Sleeping for 12 seconds to ensure ProcessId uniqueness
> > 10/27 10:17:33 WARNING: ProcessId not confirmed unique
> > 10/27 10:17:33 Bootstrapping...
> > 10/27 10:17:33 Number of pre-completed nodes: 0
> > 10/27 10:17:33 Running in RECOVERY mode...
> > 10/27 10:17:36 ******************************************************
> > 10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP
> > 10/27 10:17:36 ** C:\condor\bin\condor_dagman.exe
> > 10/27 10:17:36 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
> > 10/27 10:17:36 ** $CondorPlatform: INTEL-WINNT50 $
> > 10/27 10:17:36 ** PID = 1948
> > 10/27 10:17:37 ** Log last touched 10/27 09:17:34
> > 10/27 10:17:37 ******************************************************
> > 10/27 10:17:37 Using config source: C:\condor\condor_config
> > 10/27 10:17:37 Using local config sources:
> > 10/27 10:17:37 C:\condor\condor_config.local
> > 10/27 10:17:37 DaemonCore: Command Socket at <10.10.242.54:1795>
> > 10/27 10:17:37 DAGMAN_SUBMIT_DELAY setting: 0
> > 10/27 10:17:37 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
> > 10/27 10:17:37 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
> > 10/27 10:17:37 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
> > 10/27 10:17:37 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
> > 10/27 10:17:37 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
> > 10/27 10:17:37 DAGMAN_RETRY_NODE_FIRST setting: 0
> > 10/27 10:17:37 DAGMAN_MAX_JOBS_IDLE setting: 0
> > 10/27 10:17:37 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
> > 10/27 10:17:37 DAGMAN_MUNGE_NODE_NAMES setting: 1
> > 10/27 10:17:37 DAGMAN_DELETE_OLD_LOGS setting: 1
> > 10/27 10:17:37 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
> > 10/27 10:17:37 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
> > 10/27 10:17:37 DAGMAN_ABORT_DUPLICATES setting: 1
> > 10/27 10:17:37 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
> > 10/27 10:17:37 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
> > 10/27 10:17:37 argv[0] == "condor_scheduniv_exec.4250.0"
> > 10/27 10:17:37 argv[1] == "-Debug"
> > 10/27 10:17:37 argv[2] == "3"
> > 10/27 10:17:37 argv[3] == "-Lockfile"
> > 10/27 10:17:37 argv[4] == "C:\temp\condor\condor_68353.dag.lock"
> > 10/27 10:17:37 argv[5] == "-Condorlog"
> > 10/27 10:17:37 argv[6] == "C:\temp\condor\condor_49152.log"
> > 10/27 10:17:37 argv[7] == "-Dag"
> > 10/27 10:17:37 argv[8] == "C:\temp\condor\condor_68353.dag"
> > 10/27 10:17:37 argv[9] == "-Rescue"
> > 10/27 10:17:37 argv[10] == "C:\temp\condor\condor_68353.dag.rescue"
> > 10/27 10:17:37 DAG Lockfile will be written to C:\temp\condor\condor_68353.dag.lock
> > 10/27 10:17:37 DAG Input file is C:\temp\condor\condor_68353.dag
> > 10/27 10:17:37 Rescue DAG will be written to C:\temp\condor\condor_68353.dag.rescue
> >
> > ... then it lists all of the log files:
> > 10/27 10:17:38 C:\temp\condor\condor_49152.log (Condor)
> > 10/27 10:17:38 C:\temp\condor\condor_81924.log (Condor)
> > ...
> >
> > Then it repeats all of this seconds later ... this log grew huge! :)
> >
> > Should I increase the maxjobs in the condor dag submission to get this
> > rolling? Sorry to ask such a general question, but I'm hoping somebody
> > can explain to me what's going on in this case or cases like this?
> >
> > (This is with condor 7.0.4, so I'm back a few minor releases -- maybe
> > it's time to upgrade).
>
> Hmm -- 7.0.4 *is* pretty old. I'd say the first thing to try is
> installing newer condor_dagman and condor_submit_dag binaries. You can
> just upgrade those two binaries without upgrading the rest of your Condor
> installation if you want to.
>
> I'd recommend going to either 7.2.4 (if you want to stay with a stable
> release) or 7.3.1. (7.3.2 has a problem with rescue DAGs, which has been
> fixed for the upcoming 7.4.0.)
>
> If you still get the problem with a newer DAGMan version, please let us
> know and we'll look into things further.
>
> Kent Wenger
> Condor Team
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
