
[Condor-users] new DAGman problems in Condor 6.8.4



Hi,

We've been using Condor very successfully to run recursive DAGMan jobs for some time,
but since moving to Condor 6.8.4 I've noticed that rogue condor_dagman processes
are appearing which seem to carry on running indefinitely. These can quickly swamp the
server if they aren't cleared up.
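In case it's useful, this is roughly how I spot the leftover processes at the moment.
It's just an ad-hoc Perl check, not anything definitive: the ps options are what work
on our Solaris box, and the argv[0] pattern matches what DAGMan logs (see the
*.dagman.out excerpt further down):

#!/usr/bin/perl
# rough check for condor_dagman processes whose job is no longer queued;
# DAGMan runs as condor_scheduniv_exec.<cluster>.<proc>
use strict;
use warnings;

# cluster ids of everything still in the queue
my %queued = map { chomp; ($_ => 1) } `condor_q -format "%d\n" ClusterId`;

for my $line (`ps -eo pid,args`) {
    next unless $line =~ /condor_scheduniv_exec\.(\d+)\.\d+/;
    print "possibly rogue: $line" unless $queued{$1};
}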

The *.dag files have the form:

Job A M1.sub
Script POST A  ./resubmit.pl

where the resubmit.pl script resubmits the Condor job if it hasn't
converged and is still within the cycle limit.
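For anyone wondering, the logic in the script is roughly the following. This is a
simplified sketch only: the convergence test, the counter file and the DAG name here
are stand-ins for what we actually use:

#!/usr/bin/perl
# resubmit.pl (sketch) -- POST script: resubmit the DAG if the job
# hasn't converged and the cycle limit hasn't been reached
use strict;
use warnings;

my $max_cycles = 50;    # illustrative cycle limit

# cycle counter kept between submissions
open my $fh, '<', 'cycle.count' or die "cycle.count: $!";
chomp(my $cycle = <$fh>);
close $fh;

# done -- exit 0 so DAGMan marks the node successful
exit 0 if converged() or $cycle >= $max_cycles;

open $fh, '>', 'cycle.count' or die "cycle.count: $!";
print $fh $cycle + 1, "\n";
close $fh;

# kick off another pass of the same DAG
system('condor_submit_dag', '-f', 'M1.dag') == 0
    or die "condor_submit_dag failed: $?";
exit 0;

sub converged {
    # stand-in: in reality we parse the application output
    open my $out, '<', 'M1.out' or return 0;
    return scalar grep { /CONVERGED/ } <$out>;
}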
The problem seems to occur particularly after a large number of
resubmissions (> 30). The *.dagman.out file contains the following, which may be related:

7/23 10:19:10 ******************************************************
7/23 10:19:10 ** condor_scheduniv_exec.154891.0 (CONDOR_DAGMAN) STARTING UP
7/23 10:19:10 ** /opt1/condor/bin/condor_dagman
7/23 10:19:10 ** $CondorVersion: 6.8.4 Feb  1 2007 $
7/23 10:19:10 ** $CondorPlatform: SUN4X-SOLARIS29 $
7/23 10:19:10 ** PID = 6054
7/23 10:19:10 ** Log last touched 7/23 10:19:09
7/23 10:19:10 ******************************************************
7/23 10:19:10 Using config source: /etc/condor/condor_config
7/23 10:19:10 Using local config sources: 
7/23 10:19:10    /opt1/condor/home/condor_config.local
7/23 10:19:10 DaemonCore: Command Socket at <138.253.xxx.xxx:39834>
7/23 10:19:10 DAGMAN_SUBMIT_DELAY setting: 0
7/23 10:19:10 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
7/23 10:19:10 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
7/23 10:19:10 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
7/23 10:19:10 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
7/23 10:19:10 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
7/23 10:19:10 DAGMAN_RETRY_NODE_FIRST setting: 0
7/23 10:19:10 DAGMAN_MAX_JOBS_IDLE setting: 0
7/23 10:19:10 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
7/23 10:19:10 DAGMAN_MUNGE_NODE_NAMES setting: 1
7/23 10:19:10 DAGMAN_DELETE_OLD_LOGS setting: 1
7/23 10:19:10 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
7/23 10:19:10 DAGMAN_ABORT_DUPLICATES setting: 0
7/23 10:19:10 argv[0] == "condor_scheduniv_exec.154891.0"
7/23 10:19:10 argv[1] == "-Debug"
7/23 10:19:10 argv[2] == "3"
7/23 10:19:10 argv[3] == "-Lockfile"
7/23 10:19:10 argv[4] == "./M1.dag.lock"
7/23 10:19:10 argv[5] == "-Condorlog"
7/23 10:19:10 argv[6] == "/opt2/condor_data/gamess/jobspool/short_M1/logs/M1.log"
7/23 10:19:10 argv[7] == "-Dag"
7/23 10:19:10 argv[8] == "./M1.dag"
7/23 10:19:10 argv[9] == "-Rescue"
7/23 10:19:10 argv[10] == "./M1.dag.rescue"
7/23 10:19:10 DAG Lockfile will be written to ./M1.dag.lock
7/23 10:19:10 DAG Input file is ./M1.dag
7/23 10:19:10 Rescue DAG will be written to ./M1.dag.rescue
7/23 10:19:10 All DAG node user log files:
7/23 10:19:10   /opt2/condor_data/gamess/jobspool/short_M1/logs/M1.log (Condor)
7/23 10:19:10 Parsing ./M1.dag ...
7/23 10:19:10 Dag contains 1 total jobs
7/23 10:19:10 Lock file ./M1.dag.lock detected, 
7/23 10:19:10 Bootstrapping...
7/23 10:19:10 Number of pre-completed nodes: 0
7/23 10:19:10 Running in RECOVERY mode...
7/23 10:19:10 Event: ULOG_SUBMIT for Condor Node A (154792.0)
7/23 10:19:10 Number of idle job procs: 1
7/23 10:19:10 Event: ULOG_EXECUTE for Condor Node A (154792.0)
7/23 10:19:10 Number of idle job procs: 0
7/23 10:19:10 Event: ULOG_IMAGE_SIZE for Condor Node A (154792.0)
7/23 10:19:10 Event: ULOG_JOB_TERMINATED for Condor Node A (154792.0)
7/23 10:19:10 Node A job proc (154792.0) completed successfully.
7/23 10:19:10 Node A job completed
7/23 10:19:10 Number of idle job procs: 0
7/23 10:19:10     ------------------------------
7/23 10:19:10        Condor Recovery Complete
7/23 10:19:10     ------------------------------
7/23 10:19:10 Running POST script of Node A...
7/23 10:19:10 ERROR "Create_Process: More ancestor environment IDs found than PIDENVID_MAX which is currently 32. Programmer Error." at line 6466 in file daemon_core.C

The "Env" job classad also seems to contain a huge string, so possibly this is an array overflow?
If each recursive resubmission adds another ancestor ID to the job's environment, that would tie
in with the problem only appearing after more than 30 cycles, given that PIDENVID_MAX is 32.
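If it's any use, this is the quick check I used to count the ancestor entries in the
queued job's Env attribute. The job id is whatever condor_q shows for the stuck dagman
job, and I'm assuming the entries are the _CONDOR_ANCESTOR_<pid> variables that Condor
uses for daemon process tracking:

#!/usr/bin/perl
# count _CONDOR_ANCESTOR_ entries in a job's Env classad attribute
use strict;
use warnings;

my $jobid = shift or die "usage: $0 <cluster[.proc]>\n";
my ($env) = grep { /^Env(ironment)?\s*=/ } `condor_q -long $jobid`;
die "no Env attribute found for job $jobid\n" unless defined $env;

my $count = () = $env =~ /_CONDOR_ANCESTOR_\d+/g;
print "$count ancestor entries in Env\n";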

I don't know if this is related, but I'm also seeing a large number of communication failures
in the log files, e.g. in *.dagman.out:

7/20 16:46:31 Registering condor_event_timer...
7/20 16:46:32 attempt to connect to <138.253.100.178:58098> failed: Connection refused (connect errno = 146).
7/20 16:46:32 ERROR: SECMAN:2003:TCP auth connection to <138.253.xxx.xxx:58098> failed

where 138.253.xxx.xxx is the IP address of the submit/central manager host. This seems to cause
the *dagman* job to be evicted and re-run, according to the *.dagman.log file:

001 (154891.000.000) 07/23 10:19:10 Job executing on host: <138.253.xxx.xxx:63519>
...
004 (154891.000.000) 07/23 10:19:10 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job

This doesn't in itself seem to be fatal, but it does lead to long delays before jobs
start running, and the log files get very big very quickly!

Has anyone else seen this? Any suggestions as to the cause or solution would be
most appreciated.

thanks,

-ian.

------------------------------
Dr Ian C. Smith
e-Science Team,
University of Liverpool,
Computing Services Department