
Re: [Condor-users] DAGman duplicating jobs on schedd restart



Hey Ian,

So from what I can see it's like you say: it's as if DAGMan can't tell that the jobs have completed successfully, even though the job logs do indicate completion. I'm wondering, do the jobs all have to write to the same log file? Currently each job logs to its own log file, and all of the logs, for both the jobs and the dagman, go to the same directory.
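To make that concrete, each node's .sub currently has its own log line, roughly like this (the log file names here are just illustrative):

# haircache.sub (excerpt)
log = /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/haircache.log

If recovery needs a single shared user log, I could just as easily point every node's submit file at one common file instead:

# same excerpt, every node sharing one user log
log = /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.nodes.log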
I've included snippets from a dagman.out that shows the state of things before and after the schedd restart.

Here's what the .dag looks like (there are PRE/POST SCRIPTs for each job but I left them out):

JOB SHOTVALIDATE shotvalidate.sub
JOB HAIRCACHE haircache.sub
JOB TCBUILD tcbuild.sub
JOB TCRENDER_CHR_ALL chr_all_rp_tcrender.sub
JOB TCRENDER_BG_ALL bg_all_rp_tcrender.sub
JOB TCCMPRENDER tccomp.sub
PARENT SHOTVALIDATE CHILD HAIRCACHE
PARENT HAIRCACHE CHILD TCBUILD
PARENT TCBUILD CHILD TCRENDER_CHR_ALL
PARENT TCBUILD CHILD TCRENDER_BG_ALL
PARENT TCRENDER_CHR_ALL TCRENDER_BG_ALL CHILD TCCMPRENDER


From the dagman.out I can see that SHOTVALIDATE, HAIRCACHE and TCBUILD have completed successfully and the TCRENDER_CHR_ALL and TCRENDER_BG_ALL jobs have been queued:

10/07 16:02:33 Number of idle job procs: 2
10/07 16:02:33 Event: ULOG_SUBMIT for Condor Node TCRENDER_CHR_ALL (147821.13)
10/07 16:02:33 Number of idle job procs: 2
10/07 16:02:33 Of 6 nodes total:
10/07 16:02:33  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
10/07 16:02:33   ===     ===      ===     ===     ===        ===      ===
10/07 16:02:33     3       0        2       0       0          1        0
10/07 16:12:33 600 seconds since last log event
10/07 16:12:33 Pending DAG nodes:
10/07 16:12:33   Node TCRENDER_CHR_ALL, Condor ID 147821, status STATUS_SUBMITTED
10/07 16:12:33   Node TCRENDER_BG_ALL, Condor ID 147820, status STATUS_SUBMITTED



Sometime later the schedd restarts and the DAG starts back at HAIRCACHE, even though that node had already completed. Meanwhile TCRENDER_CHR_ALL and TCRENDER_BG_ALL are still in the queue. Below is the continuation of the dagman.out:



10/08 17:26:00 ******************************************************
10/08 17:26:00 ** condor_scheduniv_exec.141951.0 (CONDOR_DAGMAN) STARTING UP
10/08 17:26:00 ** /opt/condor-7.4.4/bin/condor_dagman
10/08 17:26:00 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
10/08 17:26:00 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
10/08 17:26:00 ** $CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $
10/08 17:26:00 ** $CondorPlatform: I386-LINUX_RHEL5 $
10/08 17:26:00 ** PID = 3418
10/08 17:26:00 ** Log last touched 10/8 17:22:12
10/08 17:26:00 ******************************************************
10/08 17:26:00 Using config source: /etc/condor/condor_config
10/08 17:26:00 Using local config sources: 
10/08 17:26:00    /opt/condor.local/condor_config.local
10/08 17:26:00    /opt/condor/sbin/cycle_cache_config/cycle_cache_config.py /opt/condor.local/condor_config.cached 30 30 http://192.168.0.78:8080/condor/assigned_template/cycleserver01.marchentertainment.com |
10/08 17:26:00 DaemonCore: Command Socket at <192.168.0.78:48894>
10/08 17:26:00 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
10/08 17:26:00 DAGMAN_DEBUG_CACHE_ENABLE setting: False
10/08 17:26:00 DAGMAN_SUBMIT_DELAY setting: 0
10/08 17:26:00 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/08 17:26:00 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/08 17:26:00 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/08 17:26:00 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
10/08 17:26:00 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/08 17:26:00 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/08 17:26:00 DAGMAN_RETRY_NODE_FIRST setting: 0
10/08 17:26:00 DAGMAN_MAX_JOBS_IDLE setting: 0
10/08 17:26:00 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/08 17:26:00 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/08 17:26:00 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/08 17:26:00 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/08 17:26:00 DAGMAN_ABORT_DUPLICATES setting: 1
10/08 17:26:00 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/08 17:26:00 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/08 17:26:00 DAGMAN_AUTO_RESCUE setting: 1
10/08 17:26:00 DAGMAN_MAX_RESCUE_NUM setting: 100
10/08 17:26:00 DAGMAN_DEFAULT_NODE_LOG setting: null
10/08 17:26:00 ALL_DEBUG setting: 
10/08 17:26:00 DAGMAN_DEBUG setting: 
10/08 17:26:00 argv[0] == "condor_scheduniv_exec.141951.0"
10/08 17:26:00 argv[1] == "-Debug"
10/08 17:26:00 argv[2] == "3"
10/08 17:26:00 argv[3] == "-Lockfile"
10/08 17:26:00 argv[4] == "/mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag.lock"
10/08 17:26:00 argv[5] == "-AutoRescue"
10/08 17:26:00 argv[6] == "1"
10/08 17:26:00 argv[7] == "-DoRescueFrom"
10/08 17:26:00 argv[8] == "0"
10/08 17:26:00 argv[9] == "-Dag"
10/08 17:26:00 argv[10] == "/mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag"
10/08 17:26:00 argv[11] == "-CsdVersion"
10/08 17:26:00 argv[12] == "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
10/08 17:26:00 Default node log file is: </mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag.nodes.log>
10/08 17:26:00 DAG Lockfile will be written to /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag.lock
10/08 17:26:00 DAG Input file is /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag
10/08 17:26:00 Parsing 1 dagfiles
10/08 17:26:00 Parsing /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag ...
10/08 17:26:00 Dag contains 6 total jobs
10/08 17:26:00 Lock file /mnt/render/jobs/job_119422_animgatebegin_yko_119_3345_j0HKqu/animgatebegin.dag.lock detected, 
10/08 17:26:00 Duplicate DAGMan PID 6666 is no longer alive; this DAGMan should continue.
10/08 17:26:00 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/08 17:26:12 Bootstrapping...
10/08 17:26:12 Number of pre-completed nodes: 0
10/08 17:26:12 Running in RECOVERY mode... >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
10/08 17:26:12 Currently monitoring 1 Condor log file(s)
10/08 17:26:12 Event: ULOG_SUBMIT for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:12 Number of idle job procs: 1
10/08 17:26:12 Event: ULOG_EXECUTE for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:12 Number of idle job procs: 0
10/08 17:26:12 Event: ULOG_IMAGE_SIZE for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:12 Event: ULOG_IMAGE_SIZE for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:12 Event: ULOG_JOB_TERMINATED for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:12 Node SHOTVALIDATE job proc (141955.0) completed successfully.
10/08 17:26:12 Node SHOTVALIDATE job completed
10/08 17:26:12 Number of idle job procs: 0
10/08 17:26:12     ------------------------------
10/08 17:26:12        Condor Recovery Complete
10/08 17:26:12     ------------------------------
10/08 17:26:12 Running POST script of Node SHOTVALIDATE...
10/08 17:26:12 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
10/08 17:26:12 Of 6 nodes total:
10/08 17:26:12  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
10/08 17:26:12   ===     ===      ===     ===     ===        ===      ===
10/08 17:26:12     0       0        0       1       0          5        0
10/08 17:26:12 Registering condor_event_timer...
10/08 17:26:12 fsync() failed in WriteUserLog::writeEvent - errno 5 (Input/output error)
10/08 17:26:13 Currently monitoring 1 Condor log file(s)
10/08 17:26:13 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node SHOTVALIDATE (141955.0)
10/08 17:26:13 POST Script of Node SHOTVALIDATE completed successfully.
10/08 17:26:13 Running PRE script of Node HAIRCACHE...
10/08 17:26:13 Of 6 nodes total:
10/08 17:26:13  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
10/08 17:26:13   ===     ===      ===     ===     ===        ===      ===
10/08 17:26:13     1       1        0       0       0          4        0
10/08 17:26:23 PRE Script of Node HAIRCACHE completed successfully.
10/08 17:26:28 Sleeping for one second for log file consistency
10/08 17:26:29 Submitting Condor Node HAIRCACHE job(s)...
10/08 17:26:29 submitting: condor_submit -a dag_node_name' '=' 'HAIRCACHE -a +DAGManJobId' '=' '141951 -a DAGManJobId' '=' '141951 -a submit_event_notes' '=' 'DAG' 'Node:' 'HAIRCACHE -a +DAGParentNodeNames' '=' '"SHOTVALIDATE" haircache.sub
10/08 17:26:29 From submit: Submitting job(s).
10/08 17:26:29 From submit: Logging submit event(s).
10/08 17:26:29 From submit: 1 job(s) submitted to cluster 149643.
10/08 17:26:29 assigned Condor ID (149643.0)
10/08 17:26:29 Just submitted 1 job this cycle...
10/08 17:26:29 Currently monitoring 1 Condor log file(s)
10/08 17:26:29 Event: ULOG_SUBMIT for Condor Node HAIRCACHE (149643.0)
10/08 17:26:29 Number of idle job procs: 1
10/08 17:26:29 Of 6 nodes total:
10/08 17:26:29  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
10/08 17:26:29   ===     ===      ===     ===     ===        ===      ===
10/08 17:26:29     1       0        1       0       0          4        0
10/08 17:36:29 600 seconds since last log event
10/08 17:36:29 Pending DAG nodes:
10/08 17:36:29   Node HAIRCACHE, Condor ID 149643, status STATUS_SUBMITTED
10/08 17:46:29 1200 seconds since last log event
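
In case it's useful, here's roughly how I've been spotting and cleaning up the leftover jobs by hand in the meantime (just a sketch; 141951 is the DAGMan cluster, and DAGManJobId is the attribute the node submits carry, as in the condor_submit line above):

# list every node job that claims to belong to this DAGMan instance
condor_q -constraint 'DAGManJobId == 141951'

# then remove whichever clusters the restarted DAG no longer tracks,
# e.g. the TCRENDER jobs (147820/147821) that were queued before the restart
condor_rm 147820 147821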

Thanks for your help,
Chris.



On 3 November 2011 12:41, Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx> wrote:
On Thursday, 3 November, 2011 at 9:38 AM, Christopher Martin wrote:
Whenever the schedd restarts we're getting duplicate jobs showing up in the queue. For example if we have a DAG like the following:

JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub

PARENT A CHILD B
PARENT B CHILD C D

Before the schedd restart, jobs A and B have completed and jobs C and D are queued. After the schedd restarts, C and D are still queued but B has been added back into the queue as well. Is this a peculiarity of the DAG rescue, or could it be a conflict with the dagman logs?
Hi Chris,

What does DAGMan say? Anything in the dag log that might be helpful?

I suspect the most likely cause for the resubmission of B is that DAGMan can't determine that it completed successfully. Is there anything in the job log that indicates that B completed?
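If B did complete, there should be a terminated event for it in whatever log file its submit description points at; from memory the entry looks something like this (cluster id and times made up):

005 (1234.000.000) 11/03 09:00:00 Job terminated.
    (1) Normal termination (return value 0)

If that event is missing, or if it ends up in a log file DAGMan isn't watching, that would explain why B gets submitted again.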

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing
