Re: [Condor-users] DAGman duplicating jobs on schedd restart
- Date: Thu, 3 Nov 2011 16:09:04 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAGman duplicating jobs on schedd restart
On Thu, 3 Nov 2011, Christopher Martin wrote:
Whenever the schedd restarts we're getting duplicate jobs showing up in the
queue. For example if we have a DAG like the following:
PARENT A CHILD B
PARENT B CHILD C D
Before the schedd restart, jobs A and B have completed and jobs C and D are
queued. After the schedd restarts we then have C and D still queued but B
has been added back into the queue as well. Is this a peculiarity of the
DAG rescue, or could it be a conflict with the dagman logs?
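(For context, a complete DAG file for that structure would also need JOB lines; the submit-file names below are hypothetical, just to make the snippet self-contained:)

```
# A -> B -> {C, D}; submit-file names are made up for illustration
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B
PARENT B CHILD C D
```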
Okay, I now have a pretty good idea what's happening. The big clue is
this error message in the dagman.out file:
10/04 01:47:11 fsync() failed in WriteUserLog::writeEvent - errno 5
(I'm thinking that this might well be related to the "too many open
files" error you also reported.)
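(As a quick sanity check, a minimal Python sketch, not Condor code: errno 5 is EIO, the kernel's generic input/output error, which is why I say the root problem is below DAGMan.)

```python
import errno
import os

# errno 5 is EIO ("Input/output error") on Linux -- the errno that
# DAGMan reported when fsync() failed in WriteUserLog::writeEvent.
print(errno.EIO)               # -> 5
print(errno.errorcode[5])      # -> 'EIO'
print(os.strerror(errno.EIO))  # the human-readable message
```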
Anyhow, if you look at the relevant node job log files, the post script
terminated events do not appear in the files, even though the POST scripts
completed (according to the dagman.out file).
So it looks like what happens is this:
1) The job A Condor job finishes (and the *Condor job* terminated event is
written to the log file).
2) DAGMan runs the job A POST script, and attempts to write the post
script terminated event to the log file. Because the same process is
writing and reading the log file, DAGMan successfully reads that event,
even though it hasn't been fsynced.
3) Execution of the DAG continues; job B finishes, and job C and job D
are submitted.
4) The schedd restarts, which also restarts DAGMan.
5) The new DAGMan process re-reads the log files to get its internal state
consistent with the Condor pool.
6) Because the job A post script terminated event was not successfully
fsynced, this time the DAGMan process does not read that event. It
therefore concludes that the POST script of job A did *not* complete,
which means that it doesn't look for any events in job B's log file,
because job B should not have been submitted.
7) DAGMan now continues with the DAG, running the job A POST script; when
that finishes, it submits job B's Condor job, since that's logically the
next step in the DAG.
8) In the meantime, since job C and job D were in the queue when the
schedd restarted, the schedd restarts them without DAGMan taking any
action. Therefore, you end up with job B, job C, and job D all in the
queue at the same time.
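(The core of steps 2-6 can be sketched with a toy simulation -- plain Python, not Condor code; the BufferedLog class and the event names are invented for illustration. An event that was written but never fsynced is visible to the process that wrote it, but gone after a restart:)

```python
import os
import tempfile

class BufferedLog:
    """Toy user-log writer: events sit in a buffer until fsync."""
    def __init__(self, path):
        self.path = path
        self.buffer = []            # events written but not yet durable

    def write_event(self, event):
        self.buffer.append(event)   # the write itself succeeds in memory

    def fsync(self):
        # Only here do events actually reach disk.
        with open(self.path, "a") as f:
            f.writelines(e + "\n" for e in self.buffer)
            f.flush()
            os.fsync(f.fileno())
        self.buffer = []

    def read_events(self):
        # The *writing* process sees durable events plus its own buffer.
        return self._durable() + self.buffer

    def _durable(self):
        try:
            with open(self.path) as f:
                return [line.rstrip("\n") for line in f]
        except FileNotFoundError:
            return []

path = os.path.join(tempfile.mkdtemp(), "job_a.log")
log = BufferedLog(path)
log.write_event("JOB_A_TERMINATED")
log.fsync()                                # this event made it to disk
log.write_event("POST_SCRIPT_TERMINATED")  # fsync fails (EIO): never durable

# Step 2: the original DAGMan process reads its own unflushed event.
print(log.read_events())   # -> ['JOB_A_TERMINATED', 'POST_SCRIPT_TERMINATED']

# Steps 4-6: after the restart, a fresh reader sees only durable events.
recovered = BufferedLog(path)
print(recovered.read_events())  # -> ['JOB_A_TERMINATED']
```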
(BTW, if you had all of your jobs logging to the same file, things would
have ended up differently in this case. DAGMan would have seen the
submitted event for job B (but, of course, no post script terminated event
for job A), and it would have reported a fatal error that the DAG
semantics were violated.)
Hopefully that all makes sense. In summary, I think the root cause is
whatever is causing the input/output error when DAGMan is attempting to
fsync the log file just after writing the post script terminated event.