Re: [Condor-users] DAGman duplicating jobs on schedd restart
- Date: Thu, 3 Nov 2011 16:09:04 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAGman duplicating jobs on schedd restart
On Thu, 3 Nov 2011, Christopher Martin wrote:
Whenever the schedd restarts we're getting duplicate jobs showing up in the
queue. For example if we have a DAG like the following:
PARENT A CHILD B
PARENT B CHILD C D
Before the schedd restart, jobs A and B have completed and jobs C and D are
queued. After the schedd restarts we then have C and D still queued but B
has been added back into the queue as well. Is this a peculiarity of the
DAG rescue, or could it be a conflict with the dagman logs?
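(For context, a complete DAG file for that structure would also need JOB lines; the submit-file names below are hypothetical, just to make the snippet self-contained:)

```
# A -> B -> {C, D}; submit-file names are made up for illustration
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B
PARENT B CHILD C D
```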
Okay, I now have a pretty good idea what's happening. The big clue is
this error message in the dagman.out file:
10/04 01:47:11 fsync() failed in WriteUserLog::writeEvent - errno 5
(I'm thinking that this might well be related to the "too many open
files" error you also reported.)
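(As a quick sanity check, a minimal Python sketch, not Condor code: errno 5 is EIO, the kernel's generic input/output error, which is why I say the root problem is below DAGMan.)

```python
import errno
import os

# errno 5 is EIO ("Input/output error") on Linux -- the errno that
# DAGMan reported when fsync() failed in WriteUserLog::writeEvent.
print(errno.EIO)               # -> 5
print(errno.errorcode[5])      # -> 'EIO'
print(os.strerror(errno.EIO))  # the human-readable message
```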
Anyhow, if you look at the relevant node job log files, the post script
terminated events do not appear in the files, even though the POST scripts
completed (according to the dagman.out file).
So it looks like what happens is this:
1) The job A Condor job finishes (and the *Condor job* terminated event is
written to the log file).
2) DAGMan runs the job A POST script, and attempts to write the post
script terminated event to the log file. Because the same process is
writing and reading the log file, DAGMan successfully reads that event,
even though it hasn't been fsynced.
3) Execution of the DAG continues; job B finishes, and job C and job D
are submitted.
4) The schedd restarts, which also restarts DAGMan.
5) The new DAGMan process re-reads the log files to get its internal state
consistent with the Condor pool.
6) Because the job A post script terminated event was not successfully
fsynced, this time the DAGMan process does not read that event. It
therefore concludes that the POST script of job A did *not* complete,
which means that it doesn't look for any events in job B's log file,
because job B should not have been submitted.
7) DAGMan now continues with the DAG, running the job A POST script; when
that finishes, it submits job B's Condor job, since that's logically the
next step in the DAG.
8) In the meantime, since job C and job D were in the queue when the
schedd restarted, the schedd restarts them without DAGMan taking any
action. Therefore, you end up with job B, job C, and job D all in the
queue at the same time.
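(The core of steps 2-6 can be sketched with a toy simulation -- plain Python, not Condor code; the BufferedLog class and the event names are invented for illustration. An event that was written but never fsynced is visible to the process that wrote it, but gone after a restart:)

```python
import os
import tempfile

class BufferedLog:
    """Toy user-log writer: events sit in a buffer until fsync."""
    def __init__(self, path):
        self.path = path
        self.buffer = []            # events written but not yet durable

    def write_event(self, event):
        self.buffer.append(event)   # the write itself succeeds in memory

    def fsync(self):
        # Only here do events actually reach disk.
        with open(self.path, "a") as f:
            f.writelines(e + "\n" for e in self.buffer)
            f.flush()
            os.fsync(f.fileno())
        self.buffer = []

    def read_events(self):
        # The *writing* process sees durable events plus its own buffer.
        return self._durable() + self.buffer

    def _durable(self):
        try:
            with open(self.path) as f:
                return [line.rstrip("\n") for line in f]
        except FileNotFoundError:
            return []

path = os.path.join(tempfile.mkdtemp(), "job_a.log")
log = BufferedLog(path)
log.write_event("JOB_A_TERMINATED")
log.fsync()                                # this event made it to disk
log.write_event("POST_SCRIPT_TERMINATED")  # fsync fails (EIO): never durable

# Step 2: the original DAGMan process reads its own unflushed event.
print(log.read_events())   # -> ['JOB_A_TERMINATED', 'POST_SCRIPT_TERMINATED']

# Steps 4-6: after the restart, a fresh reader sees only durable events.
recovered = BufferedLog(path)
print(recovered.read_events())  # -> ['JOB_A_TERMINATED']
```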
(BTW, if you had all of your jobs logging to the same file, things would
have ended up differently in this case. DAGMan would have seen the
submitted event for job B (but, of course, no post script terminated event
for job A), and it would have reported a fatal error that the DAG
semantics were violated.)
Hopefully that all makes sense. In summary, I think the root cause is
whatever is causing the input/output error when DAGMan is attempting to
fsync the log file just after writing the post script terminated event.