Re: [Condor-users] DAGman duplicating jobs on schedd restart
- Date: Thu, 3 Nov 2011 17:41:08 -0400
- From: Christopher Martin <cmartin@xxxxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] DAGman duplicating jobs on schedd restart
This is great, Kent, thank you. We will probably be moving our logging off of the Windows share and instead logging on the execute node, then using the should_transfer_files option to transfer the logs back to the submit/scheduler machine. We've been having file-locking issues, which are most likely related to this. Hopefully this will sort itself out with the logging changes.
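For what it's worth, a minimal sketch of the submit-file settings involved (filenames are hypothetical). With these, files the job writes into its scratch directory on the execute node are transferred back to the submit machine when the job exits:

```
# Hypothetical submit file fragment: run with file transfer enabled so
# logs written by the job on the execute node come back to the submit machine.
executable              = myjob.exe
output                  = myjob.out
error                   = myjob.err
log                     = myjob.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```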
Again, thanks for all your help.
On 3 November 2011 17:09, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
On Thu, 3 Nov 2011, Christopher Martin wrote:
> Whenever the schedd restarts we're getting duplicate jobs showing up in the
> queue. For example, if we have a DAG like the following:
>
> PARENT A CHILD B
> PARENT B CHILD C D
>
> Before the schedd restart, jobs A and B have completed and jobs C and D are
> queued. After the schedd restarts we then have C and D still queued, but B
> has been added back into the queue as well. Is this a peculiarity of the
> DAG rescue, or perhaps it could be a conflict with the DAGMan logs?

Okay, I now have a pretty good idea what's happening. The big clue is this error message in the dagman.out file:

10/04 01:47:11 fsync() failed in WriteUserLog::writeEvent - errno 5
(I'm thinking that this might well be related to the "too many open files" error you also reported.)
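For context, errno 5 is EIO ("Input/output error"): a write() can appear to succeed while the data sits in the OS page cache, and the underlying failure only surfaces when fsync() tries to force it to disk. Here is a minimal sketch of that write-then-fsync pattern (a hypothetical helper, not DAGMan's actual code):

```python
import errno
import os

def write_event_durably(path, event_text):
    """Append an event to a log file and fsync it.

    Hypothetical helper illustrating the WriteUserLog pattern: the
    event is only durable once fsync() returns successfully.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, event_text.encode())
        try:
            # On a flaky filesystem this is where an I/O error surfaces,
            # even though the write() above already "succeeded".
            os.fsync(fd)
        except OSError as e:
            if e.errno == errno.EIO:  # errno 5
                print("fsync() failed in writeEvent - errno 5")
            raise
    finally:
        os.close(fd)
```

On a network filesystem such as the Windows share mentioned above, the fsync() call is exactly where an underlying I/O problem would show up.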
Anyhow, if you look at the relevant node job log files, the post script terminated events do not appear in the files, even though the POST scripts
completed (according to the dagman.out file).
So it looks like what happens is this:
1) The job A Condor job finishes (and the *Condor job* terminated event is correctly logged).
2) DAGMan runs the job A POST script, and attempts to write the post script terminated event to the log file. Because the same process is writing and reading the log file, DAGMan successfully reads that event, even though it hasn't been fsynced.
3) Execution of the DAG continues; job B is finished, and job C and job D are submitted.
4) The schedd restarts, which also restarts DAGMan.
5) The new DAGMan process re-reads the log files to get its internal state consistent with the Condor pool.
6) Because the job A post script terminated event was not successfully fsynced, this time the DAGMan process does not read that event. It therefore concludes that the POST script of job A did *not* complete, which means that it doesn't look for any events in job B's log file, because job B should not have been submitted.
7) DAGMan now continues with the DAG, running the job A POST script; when that finishes, it submits job B's Condor job, since that's logically the next step in the DAG.
8) In the meantime, since jobs C and D were in the queue when the schedd restarted, the schedd restarts them without DAGMan taking any action. Therefore, you end up with jobs B, C, and D all in the queue at the same time.
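To make the failure mode concrete, here is a toy simulation (not DAGMan code) of steps 2, 5, and 6: the writing process reads its own un-fsynced event back successfully, but a reader that starts after the unflushed data is lost does not see it:

```python
class FlakyLog:
    """Simulates a log file where fsync fails, so buffered data is
    lost when the writing process goes away (e.g. a schedd restart)."""

    def __init__(self):
        self.on_disk = []    # events that reached stable storage
        self.buffered = []   # events written but never fsynced

    def write_event(self, event):
        self.buffered.append(event)  # write() itself succeeds...

    def fsync(self):
        # ...but the flush to disk fails with errno 5 (EIO).
        raise OSError(5, "Input/output error")

    def read_events(self, after_crash=False):
        # The original process reads back its own buffered writes;
        # a reader after the crash sees only fsynced data.
        return self.on_disk if after_crash else self.on_disk + self.buffered

log = FlakyLog()
log.write_event("POST_SCRIPT_TERMINATED A")
try:
    log.fsync()
except OSError as e:
    print(f"fsync() failed - errno {e.errno}")
print(log.read_events())                  # the running DAGMan sees the event
print(log.read_events(after_crash=True))  # the restarted DAGMan does not
```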
(BTW, if you had all of your jobs logging to the same file, things would have ended up differently in this case. DAGMan would have seen the submitted event for job B (but, of course, no post script terminated event for job A), and it would have reported a fatal error that the DAG semantics were violated.)
Hopefully that all makes sense. In summary, I think the root cause is whatever is causing the input/output error when DAGMan is attempting to fsync the log file just after writing the post script terminated event.