
Re: [Condor-users] DAGman duplicating jobs on schedd restart

On Thu, 3 Nov 2011, Christopher Martin wrote:

So from what I can see it's like you say, it's as if the dagman can't tell
that the jobs have completed successfully. The job logs do indicate
completion though. I'm wondering, do the jobs all have to log to the same
log file? Currently I have each job logging to its own log file. All logs
for both the jobs and the dagman are logging to the same directory.
I've included snippets from a dagman.out that shows the state of things
before and after the schedd restart.

It's fine to have any combination of jobs logging to their own log files vs. jobs logging to a common log file. It's important, though, that jobs in separate DAGs not share log files (unless you're 100% sure the DAGs won't be run at the same time).
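One way to check the "separate DAGs must not share log files" rule before submitting is sketched below. It is only a rough illustration, not part of Condor or DAGMan: the function names and the idea of passing file contents in as dictionaries are assumptions made for the example. It relies only on the standard `JOB <name> <submit file>` line in a DAG file and the `log = <path>` line in a submit description file, compares log paths as plain strings, and does not expand macros or resolve relative directories.

```python
import re
from collections import defaultdict

# "JOB <node name> <submit file>" lines in a DAG file.
JOB_RE = re.compile(r"^\s*JOB\s+(\S+)\s+(\S+)", re.IGNORECASE)
# "log = <path>" lines in a submit description file.
LOG_RE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


def submit_files_in_dag(dag_text):
    """Return the submit file named on each JOB line of a DAG file."""
    return [m.group(2) for line in dag_text.splitlines()
            if (m := JOB_RE.match(line))]


def log_in_submit(submit_text):
    """Return the value of the last 'log = ...' line, or None if absent."""
    log = None
    for line in submit_text.splitlines():
        m = LOG_RE.match(line)
        if m:
            log = m.group(1)
    return log


def shared_logs(dags, submits):
    """Find log files used by node jobs of more than one DAG.

    dags:    {DAG file name: DAG file text}      (hypothetical input shape)
    submits: {submit file name: submit file text}
    Returns {log path: sorted list of DAG names sharing it}.
    """
    users = defaultdict(set)
    for dag_name, dag_text in dags.items():
        for sub in submit_files_in_dag(dag_text):
            log = log_in_submit(submits.get(sub, ""))
            if log:
                users[log].add(dag_name)
    return {log: sorted(names) for log, names in users.items()
            if len(names) > 1}
```

An empty result means no log path string appears in more than one DAG. Jobs within a single DAG may still share a log or use separate ones, which is fine either way.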

Can you send the following files?
* dagman.out
* the actual dag file
* the node job log files

If you do that, I'll take a look in more detail and see what I can figure out.

From your original email, it sounds like this problem happens consistently
when your schedd restarts -- is that right? If so, that eliminates the things that would be my first guesses as to the problem (e.g., some kind of transient log file reading error).

Kent Wenger
Condor Team