
Re: [Condor-users] problem with failure associated with LOG LINE CACHE



On Fri, 4 Jun 2010, Jameson Rollins wrote:

On Thu, 3 Jun 2010 10:57:52 -0500 (CDT), "R. Kent Wenger" <wenger@xxxxxxxxxxx> wrote:
Okay, it seems like there are several issues here.  One is, what caused
DAGMan to go into recovery mode?  Did you condor_hold/condor_release the
DAGMan job?  If not, did the submit machine go down, or did the schedd on
that machine crash?
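
For reference, holding and releasing the DAGMan job would look something
like this, with 1234.0 standing in for its hypothetical cluster id:

condor_hold 1234.0
condor_release 1234.0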

Thanks so much for the response, Kent.

After a couple more people ran into similar issues, I think we
determined that the problem was that the local disk the job log files
were being written to had filled up.  That at least seems to be what
caused things to go into recovery mode.
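
For what it's worth, a quick way to confirm that is to check the
partition holding the job logs (the /usr1 path from the log example
below), e.g. something like:

df -h /usr1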

Hmm, that's interesting. I can see why a full disk would goof up DAGMan when it tried to read events, but the fact that it went into recovery mode is still puzzling.

On Wed, 2 Jun 2010, Jameson Rollins wrote:
Note that at this point it's saying that only 1802 jobs are complete,
even though 7567 were reported complete before the LOG LINE CACHE flush
began.

That's probably because it got goofed up by the error reading the event.
The biggest question is what caused that error.

This is the issue that bothered me the most.  Somehow after the dag
failed and went into recovery mode, it seems to have lost track of how
many completed jobs there were, and ended up writing an incorrect rescue
dag.  Any idea what could have caused that?

Well, I'm wondering if the full disk caused that.

A few more questions relating to the event-reading error:

* Is it possible that you have some other Condor jobs that are using some
of the log files used by the jobs in your DAG?  That might cause the
problem that aborted the DAG.

My user had no other running dags or jobs of any kind.  However,
multiple jobs from this dag were using the same log files, but I think
that's normal.  For instance, a given submit file:

/home/jrollins/analyses/in/0.037/analysis.sub

which specified its log file as, say:

log = /usr1/jrollins/analyses/in/0.037/analysis.log

was being referenced multiple times in the dag with different parameters, i.e.:

JOB in:0.037:938000353-938003143 /home/jrollins/analyses/in/0.037/analysis.sub
VARS in:0.037:938000353-938003143 start="938000353" stop="938003143"
RETRY in:0.037:938000353-938003143 3

JOB in:0.037:938004408-938005441 /home/jrollins/analyses/in/0.037/analysis.sub
VARS in:0.037:938004408-938005441 start="938004408" stop="938005441"
RETRY in:0.037:938004408-938005441 3

Does that make sense?  I thought this was standard, but maybe this is
the incorrect way to handle it?

This is fine. The problem happens if you have jobs in two different DAGs using the same log file.
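
In other words, the setup to avoid would be something like this (names
hypothetical): two separate DAGs, say dag_A.dag and dag_B.dag, submitted
independently, whose node submit files both contain the same line:

log = /usr1/jrollins/shared.log

Sharing a log file among the nodes of a single DAG, as you're doing, is
fine.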

* Are any of your job log files on NFS?

No, all the job log files are specified to be on local disks.

Okay, that's good.

* Is in:76.800:947001628-947001765 one of the jobs that actually
finished, but was not considered finished by DAGMan?  (You can tell by
checking for a DONE at the end of that JOB line in the rescue DAG.)
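
For example, a node DAGMan considered complete would show up in the
rescue DAG with a trailing DONE, something like (submit file path
hypothetical):

JOB in:76.800:947001628-947001765 /home/jrollins/analyses/in/76.800/analysis.sub DONE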

No, it was not.  None of the rescue dags had this job marked as done.

You might also want to try running condor_check_userlogs on the set of all
log files used by jobs in your DAG, especially the log file used by the
in:76.800:947001628-947001765 job.
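
That would be something like (paths illustrative):

condor_check_userlogs /usr1/jrollins/analyses/in/*/analysis.log

so that it checks all of the logs together for consistency.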

This seems to indicate that things are okay, i.e., it ends with:

Log(s) are okay

Hmm, that's interesting.

Any other info that would be useful?

Well, at this point I think I need to look at things in more detail. The best approach would be for you to put the dag file, the dagman.out file, and all of the node job log files (preferably as a tarball) some place I can grab them via ftp. I want to take a closer look at what DAGMan thought it was doing...
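
Something along these lines would do it (file names hypothetical, and
assuming the dagman.out file is named after your dag file):

tar czf dagman-debug.tar.gz analysis.dag analysis.dag.dagman.out /usr1/jrollins/analyses/in/*/analysis.log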

Kent Wenger
Condor Team