
Re: [Condor-users] problem with failure associated with LOG LINE CACHE



On Thu, 3 Jun 2010 10:57:52 -0500 (CDT), "R. Kent Wenger" <wenger@xxxxxxxxxxx> wrote:
> Okay, it seems like there are several issues here.  One is, what caused 
> DAGMan to go into recovery mode?  Did you condor_hold/condor_release the 
> DAGMan job?  If not, did the submit machine go down, or did the schedd on 
> that machine crash?

Thanks so much for the response, Kent.

After a couple more people had similar issues, I think we determined
that the problem was a full local storage disk, where the job log files
were being written.  That at least seems to have been what caused things
to go into recovery mode.
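
For anyone else hitting this, a quick free-space check on the
partition holding the job logs makes it easy to spot; in our case
that would have been something like:

df -h /usr1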

> On Wed, 2 Jun 2010, Jameson Rollins wrote:
> > Note that at this point it's saying that only 1802 jobs are complete,
> > even though 7567 were reported complete before the LOG LINE CACHE flush
> > began.
> 
> That's probably because it got goofed up by the error reading the event.
> The biggest question is what caused that error.

This is the issue that bothered me the most.  Somehow after the dag
failed and went into recovery mode, it seems to have lost track of how
many completed jobs there were, and ended up writing an incorrect rescue
dag.  Any idea what could have caused that?

> A few more questions relating to the event-reading error:
> 
> * Is it possible that you have some other Condor jobs that are using some 
> of the log files used by the jobs in your DAG?  That might cause the 
> problem that aborted the DAG.

My user had no other running dags or jobs of any kind.  However,
multiple jobs from this dag were using the same log files, which I
think is normal.  For instance, a given submit file:

/home/jrollins/analyses/in/0.037/analysis.sub

which specified its log file as, say:

log = /usr1/jrollins/analyses/in/0.037/analysis.log

was being called multiple times in the dag with different parameters, i.e.:

JOB in:0.037:938000353-938003143 /home/jrollins/analyses/in/0.037/analysis.sub
VARS in:0.037:938000353-938003143 start="938000353" stop="938003143"
RETRY in:0.037:938000353-938003143 3

JOB in:0.037:938004408-938005441 /home/jrollins/analyses/in/0.037/analysis.sub
VARS in:0.037:938004408-938005441 start="938004408" stop="938005441"
RETRY in:0.037:938004408-938005441 3

Does that make sense?  I thought this was standard, but maybe this is
the incorrect way to handle it?
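
For context, here's roughly what the submit file itself looks like.
This is a trimmed sketch rather than the verbatim file (the
executable and arguments lines are placeholders), but the shared log
line and the $(start)/$(stop) macros that pick up the VARS values are
the relevant parts:

executable = analysis
arguments  = --start $(start) --stop $(stop)
log        = /usr1/jrollins/analyses/in/0.037/analysis.log
output     = analysis-$(start)-$(stop).out
error      = analysis-$(start)-$(stop).err
queue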

> * Are any of your job log files on NFS?

No, all the job log files are specified to be on local disks.

> * Is the in:76.800:947001628-947001765 one of the ones that actually was 
> finished, but not considered finished by DAGMan?  (You can tell by checking 
> for a DONE at the end of that JOB line in the rescue DAG.)

No, it was not.  None of the rescue dags had this job marked as done.
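
For reference, a node marked complete in the rescue dag would have
had a line like the following (the path is my guess based on the
naming pattern above; the trailing DONE is the marker in question):

JOB in:76.800:947001628-947001765 /home/jrollins/analyses/in/76.800/analysis.sub DONE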

> You might also want to try running condor_check_userlogs on the set of all 
> log files used by jobs in your DAG, especially the log file used by the
> in:76.800:947001628-947001765 job.

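I ran it over all the dag's log files at once, something along these
lines (the glob is illustrative):

condor_check_userlogs /usr1/jrollins/analyses/in/*/analysis.log
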
This seems to indicate that things are ok, i.e. it ends with:

Log(s) are okay

Any other info that would be useful?

jamie.
