
Re: [Condor-users] problem with failure associated with LOG LINE CACHE



On Wed, 2 Jun 2010, Jameson Rollins wrote:

Hi, folks.  I'm having some trouble with a vanilla-universe DAG and I
was wondering if someone here might have some insight into what
happened.

Yesterday I submitted a 10476-job dag to our pool.  As of 7:07 this
morning, 7567 jobs were complete.  Then, at 7:08, the dagman.out read
the following:

06/02 07:08:17 LOG LINE CACHE: Begin Flush

followed by 20,000 lines of logs with slightly different date stamp
formatting, ending with:

6/2 07:08:17 ERROR: failure to read job log
       A log event may be corrupt.  DAGMan will skip the event and try to
       continue, but information may have been lost.  If DAGMan exits
       unfinished, but reports no failed jobs, re-submit the rescue file
       to complete the DAG.
6/2 07:08:17     ------------------------------
6/2 07:08:17        Condor Recovery Complete
6/2 07:08:17     ------------------------------
06/02 07:08:17 LOG LINE CACHE: End Flush
06/02 07:08:17 Stopping the caching of log lines.
06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
06/02 07:08:17 Of 10476 nodes total:
06/02 07:08:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
06/02 07:08:17   ===     ===      ===     ===     ===        ===      ===
06/02 07:08:17  1802       0      383       0       0       8262       29

Okay, it seems like there are several issues here. The first is what caused DAGMan to go into recovery mode in the first place. Did you condor_hold/condor_release the DAGMan job? If not, did the submit machine go down, or did the schedd on that machine crash?

Note that at this point it's saying that only 1802 jobs are complete,
even though 7567 were reported complete before the LOG LINE CACHE flush
began.

That's probably because DAGMan's node counts got goofed up by the error reading the event. The biggest question is what caused that error.

A couple of seconds later I got the following:

06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in userlog submit event (71479218.0) doesn't match ID reported earlier by submit command (71625850.0)! Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't cause an abort.

When I tried to resubmit the rescue dag, it again came up with only 1802
complete jobs.

Yeah, that's correct for the rescue DAG, given that it only thought 1802 jobs were complete when the rescue DAG was generated. I don't think there's a problem with the rescue DAG itself.
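As an aside: if you investigate and become sure that a submit-event ID mismatch is harmless, the knob mentioned in that error message can be turned off in your condor_config (or a local config file). Only do that if you're really sure, since the check exists to catch exactly this kind of log corruption. A sketch of the setting:

```
# Disables DAGMan's abort when the job ID in a userlog submit event
# doesn't match the ID reported by condor_submit. Use with caution.
DAGMAN_ABORT_ON_SCARY_SUBMIT = False
```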

Anyone have any idea what happened here?  I've never seen anything like
this before.  I don't know what the "LOG LINE CACHE" is, nor why the
status of the dag would appear so messed up after the "flush" was
complete.

I found mention of a fixed bug that sounds very similar to this in the
release notes for 7.1.4 [0], but we're using 7.4.2.

A few more questions relating to the event-reading error:

* Is it possible that some other Condor jobs outside your DAG are writing to log files that jobs in your DAG also use? That could cause the problem that aborted the DAG.

* Are any of your job log files on NFS?

* Is in:76.800:947001628-947001765 one of the nodes that actually finished, but wasn't considered finished by DAGMan? (You can tell by checking for a DONE at the end of that node's JOB line in the rescue DAG.)
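A quick way to do that check (the file and node names below are made up; substitute your actual rescue DAG file and the in:76.800:... node name) is something like:

```shell
# A rescue DAG marks completed nodes by appending DONE to their JOB lines.
# Create a tiny two-line sample rescue file to demonstrate the check.
printf 'JOB nodeA nodeA.sub DONE\nJOB nodeB nodeB.sub\n' > sample.rescue

node=nodeA   # substitute e.g. in:76.800:947001628-947001765
if grep -q "^JOB $node .*DONE *$" sample.rescue; then
    echo "$node is marked DONE"
else
    echo "$node is NOT marked DONE"
fi
# prints: nodeA is marked DONE
```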

You might also want to try running condor_check_userlogs on the set of all log files used by jobs in your DAG, especially the log file used by the in:76.800:947001628-947001765 job.
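The important thing is to pass all the log files in a single invocation so the tool can cross-check events across files. A sketch, with hypothetical log file names (and guarded so it's harmless on a machine without Condor in the PATH):

```shell
# Pass every log file the DAG's node jobs write to in one invocation,
# so condor_check_userlogs can look for inconsistent events across them.
logs="nodes-a.log nodes-b.log nodes-c.log"   # substitute your real log files

if command -v condor_check_userlogs >/dev/null 2>&1; then
    condor_check_userlogs $logs
else
    echo "condor_check_userlogs not found in PATH"
fi
```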

Kent Wenger
Condor Team