
[Condor-users] problem with failure associated with LOG LINE CACHE



Hi, folks.  I'm having some trouble with a vanilla-universe DAG and I
was wondering if someone here might have some insight into what
happened.

Yesterday I submitted a 10476-job DAG to our pool.  As of 7:07 this
morning, 7567 jobs were complete.  Then, at 7:08, dagman.out read the
following:

06/02 07:08:17 LOG LINE CACHE: Begin Flush

followed by 20,000 lines of logs with slightly different date stamp
formatting, ending with:

6/2 07:08:17 ERROR: failure to read job log
        A log event may be corrupt.  DAGMan will skip the event and try to
        continue, but information may have been lost.  If DAGMan exits
        unfinished, but reports no failed jobs, re-submit the rescue file
        to complete the DAG.
6/2 07:08:17     ------------------------------
6/2 07:08:17        Condor Recovery Complete
6/2 07:08:17     ------------------------------
06/02 07:08:17 LOG LINE CACHE: End Flush
06/02 07:08:17 Stopping the caching of log lines.
06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
06/02 07:08:17 Of 10476 nodes total:
06/02 07:08:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
06/02 07:08:17   ===     ===      ===     ===     ===        ===      ===
06/02 07:08:17  1802       0      383       0       0       8262       29

Note that at this point it's saying that only 1802 jobs are complete,
even though 7567 were reported complete before the LOG LINE CACHE flush
began.

A couple of seconds later, I got the following:

06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in userlog submit event (71479218.0) doesn't match ID reported earlier by submit command (71625850.0)!  Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't cause an abort.
06/02 07:08:31 Aborting DAG...
06/02 07:08:31 Writing Rescue DAG to injections.dag.rescue001...
06/02 07:08:32 Removing submitted jobs...
06/02 07:08:32 Removing any/all submitted Condor/Stork jobs...
06/02 07:08:32 Running: condor_rm -const DAGManJobId' '==' '71479114
06/02 07:08:32 Note: 0 total job deferrals because of -MaxJobs limit (2500)
06/02 07:08:32 Note: 0 total job deferrals because of -MaxIdle limit (50)
06/02 07:08:32 Note: 0 total job deferrals because of node category throttles
06/02 07:08:32 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
06/02 07:08:32 Note: 0 total POST script deferrals because of -MaxPost limit (0)
06/02 07:08:32 Warning: ReadMultipleUserLogs destructor called, but still monitoring 12 log(s)!
06/02 07:08:32 **** condor_scheduniv_exec.71479114.0
 (condor_DAGMAN) pid 13492 EXITING WITH STATUS 1
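
For reference, the error message mentions the DAGMAN_ABORT_ON_SCARY_SUBMIT
knob.  If I'm reading the manual correctly, it could be set in the
condor_config like this (we have NOT set it; this is just my reading):

```
# My reading of the manual, not something we have configured.
# The default is True; False would make DAGMan log the job-ID mismatch
# as a warning instead of aborting the DAG.
DAGMAN_ABORT_ON_SCARY_SUBMIT = False
```

I haven't changed it, though, since the ID mismatch itself looks like a
symptom of the corrupted state rather than the underlying cause.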

When I tried to resubmit the rescue DAG, it again came up with only 1802
complete jobs.

Does anyone have any idea what happened here?  I've never seen anything
like this before.  I don't know what the "LOG LINE CACHE" is, nor why
the status of the DAG would appear so messed up after the "flush"
completed.
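
The only related settings I could find in the manual are the dagman.out
caching knobs, which we have never set explicitly ourselves, so perhaps
they're set somewhere in our pool's config.  My (possibly wrong) reading
of what they do:

```
# Hedged -- my reading of the 7.4 manual; we don't set these anywhere.
# DAGMAN_DEBUG_CACHE_ENABLE turns on caching of dagman.out log lines
# during recovery; DAGMAN_DEBUG_CACHE_SIZE is the cache size in bytes
# before a flush is triggered.
DAGMAN_DEBUG_CACHE_ENABLE = True
DAGMAN_DEBUG_CACHE_SIZE = 5242880
```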

I found mention of a fixed bug that sounds very similar to this in the
release notes for 7.1.4 [0], but we're using 7.4.2.

Any insight would be much appreciated.

jamie.


[0] http://www.cs.wisc.edu/condor/manual/v7.1/8_3Development_Release.html :

  Fixed a bug where if log line caching was enabled in condor_dagman and
  condor_dagman failed during the recovery process, the cache would stay
  active. Now the cache is disabled in all cases at the end of recovery.
