Re: [Condor-users] problem with failure associated with LOG LINE CACHE
- Date: Thu, 9 Sep 2010 17:59:30 -0700
- From: Stuart Anderson <anderson@xxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] problem with failure associated with LOG LINE CACHE
Do you want to provide the requested log files, or should we forget about this one for now?
On Jun 4, 2010, at 7:51 AM, R. Kent Wenger wrote:
> On Fri, 4 Jun 2010, Jameson Rollins wrote:
>> On Thu, 3 Jun 2010 10:57:52 -0500 (CDT), "R. Kent Wenger" <wenger@xxxxxxxxxxx> wrote:
>>> Okay, it seems like there are several issues here. One is, what caused
>>> DAGMan to go into recovery mode? Did you condor_hold/condor_release the
>>> DAGMan job? If not, did the submit machine go down, or did the schedd on
>>> that machine crash?
>> Thanks so much for the response, Kent.
>> After a couple more people had similar issues, I think we determined
>> that the problem was a full local storage disk, where the job log files
>> were being written. That at least seems to have been what caused things
>> to go into recovery mode.
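>> (A quick way to confirm this condition, for anyone hitting it later, is to
>> check free space on the partition holding the job logs, e.g.:
>>   df -h /usr1
>> since that's where our analysis.log files live.)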
> Hmm, that's interesting. I can see why that would goof up DAGMan when it tried to read events, but the fact that it went into recovery mode is still puzzling.
>>> On Wed, 2 Jun 2010, Jameson Rollins wrote:
>>>> Note that at this point it's saying that only 1802 jobs are complete,
>>>> even though 7567 were reported complete before the LOG LINE CACHE flush
>>> That's probably because it got goofed up by the error reading the event.
>>> The biggest question is what caused that error.
>> This is the issue that bothered me the most. Somehow after the dag
>> failed and went into recovery mode, it seems to have lost track of how
>> many completed jobs there were, and ended up writing an incorrect rescue
>> dag. Any idea what could have caused that?
> Well, I'm wondering if the full disk caused that.
>>> A few more questions relating to the event-reading error:
>>> * Is it possible that you have some other Condor jobs that are using some
>>> of the log files used by the jobs in your DAG? That might cause the
>>> problem that aborted the DAG.
>> My user had no other running dags or jobs of any kind. However, multiple
>> jobs from this dag were using the same log files, but I think that's
>> normal. For instance, a given submit file, which specified its log file
>> as, say:
>> log = /usr1/jrollins/analyses/in/0.037/analysis.log
>> was being called multiple times in the dag with different parameters, i.e.:
>> JOB in:0.037:938000353-938003143 /home/jrollins/analyses/in/0.037/analysis.sub
>> VARS in:0.037:938000353-938003143 start="938000353" stop="938003143"
>> RETRY in:0.037:938000353-938003143 3
>> JOB in:0.037:938004408-938005441 /home/jrollins/analyses/in/0.037/analysis.sub
>> VARS in:0.037:938004408-938005441 start="938004408" stop="938005441"
>> RETRY in:0.037:938004408-938005441 3
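>> For reference, a minimal sketch of what analysis.sub looks like (the
>> executable and argument names here are placeholders; the log line is the
>> real one, and $(start)/$(stop) get filled in from the VARS above):
>>
>>   universe   = vanilla
>>   executable = /home/jrollins/analyses/in/0.037/analysis
>>   arguments  = --start $(start) --stop $(stop)
>>   log        = /usr1/jrollins/analyses/in/0.037/analysis.log
>>   queue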
>> Does that make sense? I thought this was standard, but maybe this is
>> the incorrect way to handle it?
> This is fine. The problem happens if you have jobs in two different DAGs using the same log file.
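> To illustrate (with hypothetical file names): the pattern to avoid is a node
> in a.dag and a node in a separately-submitted b.dag whose submit files both
> say, e.g.,
>   log = /usr1/jrollins/shared.log
> Each DAGMan then reads the other DAG's events out of shared.log and can get
> confused. Sharing one log among the nodes of a single DAG, as you're doing,
> is fine.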
>>> * Are any of your job log files on NFS?
>> No, all the job log files are specified to be on local disks.
> Okay, that's good.
>>> * Is the in:76.800:947001628-947001765 one of the ones that actually was
>>> finished, but not considered finished by DAGMan? (You can tell by checking
>>> for a DONE at the end of that JOB line in the rescue DAG.)
>> No, it was not. None of the rescue dags had this job marked as done.
>>> You might also want to try running condor_check_userlogs on the set of all
>>> log files used by jobs in your DAG, especially the log file used by the
>>> in:76.800:947001628-947001765 job.
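>>> For example, assuming the per-directory layout you showed, something like
>>> this would cover every node log at once:
>>>   condor_check_userlogs /usr1/jrollins/analyses/in/*/analysis.log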
>> This seems to indicate that things are ok, i.e. it ends with:
>> Log(s) are okay
> Hmm, that's interesting.
>> Any other info that would be useful?
> Well, at this point I think I need to look at things in more detail. I think the best approach would be for you to put the dag file, dagman.out file, and all of the node job log files (preferably as a tarball) some place I can grab them via ftp. I want to take a closer look at what DAGMan thought it was doing...
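> For example (the dag file name here is a placeholder for your actual one;
> DAGMan writes its debug log as <dagfile>.dagman.out):
>   tar czf dagman-debug.tar.gz analysis.dag analysis.dag.dagman.out \
>       /usr1/jrollins/analyses/in/*/analysis.log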
> Kent Wenger
> Condor Team
Stuart Anderson anderson@xxxxxxxxxxxxxxxx