
Re: [Condor-users] DAGman failed to detect a node's status, seems because it could not read its log.



On Mon, 21 Dec 2009, dawnsong wrote:

It is finally fixed. I set all the nodes to share the same log file.

As the Condor 7.3 manual says, DAGMan supports separate logs for separate nodes,
but it seems that having all nodes share one log file lets DAGMan run
without complaining about "ERROR: failure to read job log".

This confused me, since I have already upgraded to 7.4.

From your earlier email, it sounds like your log file(s) are on NFS; is
that correct?  If so, that's most likely the source of your problems.

When you upgraded the DAGMan version, did you re-run the DAG from scratch, or did you run it in recovery mode? Do you still have the log file that generated the error? If so, I'd like to take a look at it. I'm guessing that if the file was on NFS, you got corrupted events because of two events being written at the same time.

Kent Wenger
Condor Team