
Re: [Condor-users] dagman jobs fail unpredictably


This is a well-known issue with NFS file locking -- it's consistently unreliable, and as a result we simply can't support DAGMan's use when your userlogs are written to NFS.

The good news is that in 99% of cases, it's easy to specify that they be written to a local directory instead (even if all your job I/O is being done via NFS -- the userlogs are written on the submit side), and when you do, the problem will go away.
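
As a rough sketch of what that change might look like, the fragment below points the "log" command in a node job's submit description file at a directory on the submit host's local disk. The local path and the executable and output file names are hypothetical, chosen only for illustration; substitute whatever local directory exists on your submit machine.

```
# Submit description file for one DAG node job (illustrative sketch only).
# The executable name, output names, and local log path are hypothetical.
executable = mdr.exe

# Userlog written to local disk on the submit host, not to NFS
log        = /var/tmp/condor_logs/mdr.log

# Job I/O can stay where it is; only the userlog needs to move
output     = mdr.out
error      = mdr.err

queue
```

Only the "log" line matters here. The rest is filler to make the fragment complete.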

Let us know if that doesn't solve things for you.  Thanks,


On Feb 4, 2005, at 6:20 AM, Dr Ian C. Smith wrote:

Dear All,

We've recently been using DAGMan to get long-running
jobs working on our pool using the DAG recursion idea.
The submit host is a Solaris 9 box and all of the
execution PCs are Win XP/Intel. While the majority
of jobs work fine and run to completion, occasionally
some die. This error message appears in file.dagman.out:

2/4 02:33:39 Event: ULOG_EXECUTE for Condor Job A (13506.0.0)
2/4 02:33:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 02:53:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:13:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:33:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 04:33:56 read error on log /ffs/mat_alanca/condor/jobs/cl1/cmi600/mdr.log
2/4 04:33:56 ERROR: failure to read job log
A log event may be corrupt. DAGMan will skip the event and try to
continue, but information may have been lost. If DAGMan exits
unfinished, but reports no failed jobs, re-submit the rescue file
to complete the DAG

The log files are stored on an NFS-mounted filesystem, which I suppose could cause problems, but I can't understand why this would affect some jobs and not others running concurrently. The actual DAGMan process still seems to be running happily on the submit host.

As a workaround, can Condor be set up to resubmit the rescue DAG automatically?

yours perplexed,


Dr Ian C. Smith,
e-Science team,
University of Liverpool,
Computing Services Department.

Condor-users mailing list

Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685