Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] dagman jobs fail unpredictably

Date: Fri, 04 Feb 2005 12:20:38 +0000
From: "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx>
Subject: [Condor-users] dagman jobs fail unpredictably

Dear All,

We've recently been using DAGMan to get long-running
jobs working on our pool using the DAG recursion idea.
The submit host is a Solaris 9 box and all of the
execution PCs are Win XP/Intel. While the majority
of jobs work fine and run to completion, some occasionally
die. This error message appears in file.dagman.out:

2/4 02:33:39 Event: ULOG_EXECUTE for Condor Job A (13506.0.0)
2/4 02:33:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 02:53:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:13:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:33:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 04:33:56 read error on log /ffs/mat_alanca/condor/jobs/cl1/cmi600/mdr.log
2/4 04:33:56 ERROR: failure to read job log
        A log event may be corrupt.  DAGMan will skip the event and try to
        continue, but information may have been lost.  If DAGMan exits
        unfinished, but reports no failed jobs, re-submit the rescue file
        to complete the DAG


The log files are stored on an NFS-mounted filesystem, which I suppose
could cause problems, but I can't understand why this would affect some
jobs and not others running concurrently. The actual DAGMan process
still seems to be running happily on the submit host.
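
To see which DAGs are affected and when, the read errors can be pulled
out of each .dagman.out file. What follows is only a minimal Python sketch,
assuming every message starts on its own line with the "M/D HH:MM:SS"
timestamp seen in the excerpt above; the function name and the sample
text are made up for illustration and are not part of Condor.

```python
import re

# Timestamp prefix as it appears in the excerpt, e.g. "2/4 04:33:56".
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_log_read_errors(text):
    """Return (timestamp, message) pairs for job-log read failures.

    Hypothetical helper: it only matches the two message forms quoted
    in this post and knows nothing else about DAGMan's output.
    """
    hits = []
    for line in text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue  # continuation lines carry no timestamp
        stamp, message = match.groups()
        if (message.startswith("read error on log")
                or "failure to read job log" in message):
            hits.append((stamp, message))
    return hits


# Sample text copied from the excerpt in this post.
sample = """\
2/4 03:33:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 04:33:56 read error on log /ffs/mat_alanca/condor/jobs/cl1/cmi600/mdr.log
2/4 04:33:56 ERROR: failure to read job log
        A log event may be corrupt.  DAGMan will skip the event and try to
"""

for stamp, message in find_log_read_errors(sample):
    print(stamp, "|", message)
```

Running this over the .dagman.out files of both failing and healthy jobs
would at least show whether the errors cluster around particular times,
which might point at the NFS server rather than at individual jobs.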

As a workaround, can Condor be set up to resubmit the rescue DAG automatically?

yours perplexed,

-ian.

-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool,
Computing Services Department.

Follow-Ups:
- Re: [Condor-users] dagman jobs fail unpredictably
  - From: Peter F. Couvares
