Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] condor-g job status unknown caused dagman to exit

Date: Tue, 20 Apr 2010 16:14:42 -0400
From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] condor-g job status unknown caused dagman to exit


On Apr 20, 2010, at 12:59 , R. Kent Wenger wrote:

On Tue, 20 Apr 2010, Peter Doherty wrote:
To the condor experts,
I'm using Condor-G and this job (see logs below) caused a large DAG to exit unexpectedly this morning. I understand the concept of BADEVENTS (in this case the job started executing after it was supposed to be done).
But I want to know how to prevent this from happening.
From what I can see, the job's state became unknown (What causes this? I see it happen a lot actually) and then Condor abandoned the job, but then the job called back home and instead of just ignoring it, it heard the call, but then got confused, and dagman exited.
It seems like a bug to me.
You could try working around it by setting a non-default value for
DAGMAN_ALLOW_EVENTS (see
http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#19172
in the manual).
I'm actually not quite sure what DAGMan will do if it doesn't immediately consider the execute event "bad" -- it might still run into problems farther along, but it seems worth trying.



I have changed the default.
DAGMAN_ALLOW_EVENTS = 1

The only other value I can use is 5, and the manual basically says "don't do this":
"A value of 5 will never abort the DAG because of a bad event. But this value should almost never be used, because the "job re-run after terminated event" bug breaks the semantics of the DAG."

I can try setting it to 5, I just hope it doesn't create more problems than it solves.
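
For reference, DAGMAN_ALLOW_EVENTS is a bitmask, so 5 is simply 1 with the extra "job re-run after terminated event" bit (4) added. A minimal Python sketch of how such a value decodes is below. The bit descriptions are paraphrased from the 7.4 manual entry linked above and should be checked against it; the names ALLOW_BITS, decode_allow_events and never_aborts are hypothetical helpers, not part of Condor.

```python
# Bit meanings paraphrased from the Condor 7.4 manual entry for
# DAGMAN_ALLOW_EVENTS; verify against the manual before relying on them.
ALLOW_BITS = {
    1: "all bad events except 'job re-run after terminated event'",
    2: "terminated/aborted event combination",
    4: "'job re-run after terminated event' bug",
    8: "garbage or orphan events",
    16: "execute or terminate event before the job's submit event",
    32: "two terminated events per job",
    64: "duplicated events in general",
}


def decode_allow_events(value):
    """Return descriptions of the bits set in a DAGMAN_ALLOW_EVENTS value."""
    return [desc for bit, desc in sorted(ALLOW_BITS.items()) if value & bit]


def never_aborts(value):
    """True if bits 1 and 4 are both set, so no bad event aborts the DAG."""
    return (value & 5) == 5


if __name__ == "__main__":
    for setting in (1, 5):
        print(setting, decode_allow_events(setting), never_aborts(setting))
```

Read this way, going from 1 to 5 changes exactly one thing: the re-run-after-terminated case stops being fatal, which is the case the manual warns about.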

It seems like the real question is whether this is something that should be dealt with at the DAGMan level or the Condor level -- we'll have to discuss that on our end. If we deal with it at the DAGMan level, we'd have to allow a new job state transition -- from completed/failed to executing to possibly completed/succeeded.
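
The extra transition described above can be pictured as one more entry in a table of allowed state changes. This is only a hypothetical sketch to illustrate the idea: the state names, the STRICT and RELAXED tables, and first_bad_event are invented for the example and are not DAGMan's actual event-checking code.

```python
# Hypothetical sketch; state names and tables are illustrative only
# and do not reflect DAGMan's real implementation.
STRICT = {
    ("submitted", "executing"),
    ("executing", "succeeded"),
    ("executing", "failed"),
}

# The additional transition under discussion: a job already recorded
# as failed reports that it is executing again.
RELAXED = STRICT | {("failed", "executing")}


def first_bad_event(events, allowed, start="submitted"):
    """Return the index of the first disallowed transition, or None."""
    state = start
    for index, new_state in enumerate(events):
        if (state, new_state) not in allowed:
            return index
        state = new_state
    return None


if __name__ == "__main__":
    # Job is abandoned (failed), then calls home and finishes anyway.
    log = ["executing", "failed", "executing", "succeeded"]
    print(first_bad_event(log, STRICT))
    print(first_bad_event(log, RELAXED))
```

With the strict table the late execute event is flagged as bad; with the relaxed table the same sequence passes, at the cost of a job's final status being able to change after it was first recorded as failed.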

So it sounds like you're saying that the situation that I'm seeing is within the design parameters, but can cause unexpected failures, so not exactly a bug, but more of a design limitation?


Peter

References:
- [Condor-users] condor-g job status unknown caused dagman to exit
  - From: Peter Doherty
- Re: [Condor-users] condor-g job status unknown caused dagman to exit
  - From: R. Kent Wenger
