
Re: [HTCondor-users] Getting DAG node to fail on file transfer error



On 03/11/2014 16:14, R. Kent Wenger wrote:

> Yes, getting the job to fail instead of going on hold is totally independent of whether the job is part of a DAG or not. So, you need to add an appropriate periodic_remove expression to your submit file(s).

Thank you. Presumably DAGMan will then see the job status as failed.

What attribute should I look for if I want to remove *all* held jobs? That is, which ClassAd attribute identifies a job as being held?

By experiment, a manually-held job has

HoldReason = "via condor_hold (by user brian)"
HoldReasonCode = 1
PeriodicHold = false
NumSystemHolds = 0
HoldReasonSubCode = 0
OnExitHold = false

and when subsequently released:

OnExitHold = false
LastHoldReasonSubCode = 0
LastHoldReasonCode = 1
NumSystemHolds = 0
PeriodicHold = false
LastHoldReason = "via condor_hold (by user brian)"

So maybe:

periodic_remove = HoldReasonCode =!= UNDEFINED

?

It seems to work. The downside is that the hold reason is lost: dagman.out only reports ULOG_JOB_ABORTED.
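
For comparison, here is a sketch of an alternative to the HoldReasonCode test, keyed on the job's state instead. It assumes the usual HTCondor encoding in which JobStatus == 5 means Held. It is untested here, and it would presumably lose the hold reason in the same way.

```
# Submit-file fragment (a sketch, not a verified configuration).
# Assumption: JobStatus == 5 is the Held state.
# Remove the job from the queue as soon as it is found on hold,
# whatever the reason for the hold.
periodic_remove = (JobStatus == 5)
```

Unlike HoldReasonCode =!= UNDEFINED, this does not depend on which hold-related attributes happen to be present in the job ad.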

Regards,

Brian.