Re: [HTCondor-users] Behaviour of DAGMAN_ALWAYS_RUN_POST in absence of PRE

Date: Wed, 05 Apr 2017 14:18:19 +0000

From: Kent WENGER <wenger@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] Behaviour of DAGMAN_ALWAYS_RUN_POST in absence of PRE

Brian Candler wrote:

> On 03/04/2017 19:02, Dimitri Maziuk wrote:
>> I wonder: in what scenario a post script that starts with
>>
>> #!/bin/sh
>> if [ $1 -ne 0 ] ; then exit $1 ; fi
>>
>> would cause problems?

> Only when you forget to do it.

> We recently had a problem when a broken dataset ended up getting
> deployed. It was controlled by a top-level dag with subdags. After
> grubbing through various condor log files, it turns out it was due to
> one of the inner dags failing, but the top level DAG had POST scripts to
> notify progress, and they weren't handling $RETURN properly.

> So I was just wondering if it was possible to idiot-proof this.

I'm liking the idea of dealing with this with one line in your DAG file, for example:

RUN_POST_ON_JOB_FAIL ALL_NODES false

(On the other hand, doing it in configuration rather than with a DAG command would make it easier to do across splices and sub-DAGs, but you'd have no way to do it on a per-node basis then.)
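Absent such a command, the guard Dimitri describes has to live in each POST script itself. A minimal sketch, assuming the DAG file passes $RETURN as the script's first argument; the node name NodeA, the file name post_guard.sh and the notification step are hypothetical placeholders:

```shell
#!/bin/sh
# Hypothetical POST script, assumed to be wired up in the DAG file as:
#   SCRIPT POST NodeA post_guard.sh $RETURN
# so that "$1" carries the node job's exit status.

post_guard() {
    job_status="$1"
    # Anything other than a literal 0 (including an empty value) counts as failure.
    if [ "$job_status" != "0" ]; then
        echo "node job failed (status: ${job_status:-unknown}); skipping notification" >&2
        # Return 1 rather than the raw value, so the exit code stays in the
        # range a shell accepts even if the reported status is unusual.
        return 1
    fi
    echo "node job succeeded; notifying progress"   # placeholder for the real work
    return 0
}

# Act only when an argument was supplied, which is the case when the
# script is invoked with $RETURN as sketched above.
if [ "$#" -gt 0 ]; then
    post_guard "$1"
    exit $?
fi
```

The comparison is done on strings, so an empty or non-numeric status is treated as a failure instead of slipping through as success.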

Kent