
Re: [HTCondor-users] DAGman and job removal



Hi Michael,

I think the best way to do what you're asking is to use a new feature we added in HTCondor v9.1.0. Try setting the following configuration option:

DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True

When this is set, DAGMan watches for failed jobs and immediately resubmits them on hold. The rest of the DAG workflow continues running as usual. Once all your other jobs have completed, you can condor_rm the held job and the workflow will finish (the DAG will still be considered a failure, but by then the other jobs are all done).
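For the cleanup step, a sketch of the usual sequence (the constraint and cluster ID below are placeholders, not values from your pool):

```shell
# List held jobs to find the failed node job DAGMan put on hold.
condor_q -hold

# Once the rest of the DAG has finished, remove the held job by its
# cluster ID; the DAG then completes (still reported as failed).
condor_rm 1234567
```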

Does that seem like it would work for you?

Mark

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain <gthain@xxxxxxxxxxx>
Sent: Thursday, October 14, 2021 12:04 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] DAGman and job removal

On 10/14/21 10:36 AM, Michael Pelletier via HTCondor-users wrote:
Hi folks,

I’m hoping someone with experience in DAGman can help me out here.

I have a single-node DAG which runs a suite of regression tests in a single cluster - up to 200 at a time on some occasions - on an FPGA design, and then combines all of the results using a SCRIPT POST which summarizes the tests which were submitted, passed, failed, or missing. Occasionally a test will find a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight. The users would like to be able to terminate such jobs without impacting the remaining running tests in the DAG.



This isn't a very condor solution, but assuming you are already launching your tests from a shell script, you could

ulimit -t limit_of_cpu_seconds

and have the script always exit 0 (telling DAGMan the node succeeded, so its successor nodes will run).
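A minimal POSIX-sh sketch of that idea (the function name, the CPU cap, and the log line are placeholders, not part of any HTCondor interface):

```shell
# Run a test command under a CPU-time cap, but always report success,
# so DAGMan keeps running the sibling tests and the POST script can
# summarize the real result later.
run_with_cpu_cap() {
    limit_seconds=$1
    shift
    (
        ulimit -t "$limit_seconds"  # kernel sends SIGXCPU past the cap
        "$@"
    )
    echo "test exit status: $?"     # real status, for the summary step
    return 0                        # tell DAGMan the node succeeded
}
```

Used from the submit wrapper as e.g. `run_with_cpu_cap 3600 ./run_one_test args...`, with the POST script parsing the logged statuses to build its pass/fail summary.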


-greg