
Re: [HTCondor-users] DAGman and job removal



Hi Karan,

As with all other configuration knobs, this can be applied globally or set at a per-DAG level. To make the change globally, put the setting in your condor_config or condor_config.local file.

To set this at a per-DAG level, put it in a new file called (for example) mydag.config, then add the following line to your .dag file:

CONFIG mydag.config
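
As a rough sketch of the per-DAG setup, the shell commands below write both files into the current directory. The node name "regression" and the submit file "regression.sub" are placeholders I made up for illustration; only the knob name and the CONFIG line come from this thread.

```shell
# Per-DAG configuration file: holds only the knob discussed in this thread.
cat > mydag.config <<'EOF'
DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True
EOF

# DAG file that points DAGMan at the per-DAG configuration.
# "regression" and "regression.sub" are hypothetical placeholder names.
cat > mydag.dag <<'EOF'
CONFIG mydag.config
JOB regression regression.sub
EOF
```

You would then submit it as usual with condor_submit_dag mydag.dag.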

Hope this helps,

Mark


On 10/14/21 3:53 PM, Karan Vahi wrote:
Hi Mark

Interesting.
Is this a global knob that will affect all workflows? Or can it also be set at a per-DAG level?

Thanks
Karan

On Oct 14, 2021, at 1:46 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:

Hi Michael,

I think the best way to do what you're asking is to use a new feature we added in HTCondor v9.1.0. Try setting the following configuration option:

DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True

When this is set, DAGMan will watch for failed jobs and immediately resubmit them on hold. The rest of the DAG workflow will continue running as usual. When all your other jobs have completed, you can condor_rm the failed job and the workflow will finish (it will still be considered a failure, but the other jobs are all done now).

Does that seem like it would work for you?

Mark

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain <gthain@xxxxxxxxxxx>
Sent: Thursday, October 14, 2021 12:04 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] DAGman and job removal

On 10/14/21 10:36 AM, Michael Pelletier via HTCondor-users wrote:
Hi folks,

I'm hoping someone with experience in DAGman can help me out here.

I have a single-node DAG which runs a suite of regression tests in a single cluster - up to 200 at a time on some occasions - on an FPGA design, and then combines all of the results using a SCRIPT POST which summarizes the tests which were submitted, passed, failed, or missing. Occasionally a test will find a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight, and the users would like to be able to terminate such jobs without impacting the remaining running tests in the DAG.



This isn't a very condor solution, but assuming you are already launching your tests from a shell script, you could

ulimit -t limit_of_cpu_seconds

and have the script exit 0 (telling DAGMan the node succeeded, so that successor nodes can proceed).
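
A minimal sketch of that idea, assuming the tests are launched from a POSIX-style shell whose ulimit supports -t (bash and dash do). The function name run_limited and the variable LAST_STATUS are made up for illustration; they are not part of HTCondor.

```shell
# run_limited LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND under a CPU-time limit and always returns 0.
# The command's real exit status is left in LAST_STATUS (hypothetical name)
# so a summary step can still tell which tests failed or were killed.
run_limited() {
    limit=$1
    shift
    # The subshell keeps the CPU limit from applying to the wrapper itself.
    (
        ulimit -t "$limit" || exit 125
        "$@"
    )
    LAST_STATUS=$?
    return 0
}
```

A wrapper script would call something like run_limited 3600 ./my_test (a placeholder command), record LAST_STATUS wherever the post script can find it, and then exit 0.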


-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
