
Re: [HTCondor-users] DAGman and job removal



Hi Mark

Interesting.
Is this a global knob that will affect all workflows, or can it also be set at a per-DAG level?

Thanks
Karan

> On Oct 14, 2021, at 1:46 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
> 
> Hi Michael,
> 
> I think the best way to do what you're asking is to use a new feature we added in HTCondor v9.1.0. Try setting the following configuration option:
> 
> DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True
> 
> When this is set, DAGMan will watch for failed jobs and immediately resubmit them on hold. The rest of the DAG workflow will continue running as usual. When all your other jobs have completed, you can condor_rm the failed job and the workflow will finish (it will still be considered a failure, but the other jobs are all done now).
> 
> Does that seem like it would work for you?
> 
> Mark
> 
> ________________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain <gthain@xxxxxxxxxxx>
> Sent: Thursday, October 14, 2021 12:04 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] DAGman and job removal
> 
> On 10/14/21 10:36 AM, Michael Pelletier via HTCondor-users wrote:
> Hi folks,
> 
> I'm hoping someone with experience in DAGman can help me out here.
> 
> I have a single-node DAG which runs a suite of regression tests in a single cluster - up to 200 at a time on some occasions - on an FPGA design, and then combines all of the results using a SCRIPT POST which summarizes the tests which were submitted, passed, failed, or missing. Occasionally a test will find a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight, and the users would like to be able to terminate such jobs without impacting the remaining running tests in the DAG.
> 
> 
> 
> This isn't a very condor solution, but assuming you are already launching your tests from a shell script, you could
> 
> ulimit -t limit_of_cpu_seconds
> 
> and have the script exit 0 (telling DAGMan the node succeeded, so any successor nodes can still run).
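
A minimal sketch of the ulimit idea above, assuming the tests are already launched from a bash wrapper script. The function name run_with_cpu_limit and the variable last_status are hypothetical, made up for illustration; they are not part of HTCondor or of the original poster's setup.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): run one test under a CPU-time cap.
# Usage: run_with_cpu_limit LIMIT_CPU_SECONDS COMMAND [ARGS...]
run_with_cpu_limit() {
    local limit="$1"
    shift
    last_status=0
    # Use a subshell so the limit applies only to this test and its children,
    # not to the calling script.
    (
        ulimit -t "$limit" || exit 1
        "$@"
    ) || last_status=$?
    # Record the real exit status so a later summary step can still see it.
    echo "test '$*' finished with status $last_status" >&2
    # Always return 0 so a runaway or failed test does not fail the DAG node.
    return 0
}
```

A wrapper might call it as `run_with_cpu_limit 3600 ./run_test.sh case_042` (both names are placeholders) and then exit 0. Note that `ulimit -t` limits CPU seconds, not wall-clock time, so a test that hangs while sleeping or waiting on I/O would not be stopped by this.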
> 
> 
> -greg
> 