Re: [HTCondor-users] DAGman and job removal


Date: Thu, 14 Oct 2021 12:04:55 -0500

Subject: Re: [HTCondor-users] DAGman and job removal

On 10/14/21 10:36 AM, Michael Pelletier via HTCondor-users wrote:

Hi folks,

I’m hoping someone with experience in DAGman can help me out here.

I have a single-node DAG which runs a suite of regression tests in a single cluster - up to 200 at a time on some occasions - on an FPGA design, and then combines all of the results using a SCRIPT POST which summarizes the tests which were submitted, passed, failed, or missing. Occasionally a test will find a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight, and the users would like to be able to terminate such jobs without impacting the remaining running tests in the DAG.

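This isn't a very HTCondor-like solution, but assuming you are already launching your tests from a shell script, you could set a CPU-time limit in that script with

ulimit -t limit_of_cpu_seconds

and have the script exit 0, so that DAGMan sees the job as successful and the rest of the DAG (the POST script and any successor nodes) proceeds normally. A test that exceeds the limit is killed by a signal, while the other jobs in the cluster are unaffected.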
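
A minimal sketch of such a wrapper, assuming bash is available on the execute nodes. The function name run_with_cpu_limit, the CPU_LIMIT variable, and the 3600-second default are made up for illustration and are not part of HTCondor. The limit is set inside a subshell so it applies only to the test, and the wrapper always reports success, on the assumption that the POST script works out pass/fail/missing from the test output rather than from the exit code.

```shell
#!/bin/bash
# Hypothetical wrapper: run one test under a CPU-time limit and always
# report success, so a runaway test does not fail the DAG node.

run_with_cpu_limit() {
    local limit=$1      # CPU seconds allowed for the test
    shift
    local status=0

    # Set the limit in a subshell so it applies only to the test process.
    (
        ulimit -t "$limit" || exit 1
        exec "$@"
    ) || status=$?

    # An exit status above 128 means the test was terminated by a signal,
    # which is what happens once the CPU-time limit is exceeded.
    if [ "$status" -gt 128 ]; then
        echo "test stopped by signal $((status - 128))" >&2
    fi

    return 0
}

# When invoked with a command line, run it under the limit and exit 0.
# CPU_LIMIT is an assumed environment variable with an arbitrary default.
if [ "$#" -gt 0 ]; then
    run_with_cpu_limit "${CPU_LIMIT:-3600}" "$@"
    exit 0
fi
```

The submit file would then name this wrapper as the executable and pass the real test command as its arguments. Because every exit code is masked, a test that was cut off shows up only through its stderr message and its incomplete output.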

-greg
