
Re: [HTCondor-users] DAGman and job removal



On 10/14/21 10:36 AM, Michael Pelletier via HTCondor-users wrote:

Hi folks,

 

I’m hoping someone with experience in DAGman can help me out here.

 

I have a single-node DAG which runs a suite of regression tests in a single cluster - up to 200 at a time on some occasions - on an FPGA design, and then combines all of the results using a SCRIPT POST which summarizes the tests which were submitted, passed, failed, or missing. Occasionally a test will find a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight, and the users would like to be able to terminate such jobs without impacting the remaining running tests in the DAG.

 


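This isn't a very HTCondor-native solution, but assuming you are already launching your tests from a shell script, you could set

ulimit -t limit_of_cpu_seconds

in that script, so that any test exceeding its CPU-time limit gets killed, and have the script exit 0 (telling DAGMan the job succeeded, so the rest of the DAG carries on as usual).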
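As a rough sketch of what that wrapper could look like (the function name run_with_cpu_limit and the test command passed to it are hypothetical, not anything from the original setup): the limit is set inside a subshell so it applies only to the test, and the wrapper always reports success. Note that ulimit -t counts CPU seconds, not wall-clock time, so a test that hangs while idle would not be caught this way.

```shell
#!/bin/bash
# Hypothetical helper: run_with_cpu_limit SECONDS COMMAND [ARGS...]
# Runs COMMAND under a CPU-time limit and always returns 0, so a
# runaway test that gets killed does not make the job look failed.
run_with_cpu_limit() {
    local limit="$1"
    shift
    local status=0
    # Subshell: the limit applies to the test only, not to this wrapper.
    (
        ulimit -t "$limit" || exit 125   # 125 is an arbitrary marker chosen here
        exec "$@"
    ) || status=$?
    if [ "$status" -ne 0 ]; then
        # A status above 128 usually means the test was killed by a signal,
        # which is what happens when the CPU limit is exceeded.
        echo "test ended with status $status (CPU limit ${limit}s)" >&2
    fi
    return 0
}
```

In the job's script this might be called as `run_with_cpu_limit 3600 ./run_test.sh "$@"` followed by `exit 0`, where the 3600-second limit and ./run_test.sh are placeholders. Since the real exit status is swallowed, the stderr message (or the missing result file) is what would let a POST script count the test as failed or missing.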


-greg