
[HTCondor-users] DAGman and job removal



Hi folks,

 

I’m hoping someone with DAGMan experience can help me out here.

 

I have a single-node DAG that runs a suite of regression tests on an FPGA design in a single cluster, sometimes as many as 200 at a time. A SCRIPT POST then combines the results into a summary of which tests were submitted, passed, failed, or missing. Occasionally a test hits a corner case in the design that sends it off into never-never land, racking up hours and hours of runtime with no end in sight. The users would like to be able to terminate such jobs without affecting the other tests still running in the DAG.

 

However, a condor_rm on a single job within the DAG causes a DAG abort, and DAGMan removes all of the other jobs in the DAG as it shuts down.

 

I’m trying to figure out whether there’s a way to stop a given job without causing a DAG abort. I don’t think it would help to have the job catch the kill signal that follows a condor_rm and exit on its own, because DAGMan would still see the condor_rm and treat it as a DAG abort.

 

I previously had each regression test in its own cluster/node, with a child job to run the POST, but the cluster IDs got to looking like the national debt clock. So I redesigned it to put all of the regressions in a single cluster/node. The one-node-per-test layout would have the same problem, though.

 

Perhaps there’s a job attribute or two that could be set to have the job terminated and removed from the queue without upsetting DAGMan or terminating any other jobs? Or is there some way to deliver a kill signal without involving condor_rm, perhaps with condor_ssh_to_job? I could whip up a little regression_rm script for the users.
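
To make the condor_ssh_to_job idea concrete, here is a rough, untested sketch of what such a wrapper might look like. The names regression_rm and build_kill_command are hypothetical. It assumes that condor_ssh_to_job is enabled on the execute nodes, that it accepts a remote command after the job ID, and that _CONDOR_JOB_PIDS is set inside the ssh session to the job's process IDs. I'd want to verify all of that against our HTCondor version. I also don't know yet whether DAGMan would treat the resulting signal exit as a node failure, so that part still needs testing.

```python
import re
import subprocess

# A job ID in cluster.proc form, e.g. "1234.17".
JOB_ID = re.compile(r"^\d+\.\d+$")


def build_kill_command(job_id, signal_name="TERM"):
    """Build the argv that would signal one job's processes via condor_ssh_to_job."""
    if not JOB_ID.match(job_id):
        raise ValueError(f"expected cluster.proc, got {job_id!r}")
    # Assumption: _CONDOR_JOB_PIDS is defined in the ssh session and holds the
    # job's PIDs. It is left unexpanded here so the remote shell expands it.
    remote_command = f"kill -{signal_name} $_CONDOR_JOB_PIDS"
    return ["condor_ssh_to_job", job_id, remote_command]


def regression_rm(job_id, signal_name="TERM"):
    """Run the kill command for one test job and return the exit status."""
    # No local shell is involved, so the job ID cannot inject extra commands.
    return subprocess.call(build_kill_command(job_id, signal_name))
```

A user would then call something like regression_rm("1234.17") for the one runaway test, and condor_rm never enters the picture.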

 

Thanks for any suggestions or ideas you might have!

 

 

Michael V Pelletier

Principal Engineer


C: +1 339.293.9149
michael.v.pelletier@xxxxxxx


Raytheon Technologies

Information Technology

50 Apple Hill Drive

Tewksbury, MA 01876-1198

 
