
Re: [HTCondor-users] Make runs fail?



OK, I think I understand your problem now. The jobs are being submitted and monitored by an external program which doesn't consider a condor_hold or condor_rm'd job to be complete, and will resubmit it, thinking that it has gone missing.

If that's the case, the question is how the submitting software decides whether a job needs to be resubmitted, and whether that criterion can be changed or extended. If the submitter code is looking at a job attribute, and we can change which attribute it looks at, then we can define that attribute as an expression which evaluates to something appropriate whether the run failed on its own or was cut short as non-convergent.
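
For example, a minimal sketch (the attribute name RunIsFinished is hypothetical, and which states count as "done" would need to match what your submitter actually checks): the submit description file can inject a custom ClassAd attribute whose value is an expression over the job's own status, so a single attribute answers "is this run over, for any reason?":

    # Hypothetical attribute added to the submit file. JobStatus 3 is
    # Removed, 4 is Completed, 5 is Held; treat all three as finished.
    +RunIsFinished = (JobStatus == 3 || JobStatus == 4 || JobStatus == 5)

The submitter would then poll that one attribute instead of JobStatus itself, something along these lines:

    # While the job is still in the queue:
    condor_q <cluster>.<proc> -af RunIsFinished
    # After it leaves the queue (removed/completed jobs land in the history):
    condor_history <cluster>.<proc> -af RunIsFinished

Again, just a sketch; the right expression depends on what the submitter is looking at today.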

Does that make sense?

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Kitlasten, Wesley via HTCondor-users
Sent: Friday, October 19, 2018 5:58 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Kitlasten, Wesley <wkitlasten@xxxxxxxx>
Subject: Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?

Clarification:

The only solution I can come up with (until I move on to something more complex as time allows) is to wait until every parameter set has been submitted and then condor_rm the jobs individually... with a "sabotage node" on my local machine that forces the held and removed jobs to fail (yuck). If I condor_rm before all sets have been submitted and don't sabotage, the old/faulty sets just get resubmitted. Am I missing something? ...aside from the time and experience to pursue the proper approach!
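
For what it's worth, a rough sketch of that stopgap as a single pass rather than removing jobs one at a time, assuming the faulty sets are the ones sitting in the Held state (the constraint is illustrative):

    # Once every parameter set has been submitted, remove all
    # currently-held jobs in one command instead of individually:
    condor_rm -constraint 'JobStatus == 5'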

-- 
Wes Kitlasten
United States Geological Survey
2730 N. Deer Run Road
Carson City, NV 89701
(775) 887-7711