
Re: [HTCondor-users] Make runs fail?

Wes, my 2017 HTCondor Week presentation may be of interest with respect to Todd's suggestion #3:

>>Detecting & Managing Job Events & Progress with an HTCondor Update Job Info Hook<<

I'd have to go through an IP review board to contribute my hook_checkfile code, which can take a while, but hopefully the presentation will be enough to give you some ideas on ways to catch the bad jobs early.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Friday, October 19, 2018 11:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Make runs fail?


Some quick thoughts on the below --

1. Whenever I hear someone saying they are doing a parameter sweep from a probability distribution and want to throw out the jobs that run for a long time (or use a lot of memory, or whatever), alarm bells go off.  Be aware that by removing these longer-running jobs, you may be introducing a significant statistical bias into your sample that could render your scientific results flawed!  Proceed with caution....

2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold 1.10", which would free the cores and hopefully still keep PEST/YAMR/PANTHER from submitting another job with the same parameters?
Then once all your modeling work is done, you could simply condor_rm everything.

3. How do you go about telling if the jobs will be worthless to you after the first hour?  I ask because perhaps it can be something that is easily automated... i.e., you can tell HTCondor to automatically condor_hold or condor_rm jobs that run for more than an hour...
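For example, if a simple wall-clock cutoff is good enough, a periodic_hold expression in the submit file can do this automatically (a sketch, assuming a one-hour limit; adjust the 3600 to taste):

    # Hold any job that has been running (JobStatus == 2) for more
    # than 3600 seconds since it entered the running state.
    periodic_hold = (JobStatus == 2) && ((time() - EnterCurrentStatus) > 3600)
    periodic_hold_reason = "runtime exceeded one hour"

The same idea can be applied pool-wide with SYSTEM_PERIODIC_HOLD in the condor_config on the submit machine, rather than in each submit file.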

Hope the above thoughts help,