Re: [HTCondor-users] Make runs fail?
Wes, my 2017 HTCondor Week presentation may be of interest with respect to Todd's suggestion #3:
>>Detecting & Managing Job Events & Progress with an HTCondor Update Job Info Hook<<
I'd have to put my hook_checkfile code through an IP review board before contributing it, which can take a while, but hopefully the presentation will be enough to give you some ideas on ways to catch the bad jobs early.
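For context, an update-job-info hook of the kind the presentation describes is wired up through the startd job-hook configuration on the execute node. The knob names below follow the standard HTCondor job-hook convention (a site-chosen keyword prefixed onto HOOK_UPDATE_JOB_INFO); the keyword CHECKFILE and the script path are illustrative placeholders, not Michael's actual configuration:

# Illustrative sketch only -- keyword and path are hypothetical.
# The startd invokes the named hook periodically with the job's
# ClassAd on its standard input.
STARTD_JOB_HOOK_KEYWORD = CHECKFILE
CHECKFILE_HOOK_UPDATE_JOB_INFO = /usr/local/libexec/hook_checkfile

The hook script can then parse the ClassAd it receives, inspect the job's scratch/output files, and take action (e.g. flag or hold a misbehaving job) well before the job would otherwise finish.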
Michael V. Pelletier
Digital Transformation & Innovation
Integrated Defense Systems
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Friday, October 19, 2018 11:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Make runs fail?
Some quick thoughts on the below --
1. Whenever I hear someone saying they are doing a parameter sweep from a probability distribution and want to throw out the jobs that run for a long time (or use a lot of memory, or whatever), alarm bells go off. Be aware that by removing these longer-running jobs, you may be introducing a significant statistical bias into your sample that could render your scientific results flawed! Proceed with caution....
2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold 1.10", which would free the cores and hopefully still keep PEST/YAMR/Panter from submitting another job with the same parameters?
Then once all your modeling work is done, you could simply condor_rm everything.
3. How do you go about telling if the jobs will be worthless to you after the first hour? I ask because perhaps it is something that can be easily automated... i.e., you can tell HTCondor to automatically condor_hold or condor_rm jobs that run for more than an hour...
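As a sketch of the automation in #3, assuming plain wall-clock runtime is the criterion (the one-hour threshold here is just an example), a periodic_hold expression in the submit description file would do it:

# Put a job on hold once it has been in the Running state
# (JobStatus == 2) for more than one hour (3600 seconds).
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 3600)
periodic_hold_reason = "Runtime exceeded one hour"

Using periodic_hold rather than periodic_remove keeps the job in the queue (as suggested in #2), so the submitting tool is less likely to resubmit the same parameters; swap in periodic_remove if you really do want the jobs gone. An administrator could enforce the same policy pool-wide with SYSTEM_PERIODIC_HOLD in the HTCondor configuration instead of per-submit-file.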
Hope the above thoughts help,