Re: [HTCondor-users] Make runs fail?
Wes, my 2017 HTCondor Week presentation may be of interest with respect to Todd's suggestion #3:
>>Detecting & Managing Job Events & Progress with an HTCondor Update Job Info Hook<<
I'd have to put my hook_checkfile code through an IP review board before contributing it, which can take a while, but hopefully the presentation will be enough to give you some ideas on ways to catch the bad jobs early.
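For context, an update-job-info hook of the kind the presentation describes is wired up through the startd job-hook configuration on the execute node. The knob names below follow the standard HTCondor job-hook convention (a site-chosen keyword prefixed onto HOOK_UPDATE_JOB_INFO); the keyword CHECKFILE and the script path are illustrative placeholders, not Michael's actual configuration:

# Illustrative sketch only -- keyword and path are hypothetical.
# The startd invokes the named hook periodically with the job's
# ClassAd on its standard input.
STARTD_JOB_HOOK_KEYWORD = CHECKFILE
CHECKFILE_HOOK_UPDATE_JOB_INFO = /usr/local/libexec/hook_checkfile

The hook script can then parse the ClassAd it receives, inspect the job's scratch/output files, and take action (e.g. flag or hold a misbehaving job) well before the job would otherwise finish.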
Michael V. Pelletier
Digital Transformation & Innovation
Integrated Defense Systems
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Friday, October 19, 2018 11:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Make runs fail?
Some quick thoughts on the below --
1. Whenever I hear someone saying they are doing a parameter sweep from a probability distribution and want to throw out the jobs that run for a long time (or use a lot of memory, or whatever), alarm bells go off. Be aware that by removing these longer-running jobs, you may be introducing a significant statistical bias into your sample that could render your scientific results flawed! Proceed with caution....
2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold 1.10", which would free the cores and hopefully still keep PEST/YAMR/Panter from submitting another job with the same parameters?
Then once all your modeling work is done, you could simply condor_rm everything.
3. How do you go about telling if the jobs will be worthless to you after the first hour? I ask because perhaps it is something that can be easily automated... i.e., you can tell HTCondor to automatically condor_hold or condor_rm jobs that run for more than an hour...
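As a sketch of the automation in #3, assuming plain wall-clock runtime is the criterion (the one-hour threshold here is just an example), a periodic_hold expression in the submit description file would do it:

# Put a job on hold once it has been in the Running state
# (JobStatus == 2) for more than one hour (3600 seconds).
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 3600)
periodic_hold_reason = "Runtime exceeded one hour"

Using periodic_hold rather than periodic_remove keeps the job in the queue (as suggested in #2), so the submitting tool is less likely to resubmit the same parameters; swap in periodic_remove if you really do want the jobs gone. An administrator could enforce the same policy pool-wide with SYSTEM_PERIODIC_HOLD in the HTCondor configuration instead of per-submit-file.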
Hope the above thoughts help,