
Re: [HTCondor-users] Make runs fail?

Wes, feel free to shoot me any questions you have about the presentation. The slides depend to a noticeable degree on the content of my live presentation, as you have seen, and I'm happy to fill you in on the spoken-word details that the slides don't encompass. If you can read the PowerPoint version, I think there may be some notes attached to the slides, though those may also be mildly to moderately impenetrable.

The gist is that you'd be able to tailor the Python script you're using into an update_job_info hook (with the cooperation of your pool administrator) that would be invoked eight seconds after startup and every five minutes thereafter during the job run. You'd then use "condor_chirp set_job_attr" in the script to set a job attribute such as "MassBalanceError" based on what you've read from the run's output with the Python script, and then set a "periodic_hold" expression to automatically hold the job if the error exceeds a certain threshold:

Periodic_hold = MassBalanceError > 100
Periodic_hold_reason = "Mass balance error too high"
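
For illustration, here is a minimal sketch of what the hook script might look like. The output file name "model.out" and the "PERCENT DISCREPANCY = ..." line format are assumptions made up for this example, so substitute whatever your model actually writes. It also assumes condor_chirp is on the PATH and usable from wherever your administrator runs the hook.

```python
import re
import subprocess

# Hypothetical pattern: assumes the model writes lines such as
# "PERCENT DISCREPANCY = 12.5" to its output file.
ERROR_PATTERN = re.compile(r"PERCENT DISCREPANCY\s*=\s*(-?\d+(?:\.\d+)?)")


def latest_mass_balance_error(text):
    """Return the absolute value of the last reported error, or None."""
    matches = ERROR_PATTERN.findall(text)
    return abs(float(matches[-1])) if matches else None


def chirp_command(value, attr="MassBalanceError"):
    """Build the condor_chirp call that stores the value in the job ad."""
    return ["condor_chirp", "set_job_attr", attr, repr(float(value))]


def main(path="model.out"):  # hypothetical output file name
    try:
        with open(path) as handle:
            error = latest_mass_balance_error(handle.read())
    except OSError:
        return  # output not written yet; try again on the next invocation
    if error is not None:
        # Only record the value; the periodic_hold expression decides what to do.
        subprocess.run(chirp_command(error), check=False)


if __name__ == "__main__":
    main()
```

Note that the script never holds or removes anything itself. It only records the number, and the Periodic_hold expression does the rest.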

You could also do this externally to the job, with your script calling condor_q to find each job's log file, and condor_qedit (or the HTCondor Python bindings) to manage the MassBalanceError (or whatever) attribute, perhaps submitted as a scheduler or local universe job along with the main job. I find the update_job_info hook is more tidy and more scalable, albeit requiring a pool admin to set a <tag>_HOOK_UPDATE_JOB_INFO value in the configuration which not everyone is able to arrange.
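
A rough sketch of the external variant follows, using the command-line tools rather than the Python bindings. It assumes the file you want to inspect is the job's standard output (the "Out" attribute, relative to "Iwd"), that paths contain no spaces, and that you pass in your own function (called read_error here, a hypothetical name) that returns the error value for a given file or None if there is nothing to report yet.

```python
import os.path
import subprocess


def query_command(owner):
    """Build a condor_q call printing job id, working directory and stdout file."""
    return ["condor_q", owner, "-autoformat", "ClusterId", "ProcId", "Iwd", "Out"]


def qedit_command(job_id, value, attr="MassBalanceError"):
    """Build the condor_qedit call that sets attr on one job (cluster.proc)."""
    return ["condor_qedit", job_id, attr, repr(float(value))]


def parse_queue(listing):
    """Turn the condor_q output into (job_id, output_path) pairs."""
    jobs = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        cluster, proc, iwd, out = fields
        # os.path.join keeps "out" unchanged when it is already absolute.
        jobs.append((cluster + "." + proc, os.path.join(iwd, out)))
    return jobs


def update_jobs(owner, read_error):
    """Set MassBalanceError on each of the owner's jobs that reports a value."""
    listing = subprocess.run(
        query_command(owner), capture_output=True, text=True, check=True
    ).stdout
    for job_id, path in parse_queue(listing):
        error = read_error(path)
        if error is not None:
            subprocess.run(qedit_command(job_id, error), check=False)
```

You would call update_jobs() on a timer from the scheduler or local universe job. As with the hook, it only edits the attribute and leaves the hold decision to the job's own policy expression.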

Either way, your supervisor script shouldn't take direct action on the job, since that introduces an array of failure modes and potential problems that make the script's code needlessly complicated, as well as separating the policy decisions from the job by removing them to the script where they're potentially less obvious. The script should only set attributes in the job and let the job's policy expressions direct the actions of the starter daemon.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Kitlasten, Wesley via HTCondor-users
Sent: Friday, October 19, 2018 2:17 PM
To: tannenba@xxxxxxxxxxx
Cc: Kitlasten, Wesley <wkitlasten@xxxxxxxx>; htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?

Hi Todd,

1) I too heard those alarm bells! The model is highly non-linear and it just does not converge for certain parameter combinations. As such, the "results" of the model with those parameter combinations are inherently flawed (i.e. no point in considering fluxes when your mass balance errors are 200%, right?). With a better understanding of the model I could potentially better inform the covariance matrix and avoid those combinations that fail to converge, but so far I have been unsuccessful.

2 and 3) These are helpful. I use a Python script to extract the elapsed time of the simulation and the number of iterations, then estimate how long the model will take to complete. I can likely use Michael's presentation to guide me further, although honestly I only grok about 60% or less of it!

Thank you.