Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?
- Date: Fri, 19 Oct 2018 14:17:15 -0700
- From: "Kitlasten, Wesley" <wkitlasten@xxxxxxxx>
- Subject: Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?
I have identified the jobs I want to stop. I use condor_hold cluster.process, which frees up cores and allows other processes to start. I assume the newly started processes are new parameter sets, but I am having a hard time confirming this (I do not have access to any files on the nodes).
Since I am using the vanilla universe, the job is not killed as it would be in the standard universe. The only solution I can come up with (until I move on to something more complex as time allows) is to wait until every parameter set has been submitted and then condor_rm the jobs individually. If I condor_rm before all sets have been submitted, the old/faulty sets just get resubmitted. Am I on the right track?
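For what it's worth, the wait-then-remove step could be scripted roughly like this (a sketch only, not the author's actual workflow: cluster id 123 is a placeholder, and it assumes condor_rm is on the PATH; held jobs have JobStatus 5 in the ClassAd):

```python
# Sketch: once every parameter set is in the queue, remove the jobs that
# were parked with condor_hold, matching them by JobStatus == 5 (Held)
# so good jobs in the same cluster are left alone.
import subprocess

HELD = 5  # JobStatus ClassAd value for "Held"

def held_constraint(cluster_id):
    """ClassAd constraint selecting the held jobs in one cluster."""
    return "ClusterId == {} && JobStatus == {}".format(cluster_id, HELD)

def remove_held(cluster_id):
    """condor_rm every held job in the cluster in a single call."""
    subprocess.check_call(
        ["condor_rm", "-constraint", held_constraint(cluster_id)])

# Usage, once all parameter sets have been submitted:
#   remove_held(123)
```

Using a -constraint match rather than removing jobs one at a time also avoids a race with jobs changing state mid-loop.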
Thanks for the help
Wes, feel free to shoot me any questions you have about the presentation. The slides depend to a noticeable degree on the content of my live presentation, as you have seen, and I'm happy to fill you in on the spoken-word details that the slides don't encompass. If you can read the PowerPoint version, I think there may be some notes attached to the slides, though those may also be mildly to moderately impenetrable.
The gist is that you'd be able to tailor the Python script you're using into an update_job_info hook (with the cooperation of your pool administrator) that would be invoked eight seconds after startup and every five minutes thereafter during the job run. You'd then use "condor_chirp set_job_attr" in the script to set a job attribute such as "MassBalanceError" based on what you've read from the run's output with the Python script, and then set a "periodic_hold" expression to automatically hold the job if the error exceeds a certain threshold:
Periodic_hold = MassBalanceError > 100
Periodic_hold_reason = "Mass balance error too high"
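The hook script body might look something like this (hedged: "model.lst" and the regex are placeholders for whatever your model actually writes, and condor_chirp must be available in the job's execute environment):

```python
# Sketch of an update_job_info hook: read the model's output listing,
# extract the latest mass balance error, and publish it into the job
# ClassAd with condor_chirp so the periodic_hold expression can see it.
import re
import subprocess

def parse_mass_balance_error(listing_text):
    """Return the last reported mass balance error, or None if absent."""
    matches = re.findall(r"MASS BALANCE ERROR\s*=\s*(-?[\d.]+)", listing_text)
    return float(matches[-1]) if matches else None

def publish_to_job_ad(value):
    """Push the value into the job ad; periodic_hold fires if it's too big."""
    subprocess.check_call(
        ["condor_chirp", "set_job_attr", "MassBalanceError", str(value)])

# Usage inside the hook:
#   err = parse_mass_balance_error(open("model.lst").read())
#   if err is not None:
#       publish_to_job_ad(err)
```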
You could also do this externally to the job, with your script calling condor_q to find each job's log file, and condor_qedit (or the HTCondor Python bindings) to manage the MassBalanceError (or whatever) attribute, perhaps submitted as a scheduler- or local-universe job alongside the main job. I find the update_job_info hook tidier and more scalable, although it requires a pool admin to set a <tag>_HOOK_UPDATE_JOB_INFO value in the configuration, which not everyone is able to arrange.
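The external variant with the Python bindings would be on these lines (a sketch under the assumptions that the htcondor bindings are installed and that MassBalanceError is the attribute name chosen above):

```python
# Sketch: edit a queued job's ad from outside the job via the schedd.
def job_id(cluster, proc):
    """Format a cluster.proc id the way condor_qedit and the bindings expect."""
    return "{}.{}".format(cluster, proc)

def set_mass_balance_error(cluster, proc, value):
    """Write the attribute into the job's ad; periodic_hold reads it there."""
    import htcondor  # deferred so the module loads without the bindings
    schedd = htcondor.Schedd()
    schedd.edit([job_id(cluster, proc)], "MassBalanceError", str(float(value)))

# Usage: set_mass_balance_error(123, 0, 210.5)
```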
Either way, your supervisor script shouldn't take direct action on the job: that introduces an array of failure modes that make the script's code needlessly complicated, and it moves the policy decisions out of the job and into the script, where they're less obvious. The script should only set attributes in the job and let the job's policy expressions direct the actions of the starter daemon.
Michael V. Pelletier
Digital Transformation & Innovation
Integrated Defense Systems
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Kitlasten, Wesley via HTCondor-users
Sent: Friday, October 19, 2018 2:17 PM
Cc: Kitlasten, Wesley <wkitlasten@xxxxxxxx>; htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?
1) I too heard those alarm bells! The model is highly non-linear and it simply does not converge for certain parameter combinations. As such, the "results" of the model with those parameter combinations are inherently flawed (i.e. no point in considering fluxes when your mass balance errors are 200%, right?). With a better understanding of the model I could potentially better inform the covariance matrix and avoid the combinations that fail to converge, but so far I have been unsuccessful.
2 and 3) These are helpful. I use a python script to extract the elapsed time of the simulation and the number of iterations, then estimate how long the model will take to complete. I can likely use Michael's presentation to guide me further, although honestly I only grok about 60% or less of it!
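The extrapolation step of such a script might be as simple as this (a minimal sketch, assuming the elapsed wall time and iteration counts have already been parsed from the model's output; the parsing itself is model-specific and not shown):

```python
# Sketch: estimate how long the run has left, assuming each iteration
# costs roughly the same amount of wall time.
def estimate_remaining_seconds(elapsed, iterations_done, iterations_total):
    """Linear extrapolation of remaining run time from iteration progress."""
    if iterations_done <= 0:
        return None  # nothing to extrapolate from yet
    per_iteration = elapsed / iterations_done
    return per_iteration * (iterations_total - iterations_done)

# Example: 10 of 30 iterations took 100 s, so about 200 s remain.
```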
United States Geological Survey
2730 N. Deer Run Road
Carson City, NV 89701