Re: [HTCondor-users] Defining `START` from external program
- Date: Mon, 21 Mar 2016 12:33:27 -0400
- From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Defining `START` from external program
wrote on 03/21/2016 12:07:37 PM:
> From: "Christopher J. Hanks" <christopher.hanks@xxxxxxxx>
> To: htcondor-users@xxxxxxxxxxx
> Date: 03/21/2016 12:10 PM
> Subject: [HTCondor-users] Defining `START` from
> Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
> I have a script which can determine whether a condor worker is correctly
> provisioned for accepting jobs (checks mount points, network
> availability,..). I would like to periodically run this script and use
> the resulting exit code to set the `START` parameter. In the past I
> successfully used a cronjob and STARTD_ATTRS; however, this requires all
> submit files to appropriately put the parameter in their requirements.
> Is this possible? Can anyone point me to documentation for this?
You're looking for the "startd_cron" functionality,
with examples starting on page 526 of the manual.
Using this, you can define a periodic job that produces
a ClassAd as its output, which is then incorporated into the slot's
machine ad. I have a number of such checks. Separating them out into multiple
jobs works best for me: that way each probe is kept very short and sweet,
and only the ones which actually need it run as root.
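
As a rough illustration of what one such periodic job looks like in the
configuration, here is a minimal sketch. The job name CHASSIS and the script
path are placeholders I made up for this example, not values from my actual
setup, and the period is arbitrary. The probe itself only has to print
"Attribute = value" lines to stdout.

```
# Sketch only: the job name CHASSIS and the executable path are hypothetical.
# Append the job to any startd cron jobs that are already defined.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) CHASSIS

# The probe script. It writes ClassAd attributes to stdout, for example:
#   ChassisFault = False
STARTD_CRON_CHASSIS_EXECUTABLE = /usr/local/libexec/condor/chassis_probe

# Run the probe every five minutes and merge its output into the machine ad.
STARTD_CRON_CHASSIS_MODE = Periodic
STARTD_CRON_CHASSIS_PERIOD = 300s

# If a previous run is still hanging when the next one is due, kill it.
STARTD_CRON_CHASSIS_KILL = True
```

Each additional check would get its own name in the job list and its own
set of STARTD_CRON_<name>_* settings, which is what keeps the probes small
and independent.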
To stop jobs based on it, I have a script which
runs ipmitool chassis status (via a setuid Perl script) and looks for disk
failures or a power, cooling, or overload fault, and then sets a "ChassisFault"
boolean accordingly. In addition, the START expression gets modified as follows:
START = $(START) && (ChassisFault =!= True)
... and then if a fan, power supply, or disk fails,
or the system starts to overheat, the machine stops accepting jobs because
ChassisFault becomes true. A single error message attribute for the machine
can be set with a series of nested ifThenElse() statements.
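
A sketch of what such a nested expression might look like follows. Only
ChassisFault comes from the description above. MountFault, NetworkFault, and
MachineErrorMessage are hypothetical names chosen for illustration, as are
the message strings. The =?= comparison is used so that a probe attribute
which has not been published yet counts as "no fault" rather than making
the whole expression undefined.

```
# Sketch only: MountFault, NetworkFault and MachineErrorMessage are
# hypothetical attribute names; ChassisFault is the one described above.
MachineErrorMessage = \
    ifThenElse(ChassisFault =?= True, "Chassis fault reported by IPMI", \
    ifThenElse(MountFault =?= True, "Required mount point is missing", \
    ifThenElse(NetworkFault =?= True, "Network check failed", \
    "OK")))

# Publish the expression in the machine ad so it shows up in condor_status.
STARTD_ATTRS = $(STARTD_ATTRS) MachineErrorMessage
```

The first fault that matches wins, so the order of the nesting decides which
message is reported when several checks fail at once.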
Currently all my dynamic slots run the probes, but
there's probably a way I'm not thinking of to take the platform-health
attributes from the partitionable slot and apply them to the child dynamic
slots. In the meantime I'm just writing very efficient scripts.