
Re: [HTCondor-users] Defining `START` from external program



"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> wrote on 03/21/2016 12:07:37 PM:

> From: "Christopher J. Hanks" <christopher.hanks@xxxxxxxx>

> To: htcondor-users@xxxxxxxxxxx
> Date: 03/21/2016 12:10 PM
> Subject: [HTCondor-users] Defining `START` from external program
> Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
>
> Hello,
>
> I have a script which can determine whether a condor worker is correctly
> provisioned for accepting jobs (checks mount points, network
> availability,..).  I would like to periodically run this script and tie
> the resulting exit code to the `START` parameter.  In the past I have
> successfully used a cronjob with STARTD_ATTRS; however, this requires all
> submit files to appropriately put the parameter in their requirements
> list.
>
> Is this possible?  Can anyone point me to documentation for this?

You're looking for the "startd_cron" functionality, with examples starting on page 526 of the manual.

Using this, you can define a periodic job that can produce a ClassAd as its output which is then incorporated into the slot's machine ad. I have a number of such checks. Separating them out into multiple jobs works best for me, that way each probe is kept very short and sweet, and only the ones which actually need it run as root.

To stop jobs based on it: for example, I have a script which runs "ipmitool chassis status" (via a setuid Perl script), looks for disk failures, a power or cooling fault, or an overload, and then sets a "ChassisFault" boolean accordingly. In addition, the START expression gets modified as follows:

START = $(START) && (ChassisFault =!= True)

... and then if a fan, power supply, or disk fails, or the system starts to overheat, the machine stops accepting jobs because ChassisFault becomes true. A single error message attribute for the machine can be set with a series of nested ifThenElse() statements.
Currently all my dynamic slots run the probes, but there's probably a way I'm not thinking of to take the platform-health attributes from the partitionable slot and apply them to the child dynamic slots. In the meantime I'm just writing very efficient scripts.
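
For anyone who wants a starting point, here is a rough sketch of what such a probe could look like in Python rather than Perl. It is not my production script. The field names in FAULT_FIELDS are assumed from typical "ipmitool chassis status" output and may differ on your hardware, the job name CHASSIS and the path in the comments are hypothetical, and the function names are just for illustration. The probe prints one "ChassisFault = ..." line for the startd to merge into the machine ad.

```python
#!/usr/bin/env python3
"""Sketch of a startd_cron probe that reports a ChassisFault attribute.

Hypothetical configuration to go with it (job name and path are made up):
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) CHASSIS
    STARTD_CRON_CHASSIS_EXECUTABLE = /usr/local/libexec/chassis_probe.py
    STARTD_CRON_CHASSIS_PERIOD = 5m
"""
import subprocess

# Assumed field names; check them against the output on your own machines.
FAULT_FIELDS = (
    "Power Overload",
    "Main Power Fault",
    "Drive Fault",
    "Cooling/Fan Fault",
)


def parse_status(text):
    """Turn 'Name : value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def chassis_fault(text):
    """Return True if any of the assumed fault fields reads 'true'."""
    fields = parse_status(text)
    return any(fields.get(name, "").lower() == "true" for name in FAULT_FIELDS)


def classad_line(fault):
    """Format the attribute as a ClassAd assignment for the startd to read."""
    return "ChassisFault = {}".format("True" if fault else "False")


def main():
    try:
        result = subprocess.run(
            ["ipmitool", "chassis", "status"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # Print nothing if the probe itself fails. With
        # "ChassisFault =!= True" in START, an undefined attribute
        # does not block jobs.
        return
    print(classad_line(chassis_fault(result.stdout)))


if __name__ == "__main__":
    main()
```

Keeping the parsing in chassis_fault() separate from the ipmitool call means you can check the logic against saved output without needing root or a BMC on the test machine.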

        -Michael Pelletier.