
Re: [HTCondor-users] Drain HTCondor worker by setting instance metadata value

If you want to be able to run a command and force the STARTD to re-check the URL, you can set up your STARTD_CRON hook as a OneShot hook with reconfig_rerun enabled. STARTD_CRON hooks of this type run when the daemon starts up, and again on reconfig.
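As a sketch, the configuration might look like the following. The hook name DRAINCHECK, the script path, and the DrainRequested attribute the script is assumed to print are illustrative choices, not anything from your setup:

```
# Hypothetical OneShot hook, re-run on condor_reconfig.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DRAINCHECK
STARTD_CRON_DRAINCHECK_MODE = OneShot
STARTD_CRON_DRAINCHECK_RECONFIG_RERUN = True
STARTD_CRON_DRAINCHECK_EXECUTABLE = /usr/local/bin/drain_check.sh

# Fold the attribute the hook publishes into the START expression.
START = ($(START)) && (DrainRequested =!= True)
```

With this in place, running condor_reconfig on the node forces an immediate re-check of the metadata URL.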


If you are willing to actually run a daemon that does the checking, you could also consider running it as a 'continuous' cron hook. This is a type of STARTD_CRON hook that runs all the time and writes a new ClassAd to stdout whenever it wants to update the STARTD.


To do that, set STARTD_CRON_*_MODE to WaitForExit, and have your daemon write to stdout only when it wants the STARTD to change state. When your hook writes "- update:true" to stdout, the STARTD will act on the output even though your cron hook has not exited.
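A minimal sketch of such a daemon, assuming a hypothetical "drain" key in the OpenStack metadata JSON; the METADATA_CMD override and the DrainRequested attribute name are my own illustrative choices:

```shell
#!/bin/sh
# Sketch of a continuous STARTD_CRON hook run in WaitForExit mode.
# METADATA_CMD lets a test substitute the metadata fetch; by default
# it queries the standard OpenStack link-local metadata URL.

drain_state() {
    # Echo "True" when the metadata asks for draining, "False" otherwise.
    if ${METADATA_CMD:-curl -sf http://169.254.169.254/openstack/latest/meta_data.json} 2>/dev/null \
            | grep -q '"drain": *"true"'; then
        echo "True"
    else
        echo "False"
    fi
}

poll_loop() {
    last=""
    while :; do
        state=$(drain_state)
        if [ "$state" != "$last" ]; then
            # Publish a new ClassAd, then the marker line that tells the
            # STARTD to act on the output without waiting for the hook
            # to exit.
            echo "DrainRequested = $state"
            echo "- update:true"
            last="$state"
        fi
        sleep "${POLL_INTERVAL:-60}"
    done
}

# Entry point when installed as the actual hook:
if [ "${1:-}" = "run" ]; then
    poll_loop
fi
```

The hook itself would be wired in with something like STARTD_CRON_DRAINWATCH_MODE = WaitForExit and STARTD_CRON_DRAINWATCH_EXECUTABLE pointing at the script (names again hypothetical).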




See the STARTD_CRON documentation in the HTCondor manual for the syntax.




From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Sveinung Rundhovde
Sent: Sunday, September 3, 2017 12:37 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Drain HTCondor worker by setting instance metadata value



I am setting up a system with an HTCondor pool running on OpenStack. I am trying to create a mechanism for draining workers by setting a value in their instance metadata to true (metadata is made available to the VM via a URL). It should also be possible to make them start accepting jobs again by resetting this value.

So far I have come up with a couple of solutions that work, but not as well as I would like.

The first is to use job hooks: I set the PREPARE_JOB and JOB_EXIT hooks for the starter to a script that sets the START parameter to false if the metadata value is true. The script also spawns a daemon that checks the metadata regularly and sets START back to true once the metadata is set back to false.

There are, however, a few issues with this solution. The PREPARE_JOB hook is executed after the job has already arrived on the execute node, so even if START is set to false at that point, the job will still run. I was able to work around this by making the script return a nonzero exit value, which causes the job to be aborted (jobs are aborted if the PREPARE_JOB hook exits with a nonzero value). This works, but it is a bit "hacky", and jobs are sent to machines only to be aborted there.
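For concreteness, the abort-via-exit-code part of such a PREPARE_JOB hook could be sketched like this. The "drain" metadata key and the METADATA_CMD override are assumptions for illustration, and the argument guard exists only so the function can be exercised outside the hook:

```shell
#!/bin/sh
# Sketch of a PREPARE_JOB starter hook that refuses jobs while draining.
# METADATA_CMD lets a test substitute the metadata fetch; by default it
# queries the standard OpenStack link-local metadata URL.

drain_requested() {
    ${METADATA_CMD:-curl -sf http://169.254.169.254/openstack/latest/meta_data.json} 2>/dev/null \
        | grep -q '"drain": *"true"'
}

# Hook entry point: a nonzero exit from PREPARE_JOB makes the starter
# abort the job before it runs on this node.
if [ "${1:-}" = "hook" ]; then
    if drain_requested; then
        exit 1
    fi
    exit 0
fi
```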

Another solution I tried was a cron job that runs periodically and updates the START value. The issue here is the delay before draining actually starts. Of course the period can be set low, but that puts load on the system.
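For reference, the periodic setup I mean looks roughly like this (hook name, period, and script path are just placeholders):

```
# Hypothetical periodic STARTD_CRON hook; re-checks every 300 seconds.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DRAINPOLL
STARTD_CRON_DRAINPOLL_MODE = Periodic
STARTD_CRON_DRAINPOLL_PERIOD = 300
STARTD_CRON_DRAINPOLL_EXECUTABLE = /usr/local/bin/drain_check.sh
```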

Is there a better way to do this?