I am setting up a system with an HTCondor pool running on OpenStack. I am trying to create a mechanism that drains workers when a value in their metadata (data made available to the VM via a URL) is set to true. It should also be possible to make them start accepting jobs again by resetting this value.
So far I have come up with a couple of solutions that work, but not as well as I would like.
The first is to use job hooks. I set the PREPARE_JOB and JOB_EXIT job hooks for the starter to a script that sets the START parameter to false if the metadata value is true. The script also spawns a daemon that checks the metadata regularly and sets START back to true once the metadata value is set back to false.
There are, however, a few issues with this solution. The PREPARE_JOB hook is executed after the job is already on the execute node, so even if START is set to false at this point the job will still run. I was able to work around this by making the script return a non-zero exit value, which causes the job to be aborted (jobs are aborted if the PREPARE_JOB hook has a non-zero exit value). This works, but it is a bit "hacky", and jobs will still be sent to machines and then aborted.
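To make the hook approach concrete, here is a minimal sketch of the exit-code part of such a PREPARE_JOB hook. It is not my exact script. It assumes the standard OpenStack metadata endpoint, where user-set keys appear as strings under "meta", and a hypothetical key named "drain". It also treats an unreachable metadata service as "not draining", which is a choice rather than a requirement.

```python
#!/usr/bin/env python3
"""Sketch of a PREPARE_JOB hook: refuse the job while the drain flag is set."""
import json
import sys
import urllib.request

# Assumption: standard OpenStack metadata endpoint; user keys live under "meta".
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"
DRAIN_KEY = "drain"  # hypothetical key name, not taken from any real setup


def is_draining(metadata: dict) -> bool:
    """True if the drain flag is present and set to 'true' (case-insensitive)."""
    value = metadata.get("meta", {}).get(DRAIN_KEY, "false")
    return str(value).strip().lower() == "true"


def hook_exit_code(metadata: dict) -> int:
    """A non-zero exit value makes HTCondor abort the job that was just sent."""
    return 1 if is_draining(metadata) else 0


def main() -> int:
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=5) as resp:
            metadata = json.load(resp)
    except (OSError, ValueError):
        return 0  # metadata unreachable or malformed: assume not draining
    return hook_exit_code(metadata)


if __name__ == "__main__":
    code = main()
    if code:
        sys.exit(code)  # only the draining case needs a non-zero status
```

The script would be referenced from the starter's PREPARE_JOB hook setting. It only decides whether the current job is rejected; changing START itself is a separate step.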
Another solution I tried was using cron jobs. I set one up to run periodically and update the START value. The issue here is that there will be a delay of up to one period before draining starts. Of course the period can be set low, but this will put load on the system.
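For reference, the periodic check could look roughly like the sketch below. It assumes the same OpenStack metadata endpoint and a hypothetical "drain" key, and it changes START through condor_config_val -rset followed by condor_reconfig, which in turn assumes runtime configuration is enabled on the worker (ENABLE_RUNTIME_CONFIG and a SETTABLE_ATTRS entry that allows START). By default it only prints the commands; pass --apply to actually run them.

```python
#!/usr/bin/env python3
"""Sketch of a periodic check that maps the drain flag onto START."""
import json
import subprocess
import sys
import urllib.request

# Assumption: standard OpenStack metadata endpoint; user keys live under "meta".
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"
DRAIN_KEY = "drain"  # hypothetical metadata key


def desired_start(metadata: dict) -> str:
    """Return the START expression implied by the metadata: False while draining."""
    value = metadata.get("meta", {}).get(DRAIN_KEY, "false")
    draining = str(value).strip().lower() == "true"
    return "False" if draining else "True"


def build_commands(start_value: str) -> list:
    """Commands that set START at runtime and make the daemons re-read it."""
    return [
        ["condor_config_val", "-startd", "-rset", f"START = {start_value}"],
        ["condor_reconfig"],
    ]


def main(apply: bool) -> None:
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=5) as resp:
            metadata = json.load(resp)
    except (OSError, ValueError) as err:
        print(f"metadata unavailable, leaving START unchanged: {err}", file=sys.stderr)
        return
    for cmd in build_commands(desired_start(metadata)):
        if apply:
            subprocess.run(cmd, check=True)
        else:
            print(" ".join(cmd))  # dry run: only show what would be executed


if __name__ == "__main__":
    main(apply="--apply" in sys.argv[1:])
```

Each run costs one HTTP request plus, when applied, two short-lived HTCondor commands, so the trade-off between drain latency and load is set entirely by how often cron invokes it.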
Is there a better way to do this?