
Re: [HTCondor-users] how to drain offline nodes ?



Hi Frederic,

Locally, we have the START expression reference an attribute that is calculated based on the outcome of periodic startd cron tasks.  This way, if the health check hasn't run, the attribute is missing, and hence we can keep the node idle.
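
A minimal sketch of what such a configuration might look like. The job name (HEALTH), the script path, the period, and the attribute name (NODE_IS_HEALTHY) are all assumptions chosen for illustration, not the actual local setup:

```
# Hypothetical startd cron job named HEALTH; the script path is an assumption.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/libexec/node_health_check
STARTD_CRON_HEALTH_PERIOD = 300
STARTD_CRON_HEALTH_MODE = Periodic

# NODE_IS_HEALTHY is a hypothetical attribute published by the script.
# The =?= operator evaluates to False (rather than UNDEFINED) while the
# attribute is missing, so the slot stays idle until the check has reported.
# If the node already has a START policy, AND this clause onto it.
START = (NODE_IS_HEALTHY =?= True)
```

Because the gate lives in the node's own configuration, it applies from the first startd start-up after a long downtime, without needing to reach the node with condor_drain first.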

It's a good way to wait for things like CRLs, a puppet run, or successful mount of the SE.
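
As a rough illustration, a health check along those lines could be a small script whose standard output the startd reads as ClassAd attributes. Everything specific here is an assumption: the attribute name NODE_IS_HEALTHY, the CRL location and the 24-hour freshness limit, the /mnt/se mount point, and the use of file modification times as a proxy for "CRLs are up to date". A puppet check could be added the same way.

```python
#!/usr/bin/env python3
"""Hypothetical node health check intended to run as a startd cron job."""
import glob
import os
import time

# All three values below are assumptions; adjust them to the site.
CRL_GLOB = "/etc/grid-security/certificates/*.r0"
MAX_CRL_AGE = 24 * 3600          # seconds
REQUIRED_MOUNTS = ["/mnt/se"]    # e.g. the storage element mount


def crls_are_fresh(pattern, max_age, now=None):
    """True if at least one CRL file matches and none is older than max_age."""
    now = time.time() if now is None else now
    paths = glob.glob(pattern)
    if not paths:
        return False
    return all(now - os.path.getmtime(p) <= max_age for p in paths)


def mounts_present(paths):
    """True if every path in the list is currently a mount point."""
    return all(os.path.ismount(p) for p in paths)


def classad_lines(healthy):
    """Lines to print: one attribute, then '-' to mark the end of the update."""
    value = "True" if healthy else "False"
    return ["NODE_IS_HEALTHY = " + value, "-"]


def main():
    healthy = (crls_are_fresh(CRL_GLOB, MAX_CRL_AGE)
               and mounts_present(REQUIRED_MOUNTS))
    print("\n".join(classad_lines(healthy)))


if __name__ == "__main__":
    main()
```

If the script never runs, the attribute is never published. That is the case a START expression has to treat as "not ready".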

Would this help in your case?

Brian


On May 10, 2016, at 6:59 AM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:

Hi,

 

Let's say we've had a few nodes offline for a substantial amount of time.

We'd like to restart them now. But before they start processing jobs, we'd like to make sure x509 CRLs are updated (there's a 6-hour cron, but that's not an @boot cron), and to update the system/kernel and reboot the nodes on those new kernels.

Last time I tried to drain a node using condor_drain, I got an error telling me the node was offline (or unreachable, or something like that).

 

Question: what's the correct way to handle this situation?

I was told to put START=false in the startd configs, but that's not the correct way for me, as it requires starting up the nodes to change the configs; hence the nodes will likely eat and fail a few jobs before I manage to update all configs.

 

Any ideas (other than "reinstall" ;) )?

 

Thanks && regards
