
Re: [HTCondor-users] how to drain offline nodes ?



Hi All,

 

I'm back on this issue. I haven't tried Brian's suggestion yet, but I'm a bit concerned that preventing jobs from entering a node would require the node itself to "agree"/"decide it's OK".

Of course, I guess we can design a startd expression that will only be true when every single condition for production is met, but I like the idea of being able to administratively deny a host from running jobs, no matter what the cron scripts say (those scripts cannot know that I intend to physically move the server, for instance...).
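Just to illustrate what I mean (untested, and the attribute names are made up), such a startd expression would presumably end up looking something like this on the worker node:

# sketch only -- HealthCheckOK / CRLsFresh are hypothetical attributes
# that would have to be published by startd cron scripts
START = ($(START)) && (HealthCheckOK =?= True) && (CRLsFresh =?= True)

... but that still lives on the node itself and still depends on what the node "thinks", which is exactly my concern.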

 

I tried the DENY_WRITE method, and I couldn't get it to work.

 

I put this in my config (on both the schedd and the controller):

# cat /etc/condor/config.d/99_drained.config

DENY_WRITE  = $(DENY_WRITE), *@wn312.<<my domain - edited>>

 

In my config, I also have this:

/etc/condor/config.d/10_security.config:HOSTALLOW_WRITE =

/etc/condor/config.d/10_security.config:ALLOW_WRITE = $(CMS), $(CES), $(WNS)

/etc/condor/config.d/10_security.config:SCHEDD.ALLOW_WRITE = $(USERS), $(CES)

/etc/condor/config.d/10_security.config:SCHEDD.DENY_WRITE = nobody@$(UID_DOMAIN)

/etc/condor/config.d/10_security.config:SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,PASSWORD

 

WNS being defined as "wn*@my domain"

 

Despite this config, the node is full of jobs...?

I haven't tried the "NEGOTIATOR_SLOT_CONSTRAINT" way, but I don't really like the regex approach, as it will probably be hard to design anything that lets me simply run something like "condor_deny infringingnode.domain" (and the reverse, something to clear the deny).
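For the record, I imagine that route would look roughly like this on the central manager (untested, and the escaping/regexp is only a guess):

# sketch only: refuse to match any slot whose Machine name starts with wn312.
NEGOTIATOR_SLOT_CONSTRAINT = !regexp("^wn312\\.", Machine)

... and a hypothetical "condor_deny"/"condor_undeny" pair would then have to rewrite that expression and reconfig the negotiator each time, which feels fragile.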

 

Any idea what I may be doing wrong with DENY_WRITE? (Yes, I sent a condor_reconfig to the daemons.)
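I haven't yet double-checked that the running daemons actually see the new value; I suppose something like this would tell me, if I have the options right:

condor_config_val -schedd DENY_WRITE
condor_config_val -negotiator -name <central manager> DENY_WRITE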

 

Thanks

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On behalf of Brian Bockelman
Sent: Sunday, May 15, 2016 20:27
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] how to drain offline nodes ?

 

Hi Frederic,

 

Locally, we have the START expression reference an attribute that is calculated based on the outcome of periodic startd cron tasks.  This way, if the health check hasn't run, the attribute is missing - and hence we can keep the node idle.

 

It's a good way to wait for things like CRLs, a puppet run, or successful mount of the SE.
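From memory, the relevant bits of config look roughly like this (names simplified, so treat it as a sketch rather than a copy/paste recipe):

# run a health-check script periodically via startd cron
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/libexec/node_health.sh
STARTD_CRON_HEALTH_PERIOD = 5m
# the script prints ClassAd attributes, e.g. "NODE_IS_HEALTHY = True",
# only once all of its checks (CRLs, puppet, SE mount, ...) pass
START = ($(START)) && (NODE_IS_HEALTHY =?= True)

The =?= comparison is the important part: if the attribute is undefined (the check hasn't run, or the script refuses to publish it), START stays false.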

 

Would this help in your case?

 

Brian

Sent from my iPhone


On May 10, 2016, at 6:59 AM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:

Hi,

 

Let's say we've had a few nodes offline for a substantial amount of time.

We'd like to restart them now... But before they start processing jobs, we'd like to make sure the X.509 CRLs are updated (there's a 6-hour cron, but that's not an @reboot cron), and to update the system/kernel and reboot the nodes onto those new kernels...

Last time I tried to drain a node using condor_drain, I got an error telling me... the node was offline (or unreachable, or something like that).

 

Question: what's the correct way to handle this situation?

I was told to put START = false in the startd configs... but that's not the correct way for me, as it requires starting up the nodes to change the configs, so the nodes will likely eat and fail a few jobs before I manage to update all the configs...

 

Any ideas (other than: "reinstall" ;) )?

 

Thanks && regards

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/