
Re: [HTCondor-users] Removing nodes from pool?



Hi Steffen,

Have a look at condor_advertise. The docs describe precisely the functionality you are after:
http://research.cs.wisc.edu/htcondor/manual/current/condor_advertise.html
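
A minimal sketch of what that could look like, assuming the approach from the manual page: a small query ClassAd is handed to condor_advertise together with the INVALIDATE_STARTD_ADS command, run on a host that the collector authorizes for such updates. The hostname node123.example.com and the file name invalidate.ad are placeholders, and matching on Machine (rather than Name, as in the manual's example) is my assumption so that all slots of the node are covered. The script only builds the ad text and the command line; it does not run anything.

```python
import shlex


def build_invalidate_ad(hostname):
    """Return the text of a query ClassAd that matches the startd ads of one host."""
    # Assumption: matching on Machine instead of Name, so that every slot
    # advertised by the host is covered. The hostname is a placeholder.
    lines = [
        'MyType = "Query"',
        'TargetType = "Machine"',
        'Requirements = Machine == "{}"'.format(hostname),
    ]
    return "\n".join(lines) + "\n"


def build_command(ad_path):
    """Return the argument list that asks the collector to drop the matching ads."""
    return ["condor_advertise", "INVALIDATE_STARTD_ADS", ad_path]


if __name__ == "__main__":
    # Print the ad and the command instead of executing them, so the
    # result can be reviewed before it touches a live pool.
    print(build_invalidate_ad("node123.example.com"), end="")
    print(" ".join(shlex.quote(arg) for arg in build_command("invalidate.ad")))
```

As far as I know, this only removes the ad from the collector. If the startd on the node is still alive enough to send its periodic update, the ad will reappear, so treat this as a stopgap until you can reach the node again.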

Cheers,
Max

> On 11.07.2017 at 09:35, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
> 
> Good morning,
> 
> our 400+ node HTCondor pool currently sees a lot of OOM conditions.
> Apparently, the memory in use as detected by the starter is way below the
> actual memory consumption by the jobs - I'm constantly running out
> of swap, and in a number of cases cannot connect to the nodes any longer.
> At some point, the jobs will fail on their own, and enter Hold state
> (because there's no node matching the last memory footprint) - and the
> node will be freed up for yet another greedy job.
> 
> I have no means to set START=False in between, thus I cannot guarantee
> the node didn't suffer from damage to the OS itself. (Setting START
> would require remote access to run condor_reconfig, which fails.)
> Is there a way to remove a node from the pool from the side of the
> master node? Most HPC schedulers have it, but for HTCondor I cannot
> find such a feature - condor_drain is close but still wants to talk
> to the node (and apparently isn't graceful enough).
> 
> There must be a way to exclude rogue nodes from a pool. Any suggestions?
> 
> 
> Thanks,
> Steffen
> 
> 
> -- 
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1
> D-14476 Potsdam-Golm
> Germany
> ~~~
> Fon: +49-331-567 7274
> Fax: +49-331-567 7298
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~
