
[HTCondor-users] Removing nodes from pool?



Good morning,

our 400+ node HTCondor pool is currently seeing a lot of OOM conditions.
Apparently, the memory usage detected by the starter is far below what
the jobs actually consume: nodes are constantly running out of swap, and
in a number of cases I can no longer connect to them.
At some point the jobs fail on their own and enter the Hold state
(because no node matches their last recorded memory footprint), and the
node is freed up for yet another greedy job.

In between, I have no means to set START=False, so I cannot guarantee
that the node's OS itself has not been damaged. (Setting START would
require remote access to run condor_reconfig, and that access fails.)
Is there a way to remove a node from the pool from the master node's
side? Most HPC schedulers have such a feature, but I cannot find one
for HTCondor. condor_drain comes close, but it still wants to talk
to the node (and apparently isn't graceful enough).

There must be a way to exclude rogue nodes from a pool. Any suggestions?


Thanks,
 Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~