Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Removing nodes from pool?

Date: Tue, 11 Jul 2017 09:35:26 +0200
From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Subject: [HTCondor-users] Removing nodes from pool?

Good morning,

Our 400+ node HTCondor pool currently sees a lot of OOM conditions.
Apparently, the memory usage detected by the starter is far below the
jobs' actual memory consumption: the nodes constantly run out of swap,
and in a number of cases I can no longer connect to them.
At some point, the jobs fail on their own and enter the Hold state
(because no node matches their last memory footprint), and the
node is freed up for yet another greedy job.

I have no means to set START=False in between, so I cannot be sure
the node's OS itself has not been damaged. (Setting START
would require remote access to run condor_reconfig, which fails.)
Is there a way to remove a node from the pool from the side of the
master node? Most HPC schedulers have this, but for HTCondor I cannot
find such a feature - condor_drain comes close but still wants to talk
to the node (and apparently isn't graceful enough).

There must be a way to exclude rogue nodes from a pool. Any suggestions?


Thanks,
 Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Follow-Ups:
- Re: [HTCondor-users] Removing nodes from pool?
  - From: Fischer, Max (SCC)
