[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [HTCondor-users] condor (down) nodes status
- Date: Tue, 1 Jul 2014 14:20:20 +0000
- From: <andrew.lahiff@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor (down) nodes status
We just have a Nagios test on the central managers (using the Python API) that checks how many startds are known to the collector and whether this number is below a minimum threshold.
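A minimal sketch of such a check, assuming the htcondor Python bindings are available; the threshold values and the idea of counting unique Machine names are illustrative, not taken from the original setup:

```python
def count_startds(ads):
    """Count distinct machines among startd/slot ads. With the real bindings,
    ads would come from something like:
        htcondor.Collector().query(htcondor.AdTypes.Startd, projection=["Machine"])
    Here they are plain dicts so the logic is testable standalone."""
    return len({ad["Machine"] for ad in ads})

def evaluate_startd_count(num_startds, critical_min, warning_min=None):
    """Map a startd count to a Nagios exit code: 0=OK, 1=WARNING, 2=CRITICAL."""
    if num_startds < critical_min:
        return 2
    if warning_min is not None and num_startds < warning_min:
        return 1
    return 0
```

The same pattern extends to the health check below: count only ads whose health attribute is set, and compare against a second threshold.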
We also have a worker-node health-check script, running as a startd cron, which checks various essential things on the worker nodes (e.g. CVMFS, whether the local disk is fine, ...). The START expression takes this into account and only allows new jobs to start if the worker node is healthy. The above Nagios test also checks how many worker nodes are healthy and therefore allowed to run new jobs, and whether this is below a minimum threshold.
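A sketch of how such a startd cron and START expression could be wired up in the HTCondor configuration; the job name, script path, period, and the NODE_IS_HEALTHY attribute are all illustrative assumptions:

```
# Run a health-check script periodically from the startd (names are illustrative)
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/bin/healthcheck.sh
STARTD_CRON_HEALTH_PERIOD = 300

# The script is expected to print e.g. "NODE_IS_HEALTHY = True" on stdout,
# which is merged into the machine ad; only healthy nodes accept new jobs.
START = ($(START)) && (NODE_IS_HEALTHY =?= True)
```

Using `=?=` rather than `==` means a node whose ad lacks the attribute (script not yet run, or failed) is treated as unhealthy rather than undefined.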
Our Nagios checks for the negotiator, collectors and schedds are in the same repository.
From: SCHAER Frederic [frederic.schaer@xxxxxx]
Sent: Tuesday, July 01, 2014 1:33 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor (down) nodes status
Ah, not great… I guess I'd be able to work around that with a script parsing the history (though parsing classads might not be that easy for a newbie like me), or even just by building an auto-updated "nodes" file with puppet...
I'm wondering, though, how people debug batch issues if they can't even identify failing nodes from a batch-system point of view?
I guess people have monitoring scripts checking for the presence of a startd process (at least), and probably some other trivial things (but which ones?) in order to be sure the startd processes are correctly registered in the pool?
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Marc Volovic
Sent: Tuesday, July 01, 2014 12:42 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor (down) nodes status
You can see drained nodes with condor_status.
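For example, drained slots can be selected with a constraint such as `condor_status -constraint 'State == "Drained"'`. The same filter can be sketched with the Python bindings; the ad dicts below are illustrative stand-ins for real collector output:

```python
def drained_machines(ads):
    """Return the machines whose slot State is "Drained". With the real
    bindings, ads would come from something like:
        htcondor.Collector().query(htcondor.AdTypes.Startd,
                                   projection=["Machine", "State"])
    Here they are plain dicts so the filter is testable standalone."""
    return sorted({ad["Machine"] for ad in ads if ad.get("State") == "Drained"})
```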
For nodes that are down, that is a more difficult question; I'd do it using external means.
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of SCHAER Frederic
Sent: Tuesday, July 01, 2014 1:14 PM
Subject: [HTCondor-users] condor (down) nodes status
I'm used to Torque, which has a "pbsnodes -l" command that displays nodes that are down or drained.
Strangely, I can't find how to see this information in Condor: what would be the Condor way of finding it?
I'm sure this can become hard when the pool is dynamic, but even then there must be traces of nodes that belonged to the pool "one day", or in the last X days?