
Re: [HTCondor-users] Grafana + HTCondor Job Metrics



The config file looks fine, you just need to specify it as an argument to the probe on the command line, e.g.

./condor_probe.py --test ../etc/condor-probe.cfg


(I included the --test flag, which will cause it to run once and output extra debug information without sending the data to Graphite.)

Regards,
Kevin


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Monday, February 27, 2017 12:11 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Grafana + HTCondor Job Metrics
 
Hi Kevin,

I modified the etc/condor-probe.cfg file, and that error is what I got as a result. Could the way I modified the file be the issue?

[probe]
interval = 240
retries = 10
delay = 30
test = false
[graphite]
enable = true
host = xxxxxxxxxxxxxxx
port = 2004
namespace = clusters.mypool
meta_namespace = probes.condor-mypool

[influxdb]
enable = true
host = xxxxxxxxxxxxxxxxxxxxxxxx
port = 8086
username = xxxxxxxxxxxxxxxxx
password = xxxxxxxxxxxxxxxxxxx

[condor]
pool = localhost
post_pool_status = true
post_pool_slots = true
post_pool_glideins = false
post_pool_prio = false
post_pool_jobs = true
use_gsi_auth = false
X509_USER_CERT = ""
X509_USER_KEY = ""


Thanks,

Uchenna


On Mon, Feb 27, 2017 at 12:53 PM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

You'll need to pass it a config file. See etc/condor-probe.cfg for an example, and documentation in the README or at https://fifemon.github.io/probes/

Sorry for the terrible error message, I opened an issue to fix that.

Regards,
Kevin



From: HTCondor-users <htcondor-users-bounces@cs.wisc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Monday, February 27, 2017 11:19 AM

To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Grafana + HTCondor Job Metrics
 
Thanks Luke.

Kevin, After installing condor-python successfully and updating the condor-probes config file I did:

(venv2.0)[xxxxxxxxxxxxxx@xxxxxxxxxx bin]$ ./condor_probe.py
Traceback (most recent call last):
  File "./condor_probe.py", line 147, in <module>
    opts = get_options()
  File "./condor_probe.py", line 118, in get_options
    'pool':              config.get("condor", "pool"),
  File "/usr/lib64/python2.6/ConfigParser.py", line 556, in get
    raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'condor'

What could be the issue?

Thanks,

Uchenna


On Mon, Feb 27, 2017 at 11:04 AM, L Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:
Hi Uchenna,

I see you are already using virtualenv for some isolation.
If you would be happy to move to miniconda instead, you can try [1].
The recipe uses Python 2.7 within the condor environment as well as the 8.4.2 Python bindings.
The channel also has versions 8.4.3, 8.4.10, 8.4.11 and 8.6.

Cheers,
Luke


[1]
bash miniconda.sh -b -p <where you want your conda installation to be>
export PATH=<path to miniconda>/bin:$PATH

# it is important that you use conda-forge for the boost package
# as the default does not come with libboost_python
conda create -n condor python=2.7
source activate condor
conda install -c kreczko -c conda-forge htcondor-python=8.4.2
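
# optional sanity check (not part of the original recipe; assumes the "condor"
# environment is still active): the bindings should import and print their version
python -c "import htcondor, classad; print(htcondor.version())"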

On 27 February 2017 at 14:40, Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx> wrote:
Hi Kevin,

I am only able to use the machines that have condor-8.4.2 installed. The machines with later versions of condor are my personal machines. Do you have another suggestion?

Thanks,

Uchenna

On Sat, Feb 25, 2017 at 9:55 AM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

You can run the probe on any machine that can talk to your central manager (it doesn't have to be part of the pool itself); e.g., we monitor six or so pools from a dedicated VM. The version of Condor installed on that machine doesn't have to be the same as what your pool is running (there are probably some caveats). So my suggestion is to run a separate "monitoring" host (this can be a VM at first; at a minimum I'd recommend 2 cores, 4 GB RAM, and 200 GB of disk, SSD if possible). Unless you keep it hidden, you'll probably soon find people saying "hey, that's cool, can I send telemetry for my service X there?", so plan to scale up.


A quick test to see if you can monitor a pool from a host would be to run 'condor_status -pool <cm host>'. If that works the probe should work.
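
For example (the hostname here is just a placeholder):

condor_status -pool cm.example.gov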


Regards,

Kevin




From: HTCondor-users <htcondor-users-bounces@xxxxxxc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Friday, February 24, 2017 9:37 PM

To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Grafana + HTCondor Job Metrics
 
I'm currently using condor-8.4.2 on EL6, and I see condor-python is for condor-8.4.9. On my other machines I am able to install condor-python because they have a later version of Condor. What do you suggest I do for my machines that have condor-8.4.2 installed?

On Fri, Feb 24, 2017 at 9:40 PM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

Where did you install condor from? It's in the official YUM repos (http://research.cs.wisc.edu/htcondor/yum/).

It's available for EL6 and 7 (we are currently running on SL6).
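
So, roughly (a sketch, assuming the repository linked above has been added to the EL6 machine):

sudo yum install condor-python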


Regards,

Kevin





From: HTCondor-users <htcondor-users-bounces@xxxxxxc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Friday, February 24, 2017 8:09 PM

To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Grafana + HTCondor Job Metrics
 
Hi Kevin,

I tried installing that initially, but YUM could not find any package named "condor-python". Is that package available for CentOS 6, or is it only for CentOS 7?

Thanks,

Uchenna

On Fri, Feb 24, 2017 at 4:57 PM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

You need to have the HTCondor Python bindings installed ("condor-python" package).


Regards,

Kevin




From: HTCondor-users <htcondor-users-bounces@xxxxxxc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Friday, February 24, 2017 3:49 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Grafana + HTCondor Job Metrics
 
Hi Kevin,

I cannot use the systemd unit file because I am using CentOS 6, not CentOS 7. I tried running the condor_probe.py script, and this is the error I got. In fact, this is what I keep getting throughout my logs:

"Traceback (most recent call last):
  File "./condor_probe.py", line 12, in <module>
    import condor
  File "/directory/homedir/name/probes/bin/condor/__init__.py", line 4, in <module>
    from .status import get_pool_status
  File "/directory/homedir/name/probes/bin/condor/status.py", line 5, in <module>
    import classad
ImportError: No module named classad"

I'm not sure what you mean by the "--once" flag, but I still wasn't able to run the script with './condor_probe.py'.

Thanks,
Uchenna


On Fri, Feb 24, 2017 at 2:51 PM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

Hi, author of Fifemon here.  Do you really want to run Supervisor and the probes as root? We run as much as possible in user space. 

See http://supervisord.org/running.html#runtime-security.


If you don't want to deal with supervisor, you can instead run the probe with cron by passing the '--once' flag to 'condor_probe.py'.
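
A crontab entry along these lines would do it (a sketch; the install path and log file are placeholders, and the 4-minute period matches the 240-second interval in the example config):

*/4 * * * * cd /opt/probes && ./venv/bin/python bin/condor_probe.py --once etc/condor-probe.cfg >> /var/log/condor-probe.log 2>&1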

Or, here's a simple systemd unit file you could use (untested; change user, group, and command as appropriate): https://gist.github.com/retzkek/0dcd7b19548de96531bd5362c51ee2a6
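
For reference, a unit along those lines looks roughly like this (just a generic sketch, not the gist contents; the user and paths are placeholders):

[Unit]
Description=Fifemon HTCondor probe
After=network.target

[Service]
User=fifemon
WorkingDirectory=/opt/probes
ExecStart=/opt/probes/venv/bin/python /opt/probes/bin/condor_probe.py /opt/probes/etc/condor-probe.cfg
Restart=on-failure

[Install]
WantedBy=multi-user.target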

An init script is left as an exercise.


Please let me know if you have any other questions or issues!


Regards,

Kevin Retzke





From: HTCondor-users <htcondor-users-bounces@xxxxxxc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Friday, February 24, 2017 1:20 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Grafana + HTCondor Job Metrics
 

Hello,

Has anyone configured Grafana to display HTCondor job/slot metrics? I found this online, but my supervisord isn't working. I get this error: "Supervisord is running as root and it is searching". I tried killing any running supervisord processes, but I still get the same error. Then I tried unlinking with 'sudo unlink /tmp/supervisor.sock', but I get the notice "unlink: cannot unlink '/tmp/supervisor.sock': No such file or directory".

If you know of a better software/method, I will be happy to look into it.


Reference:

Fifemon

Collect HTCondor statistics and report them into a time-series database. All modules support Graphite, and there is some support for InfluxDB.

Additionally, report select job and slot ClassAds into Elasticsearch via Logstash.

Note: this is a fork of the scripts used for monitoring the HTCondor pools at Fermilab; while generally intended to be "generic" for any pool, it may still require some tweaking to work well for your pool.

Copyright Fermi National Accelerator Laboratory (FNAL/Fermilab). See LICENSE.txt.

Requirements

For current job and slot state:

Installation

Assuming HTCondor and Python virtualenv packages are already installed:

cd $INSTALLDIR
git clone https://github.com/fifemon/probes
cd probes
virtualenv --system-site-packages venv
source venv/bin/activate
pip install supervisor influxdb

Optionally, for crash mails:

pip install superlance

Configuration

Condor metrics probe

Example probe config is in etc/condor-probe.cfg:

[probe]
interval = 240   # how often to send data in seconds
retries = 10     # how many times to retry condor queries
delay = 30       # seconds to wait between retries
test = false     # if true, data is output to stdout and not sent downstream
once = false     # run one time and exit, i.e. for running with cron (not recommended)

[graphite]
enable = true                           # enable output to graphite
host = localhost                        # graphite host
port = 2004                             # graphite pickle port
namespace = clusters.mypool             # base namespace for metrics
meta_namespace = probes.condor-mypool   # namespace for probe metrics

[influxdb]
enable = false       # enable output to influxdb (not fully supported)
host = localhost     # influxdb host
port = 8086          # influxdb api port
db = test            # influxdb database
tags = foo:bar       # extra tags to include with all metrics (comma-separated key:value)

[condor]
pool = localhost            # condor pool (collector) to query
post_pool_status = true     # collect basic daemon metrics
post_pool_slots = true      # collect slot metrics
post_pool_glideins = false  # collect glidein-specific metrics
post_pool_prio = false      # collect user priorities
post_pool_jobs = false      # collect job metrics
use_gsi_auth = false        # set true if collector requires authentication
X509_USER_CERT = ""         # location of X.509 certificate to authenticate to condor with
X509_USER_KEY = ""          # private key

Supervisor

Example supervisor config is in etc/supervisord.conf; it can be used as-is for basic usage. It requires some modification to enable crash mails or to report job and slot details to Elasticsearch (via Logstash).
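
A [program] entry for the probe looks roughly like this (a sketch with placeholder paths; the shipped etc/supervisord.conf is the authoritative example):

[program:condor-probe]
command=/opt/probes/venv/bin/python /opt/probes/bin/condor_probe.py /opt/probes/etc/condor-probe.cfg
directory=/opt/probes
autorestart=true
redirect_stderr=true
stdout_logfile=/opt/probes/log/condor-probe.log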

Job and slot state

The scripts that collect raw job and slot records into Elasticsearch are much simpler than the metrics probe: simply point them at your pool with --pool and JSON records are output to stdout. We use Logstash to pipe the output to Elasticsearch; see etc/logstash-fifemon.conf.

Running

Using supervisor:

cd $INSTALLDIR/probes
source venv/bin/activate

If using influxdb:

export INFLUXDB_USERNAME=<username>
export INFLUXDB_PASSWORD=<password>

Start supervisor:

supervisord
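
Once supervisord is up, the usual supervisor tooling will show whether the probe process stays running, e.g.:

supervisorctl status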

Thanks,

Uchenna 

--
*********************************************************
  Dr Lukasz Kreczko           
  Research Associate
  Department of Physics
  Particle Physics Group

  University of Bristol

  HH Wills Physics Lab
  University of Bristol
  Tyndall Avenue
  Bristol
  BS8 1TL


  +44 (0)117 928 8724 
  
  A top 5 UK university with leading employers (2015)
  A top 5 UK university for research (2014 REF)
  A world top 40 university (QS Ranking 2015)
*********************************************************

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/