
Re: [HTCondor-users] Grafana + HTCondor Job Metrics



Hi Kevin,

I cannot use the systemd unit file because I am on CentOS 6, not CentOS 7. I tried running the condor_probe.py script and this is the error I got; in fact, this is what I keep getting throughout my logs:

"Traceback (most recent call last):
 File "./condor_probe.py", line 12, in <module>
ÂÂÂ import condor
 File "/directory/homedir/name/probes/bin/condor/__init__.py", line 4, in <module>
ÂÂÂ from .status import get_pool_status
 File "/directory/homedir/name/probes/bin/condor/status.py", line 5, in <module>
ÂÂÂ import classad
ImportError: No module named classad"
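
For what it's worth, here is roughly how I understand the classad bindings could be checked and installed; the package names below are guesses on my part, not something from the Fifemon docs:

python -c "import classad; print(classad.__file__)"   # fails the same way if the bindings are not visible
sudo yum install condor-python    # HTCondor's Python bindings RPM, if it is available for CentOS 6
pip install htcondor              # or the pip-packaged bindings, inside the virtualenv

Since the virtualenv is created with --system-site-packages, a system-wide condor-python install should be visible inside it.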

I'm not sure what you mean by the "--once" flag, but I still wasn't able to run the script by doing './condor_probe.py'.

Thanks,
Uchenna


On Fri, Feb 24, 2017 at 2:51 PM, Kevin Retzke <kretzke@xxxxxxxx> wrote:

Hi, author of Fifemon here. Do you really want to run Supervisor and the probes as root? We run as much as possible in user space.

See http://supervisord.org/running.html#runtime-security.
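
For illustration, one way to keep the probe itself off root is a per-program 'user' directive in supervisord.conf; the program name, command, and account below are just placeholders:

[program:condor-probe]
command=/home/monitor/probes/venv/bin/python condor_probe.py etc/condor-probe.cfg
directory=/home/monitor/probes
; run this program under an unprivileged account rather than root
user=monitor
autorestart=true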


If you don't want to deal with Supervisor, you can instead run the probe with cron by passing the '--once' flag to 'condor_probe.py'.
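
For example, a crontab entry along these lines would do it; the paths and the config-file argument are placeholders, adjust to your install:

# send one round of metrics every 4 minutes, matching interval=240 in the probe config
*/4 * * * * cd /home/monitor/probes && ./venv/bin/python condor_probe.py --once etc/condor-probe.cfg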

Or, here's a simple systemd unit file you could use (untested; change user, group, and command as appropriate): https://gist.github.com/retzkek/0dcd7b19548de96531bd5362c51ee2a6
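
Very roughly, such a unit looks like the following; this is an untested sketch with placeholder paths and command, not the contents of that gist:

[Unit]
Description=Fifemon HTCondor probe
After=network.target

[Service]
User=monitor
Group=monitor
WorkingDirectory=/home/monitor/probes
ExecStart=/home/monitor/probes/venv/bin/python condor_probe.py etc/condor-probe.cfg
Restart=always

[Install]
WantedBy=multi-user.target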

Init script is left as an exercise :)


Please let me know if you have any other questions or issues!


Regards,

Kevin Retzke





From: HTCondor-users <htcondor-users-bounces@cs.wisc.edu> on behalf of Uchenna Ojiaku - NOAA Affiliate <uchenna.ojiaku@xxxxxxxx>
Sent: Friday, February 24, 2017 1:20 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Grafana + HTCondor Job Metrics

Hello,

Has anyone configured Grafana to display HTCondor job/slot metrics? I found this (Fifemon, referenced below) online, but my supervisord isn't working. I get this error: "Supervisord is running as root and it is searching". I tried killing any running supervisord processes, but I still get the same error. Then I tried unlinking with 'sudo unlink /tmp/supervisor.sock', but I get this notice: "unlink: cannot unlink '/tmp/supervisor.sock': No such file or directory".
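
For reference, the commands I have been running are roughly the first two below; the third, pointing supervisord at an explicit config file with -c, is just my guess at what that "searching" message refers to:

ps aux | grep supervisord             # look for an already-running instance to kill
sudo unlink /tmp/supervisor.sock      # returns "No such file or directory"
supervisord -c etc/supervisord.conf   # start with an explicit config file instead of the default search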

If you know of a better software/method, I will be happy to look into it.


Reference:

Fifemon

Collect HTCondor statistics and report them into a time-series database. All modules support Graphite, and there is some support for InfluxDB.

Additionally, report select job and slot ClassAds into Elasticsearch via Logstash.

Note: this is a fork of the scripts used for monitoring the HTCondor pools at Fermilab, and while it is generally intended to be "generic" enough for any pool, it may still require some tweaking to work well for yours.

Copyright Fermi National Accelerator Laboratory (FNAL/Fermilab). See LICENSE.txt.

Requirements

For current job and slot state:

Installation

Assuming HTCondor and Python virtualenv packages are already installed:

cd $INSTALLDIR
git clone https://github.com/fifemon/probes
cd probes
virtualenv --system-site-packages venv
source venv/bin/activate
pip install supervisor influxdb

Optionally, for crash mails:

pip install superlance
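
For crash mails to actually be sent, supervisord.conf also needs a crashmail event listener section, roughly like this (the recipient address is a placeholder):

[eventlistener:crashmail]
; mail on any managed program exiting; -a = watch all programs, -m = recipient address
command=crashmail -a -m admin@example.com
events=PROCESS_STATE_EXITED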

Configuration

Condor metrics probe

Example probe config is in etc/condor-probe.cfg:

[probe]
interval = 240   # how often to send data in seconds
retries = 10     # how many times to retry condor queries
delay = 30       # seconds to wait between retries
test = false     # if true, data is output to stdout and not sent downstream
once = false     # run one time and exit, i.e. for running with cron (not recommended)

[graphite]
enable = true                           # enable output to graphite
host = localhost                        # graphite host
port = 2004                             # graphite pickle port
namespace = clusters.mypool             # base namespace for metrics
meta_namespace = probes.condor-mypool   # namespace for probe metrics

[influxdb]
enable = false       # enable output to influxdb (not fully supported)
host = localhost     # influxdb host
port = 8086          # influxdb api port
db = test            # influxdb database
tags = foo:bar       # extra tags to include with all metrics (comma-separated key:value)

[condor]
pool = localhost            # condor pool (collector) to query
post_pool_status = true     # collect basic daemon metrics
post_pool_slots = true      # collect slot metrics
post_pool_glideins = false  # collect glidein-specific metrics
post_pool_prio = false      # collect user priorities
post_pool_jobs = false      # collect job metrics
use_gsi_auth = false        # set true if collector requires authentication
X509_USER_CERT = ""         # location of X.509 certificate to authenticate to condor with
X509_USER_KEY = ""          # private key

Supervisor

An example supervisor config is in etc/supervisord.conf; it can be used as-is for basic usage. It requires some modification to enable crash mails or to report job and slot details to Elasticsearch (via Logstash).

Job and slot state

The scripts that collect raw job and slot records into Elasticsearch are much simpler than the metrics probe: simply point them at your pool with --pool and JSON records are written to stdout. We use Logstash to pipe the output to Elasticsearch; see etc/logstash-fifemon.conf.
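
As a rough illustration, the pipeline looks something like the line below; the script name is a placeholder (use whichever job/slot collection script ships in the repo), and only the --pool flag and the Logstash config file come from the description above:

# "current_jobs.py" is a placeholder name for the job-state script
./current_jobs.py --pool mypool.example.com | logstash -f etc/logstash-fifemon.conf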

Running

Using supervisor:

cd $INSTALLDIR/probes
source venv/bin/activate

If using influxdb:

export INFLUXDB_USERNAME=<username>
export INFLUXDB_PASSWORD=<password>

Start supervisor:

supervisord
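
Once supervisord is up, supervisorctl can be used to confirm the probe processes it manages are running:

supervisorctl status    # lists each managed program and its current state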

Thanks,

Uchenna 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/