
Re: [HTCondor-users] HTCondor and condor_ganglia issues



Hi,

I am running as user root.

I did try using the full path to gstat. It is /bin/gstat, as installed by the rpm.
Here is the clipping from GangliadLog:

07/28/21 22:14:28 Starting update...
07/28/21 22:14:28 my_popenv: Failed to exec “/bin/gstat, errno=2 (No such file or directory)
07/28/21 22:14:28 Failed to execute “/bin/gstat --mpifile --all --gmond_ip=127.0.0.1 --gmond_port=8649”: No such file or directory
07/28/21 22:14:28 Got 318 daemon ads
07/28/21 22:14:28 Heartbeats sent: 0
07/28/21 22:14:48 Starting update...
07/28/21 22:14:48 Heartbeats sent: 0
07/28/21 22:15:08 Starting update...
07/28/21 22:15:08 Heartbeats sent: 0


Apart from that, I see that "my_popenv: Failed to exec" appears in a few other error reports related to HTCondor. For example,
https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-November/msg00143.shtml
I do not know if any of those apply to my case.
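
One detail in the log clipping may be worth checking: a stray non-ASCII character appears immediately before /bin/gstat in the logged program name. If that character is really part of the configured value (an assumption; nothing in this thread confirms it), the exec would be looking for a file whose name begins with that character, which would fail with errno=2 even though /bin/gstat exists. A minimal Python sketch for inspecting a command string follows. The sample value is hypothetical, and both helper names are made up for illustration.

```python
def suspicious_chars(value):
    """Return (index, char, codepoint) for every non-ASCII character in value."""
    return [(i, ch, f"U+{ord(ch):04X}") for i, ch in enumerate(value) if ord(ch) > 127]


def program_name(command):
    """Return the first whitespace-separated token of the command string."""
    return command.split()[0]


# Hypothetical value, as it might look if pasted with typographic quotes.
cmd = "\u201c/bin/gstat --mpifile --all --gmond_ip=127.0.0.1 --gmond_port=8649\u201d"

print(repr(program_name(cmd)))   # the token that would be treated as the program
for index, char, codepoint in suspicious_chars(cmd):
    print(index, repr(char), codepoint)
```

The string to inspect could be taken from the output of `condor_config_val GANGLIA_GSTAT_COMMAND`. An empty list from `suspicious_chars` would rule this explanation out.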

-
Nagaraj



On 2021-07-28 21:34, John M Knoeller wrote:
I wonder if the path of your interactive shell is unusual.   (are you
really running commands as the user roo?)

try running this command

      which gstat

 What does it return?

 You could try configuring the GANGLIA_GSTAT_COMMAND to have the full
path to the gstat command by adding something like this to your condor
configuration.

     GANGLIA_GSTAT_COMMAND=/path/to/gstat --all --mpifile --gmond_ip=localhost --gmond_port=8649

 -tj

-------------------------

FROM: Nagaraj Panyam <pn@xxxxxxxxxxx>
SENT: Wednesday, July 28, 2021 8:11 AM
TO: John M Knoeller <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List
<htcondor-users@xxxxxxxxxxx>
SUBJECT: Re: [HTCondor-users] HTCondor and condor_ganglia issues

Hi,

I have the following issues that I need help with.

About my setup:

I have a Ganglia gmetad that handles the regular metrics (cpu, mem,
etc.) sent by the gmond daemons on the execute nodes. This part is
fine. I now wish to add HTCondor to the same gmetad, and I need help.
This gmetad is on the same host as the collector, so on this host I
enabled condor_gangliad (gmetad, collector and condor_gangliad are all
on the same host).

A)

GangliadLog has the following set of lines repeating. A clip is pasted
below. What is the my_popenv error about?

my_popenv: Failed to exec “gstat, errno=2 (No such file or directory)
Failed to execute “gstat --mpifile --all --gmond_ip=127.0.0.1 --gmond_port=8649”: No such file or directory
Got 329 daemon ads
Heartbeats sent: 0
Starting update...
Heartbeats sent: 0

When I run the gstat command, it shows output as below:

[roo@ce ~]# gstat --all --mpifile --gmond_ip=127.0.0.1 --gmond_port=8649

wn06.my.domain:128
wn05.my.domain:128
wn04.my.domain:128
wn03.my.domain:128
wn02.my.domain:128
wn01.my.domain:128
wn08.my.domain:64
wn07.my.domain:64
localhost.localdomain:8

B)

Is condor_gangliad a routine "data source" for Ganglia's gmetad? What
should the "data_source" declaration in gmetad.conf be?

I have gmond that listens on 8649 for the metrics from the execute
nodes. The host running collector itself appears as "localhost" (see
above). I tried to understand from this tutorial video at
https://research.cs.wisc.edu/htcondor/tutorials/videos/2014/Ganglia.html
[2] but I could not read the Ganglia screen shown in the video.

Thanks

Nagaraj

On 7/28/21 3:14 AM, John M Knoeller wrote:

That sounds like something outside of HTCondor is starting one of
those condor_gangliad processes.

What is the parent PID of each?  perhaps we can track back from
there...
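
One way to collect the parent PIDs is to parse the output of `ps -eo pid,ppid,comm`. The following is a minimal Python sketch, assuming a Linux procps-style `ps`. The function names are made up for illustration.

```python
import subprocess


def parents_of(ps_output, name):
    """Parse `ps -eo pid,ppid,comm` text and return {pid: ppid} for processes named `name`."""
    result = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and any malformed lines.
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        if parts[2].strip() == name:
            result[int(parts[0])] = int(parts[1])
    return result


def live_parents(name="condor_gangliad"):
    """Run ps on the local machine and return {pid: ppid} for the named process."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parents_of(out, name)
```

Calling `live_parents()` on the collector host would return a mapping from each condor_gangliad PID to its parent PID. A parent that is not condor_master would be consistent with something outside HTCondor starting the extra process.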

I don't really know what gstat is, let me ask around and see if any
of my colleagues know.

-tj

-------------------------

FROM: pn <pn@xxxxxxxxxxx>
SENT: Tuesday, July 27, 2021 11:54 AM
TO: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
CC: John M Knoeller <johnkn@xxxxxxxxxxx>
SUBJECT: Re: [HTCondor-users] HTCondor and condor_ganglia issues

More about condor_gangliad process:

I stopped condor (systemctl stop), and after that condor_gangliad was
still there. I then killed it, and restarted condor after adding
GANGLIAD to DAEMON_LIST. Sure enough, condor_gangliad was one of the
processes. But strangely, less than a second later, a second
condor_gangliad appeared.

[root@simclu-ce ~]# ps -ea|grep gangliad
2592326 ?        00:00:00 condor_gangliad
2592334 ?        00:00:00 condor_gangliad

Would it be because I have a wrong configuration?

Secondly, GangliadLog has this error:

07/27/21 21:40:23 my_popenv: Failed to exec “gstat, errno=2 (No such file or directory)
07/27/21 21:40:23 Failed to execute “gstat --all --mpifile --gmond_ip=192.168.55.79 --gmond_port=8652”: No such file or directory

What file is it complaining about? I replaced "gstat" with
"/bin/gstat"
and the error shows up again "Failed to exec "/bin/gstat, .."

-
Nagaraj

On 2021-07-27 21:15, John M Knoeller wrote:
I'm not sure why the condor_gangliad would be running if you did
not
add it to your daemon list.   But the error is because you need to
put
GANGLIAD in your daemon list not GANGLIA_D.
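
The MasterLog message quoted in this thread ("there is no executable path for it defined in the config files") suggests that each name in DAEMON_LIST needs a configuration entry of the same name giving the daemon's executable path. A minimal Python sketch of that check follows. It models the configuration as a plain dict, the paths are hypothetical, and it is not how condor_master itself implements the lookup.

```python
def daemons_missing_path(config):
    """Return the names in DAEMON_LIST that have no entry of the same name in config."""
    names = config.get("DAEMON_LIST", "").replace(",", " ").split()
    return [name for name in names if name not in config]


# Hypothetical, simplified configuration; all paths are illustrative only.
config = {
    "DAEMON_LIST": "MASTER COLLECTOR NEGOTIATOR SCHEDD GANGLIA_D",
    "MASTER": "/usr/sbin/condor_master",
    "COLLECTOR": "/usr/sbin/condor_collector",
    "NEGOTIATOR": "/usr/sbin/condor_negotiator",
    "SCHEDD": "/usr/sbin/condor_schedd",
    "GANGLIAD": "/usr/sbin/condor_gangliad",
}

print(daemons_missing_path(config))  # ['GANGLIA_D']
```

With GANGLIAD in the list instead of GANGLIA_D, the same check returns an empty list, because an entry with that name exists.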

Instructions for how to handle the case where gmetad is on a
different machine than the condor_collector are here:

Monitoring - HTCondor Manual 9.1.0 documentation [1]

-tj

-------------------------

FROM: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on
behalf of
Nagaraj Panyam <pn@xxxxxxxxxxx>
SENT: Tuesday, July 27, 2021 6:34 AM
TO: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
SUBJECT: [HTCondor-users] HTCondor and condor_ganglia issues

Hi,

I am trying to configure HTCondor's Ganglia monitoring. In that
context, I see something I do not understand.

Firstly, I see the process condor_gangliad even though it is not in
the DAEMON_LIST (config_val_dump shows DAEMON_LIST = MASTER COLLECTOR
NEGOTIATOR SCHEDD). Is this expected?

Secondly, when I specifically add GANGLIA_D to DAEMON_LIST in the
condor config file, the error given below shows up in MasterLog. Where
do I add the executable path? We have CONDOR_VERSION = 8.9.13

GANGLIA_D is in the DAEMON_LIST parameter, but there is no executable path for it defined in the config files!
ERROR "Must have the path to GANGLIA_D defined." at line 1606 in file /var/lib/condor/execute/slot1/dir_19111/userdir/.tmp9djsO9/BUILD/condor-8.9.13/src/condor_master.V6/masterDaemon.cpp

Thirdly, after resolving the above issues, what is the scheme to hook
up HTCondor's monitoring to an existing Ganglia? We will have
condor_gangliad on the same machine as the Collector, and Ganglia's
gmetad running on a different host.

Thanks

Nagaraj



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


Links:
------
[1]
https://htcondor.readthedocs.io/en/latest/admin-manual/monitoring.html?highlight=gangliad#ganglia
[2] https://research.cs.wisc.edu/htcondor/tutorials/videos/2014/Ganglia.html