
Re: [HTCondor-users] HTCondor and condor_ganglia issues



Hi,


I have the following issues that I need help with.


About my setup:

I have a Ganglia gmetad that handles the regular metrics (CPU, memory, etc.) sent by the gmond instances on the execute nodes. This part is fine. I now wish to add HTCondor to the same gmetad, and I need help. This gmetad is on the same host as the collector, so I enabled condor_gangliad on this host (gmetad, collector and condor_gangliad are all on the same host).


A)

GangliadLog has the following set of lines repeating; a clip is pasted below. What is the my_popenv error about?


my_popenv: Failed to exec “gstat, errno=2 (No such file or directory)
Failed to execute “gstat --mpifile --all --gmond_ip=127.0.0.1 --gmond_port=8649”: No such file or directory
Got 329 daemon ads
Heartbeats sent: 0
Starting update...
Heartbeats sent: 0


When I run the gstat command, it shows output as below:


[roo@ce ~]# gstat --all --mpifile --gmond_ip=127.0.0.1 --gmond_port=8649

wn06.my.domain:128
wn05.my.domain:128
wn04.my.domain:128
wn03.my.domain:128
wn02.my.domain:128
wn01.my.domain:128
wn08.my.domain:64
wn07.my.domain:64
localhost.localdomain:8


B)

Is condor_gangliad a routine "data source" for Ganglia's gmetad? What should the "data_source" declaration in gmetad.conf be?

I have a gmond that listens on port 8649 for the metrics from the execute nodes. The host running the collector itself appears as "localhost" (see above). I tried to understand this from the tutorial video at https://research.cs.wisc.edu/htcondor/tutorials/videos/2014/Ganglia.html, but I could not read the Ganglia screen shown in the video.


Thanks

Nagaraj




On 7/28/21 3:14 AM, John M Knoeller wrote:

That sounds like something outside of HTCondor is starting one of those condor_gangliad processes.   

What is the parent PID of each? Perhaps we can track back from there...

I don't really know what gstat is; let me ask around and see if any of my colleagues know.

-tj


From: pn <pn@xxxxxxxxxxx>
Sent: Tuesday, July 27, 2021 11:54 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor and condor_ganglia issues
 
More about condor_gangliad process:

I stopped condor (systemctl stop), and after that condor_gangliad was
still there. I then killed it and restarted condor after adding
GANGLIAD to DAEMON_LIST. Sure enough, condor_gangliad was one of the
processes. But strangely, less than a second later, a second
condor_gangliad appeared.

[root@simclu-ce ~]# ps -ea|grep gangliad
2592326 ?        00:00:00 condor_gangliad
2592334 ?        00:00:00 condor_gangliad

Would it be because I have a wrong configuration?
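
For reference, the DAEMON_LIST change described above is usually a one-line addition. This is a minimal sketch, assuming the local configuration extends the existing list with the $(DAEMON_LIST) macro rather than redefining it; the file it goes in is site-specific:

```
# Assumption: appended in a local config file read after the main one.
# The daemon name is GANGLIAD (not GANGLIA_D).
DAEMON_LIST = $(DAEMON_LIST) GANGLIAD
```

With the list shown earlier in this thread, this would expand to MASTER COLLECTOR NEGOTIATOR SCHEDD GANGLIAD, so the master should start exactly one condor_gangliad.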

Secondly, GangliadLog has this error:

07/27/21 21:40:23 my_popenv: Failed to exec “gstat, errno=2 (No such
file or directory)
07/27/21 21:40:23 Failed to execute “gstat --all --mpifile
--gmond_ip=192.168.55.79 --gmond_port=8652”: No such file or directory

What file is it complaining about? I replaced "gstat" with "/bin/gstat"
and the error shows up again "Failed to exec "/bin/gstat, .."
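
One possible reading of the log, offered as a guess rather than a confirmed diagnosis: the name printed after "Failed to exec" begins with a typographic quotation mark (“gstat), which would happen if the configured command itself were wrapped in such quotes, so the program being looked up is literally named “gstat. A sketch of the setting without any quotation marks follows. It assumes the command is set through the GANGLIA_GSTAT_COMMAND configuration variable, which this thread does not name, and it reuses the path and arguments quoted above:

```
# Assumption: gstat lives at /bin/gstat, as mentioned above.
# No quotation marks (straight or typographic) around the value.
GANGLIA_GSTAT_COMMAND = /bin/gstat --all --mpifile --gmond_ip=192.168.55.79 --gmond_port=8652
```

If the quotation marks came from copying the command out of a web page or word processor, retyping the line by hand is the simplest way to rule this out.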

-
Nagaraj




On 2021-07-27 21:15, John M Knoeller wrote:
> I'm not sure why the condor_gangliad would be running if you did not
> add it to your daemon list.   But the error is because you need to put
> GANGLIAD in your daemon list not GANGLIA_D.
>
>  Instructions for how to handle the case where gmetad is on a
> different machine than the condor_collector are here
>
>  Monitoring — HTCondor Manual 9.1.0 documentation [1]
>
>  -tj
>
> -------------------------
>
> FROM: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of
> Nagaraj Panyam <pn@xxxxxxxxxxx>
> SENT: Tuesday, July 27, 2021 6:34 AM
> TO: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> SUBJECT: [HTCondor-users] HTCondor and condor_ganglia issues
>
> Hi,
>
> I am trying to configure HTCondor's Ganglia monitoring. In that
> context, I see something I do not understand.
>
> Firstly, I see the process condor_gangliad even though it is not in
> DAEMON_LIST (config_val_dump shows DAEMON_LIST = MASTER COLLECTOR
> NEGOTIATOR SCHEDD). Is this expected?
>
> Secondly, when I specifically add GANGLIA_D to DAEMON_LIST in the condor
> config file, the error given below shows up in MasterLog. Where do I
> add the executable path? We have CONDOR_VERSION = 8.9.13
>
>> GANGLIA_D is in the DAEMON_LIST parameter, but there is no
>> executable path for it defined in the config files!
>> ERROR "Must have the path to GANGLIA_D defined." at line 1606 in
>> file
>>
> /var/lib/condor/execute/slot1/dir_19111/userdir/.tmp9djsO9/BUILD/condor-8.9.13/src/condor_master.V6/masterDaemon.cpp
>
> Thirdly, after resolving the above issues, what is the scheme to hook up
> HTCondor's monitoring to an existing Ganglia? We will have
> condor_gangliad on the same machine as the Collector, and Ganglia's gmetad
> running on a different host.
>
> Thanks
>
> Nagaraj
>
>
>
> Links:
> ------
> [1]
> https://htcondor.readthedocs.io/en/latest/admin-manual/monitoring.html?highlight=gangliad#ganglia