
[HTCondor-users] Fwd: Problems with HTCondor schedd or collector [tracking of submit machines]

I would like to add/reiterate that there seem to be problems with a couple of things, not just submit machines. If I run condor_status, I can see all condor machines and submit machines. If I run condor_status -master, I get only a very small subset of the machines listed by condor_status. Also, sometimes when I run condor_reconfig, I get an error that the master cannot be located, even though the services are running and condor_status picks up the machine. For example:
C:\Users\odonnellm>condor_reconfig igskbacbwscdrs3
Can't find address for master igskbacbwscdrs3.gs.doi.net
Perhaps you need to query another pool.

So it seems like this is an issue with the collector, but I am not really sure.


---------- Forwarded message ----------
From: O'Donnell, Michael <odonnellm@xxxxxxxx>
Date: Fri, May 31, 2013 at 6:37 AM
Subject: Problems with HTCondor schedd or collector [tracking of submit machines]
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael O'Donnell <odonnellm@xxxxxxxx>

I have a Windows pool running mostly HTCondor 7.8.7, and there seems to be a problem with the central manager tracking the submit machines. The schedd service is always running on these machines, but the central manager/collector cannot detect them after some time (there seems to be no pattern with machines or time). I am using a scheduled executable that runs every 30 minutes and tries to fix this problem, but I really need to find a better solution. The executable runs condor_restart -schedd and condor_reconfig, which corrects the problem temporarily, but this is not sustainable.

I posted about this earlier this week (see below) but basically I cannot find any error messages in log files on the submit machine or central manager.

Does anyone have any thoughts as to what I can do to figure out what is causing this problem?

thank you for the help,

May 29:
I am primarily using 7.8.7 on Windows within our HTCondor pool, and I am noticing that condor_status -daemon (e.g., -schedd, -master) is not reporting accurately. For example, if I run condor_status, I see all the machines/slots in the pool, but I do not see most of these machines when I run condor_status -master. When I run condor_status -schedd, I do not pick up all the condor submit machines within the pool. However, the schedd service is running on the submit machine and condor_q on the local machine is reporting accurately; I can also submit jobs. I do not see any errors in the collector log (on the central manager) or the schedd log (on the submit machines).
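
To make the mismatch easier to track over time, the comparison between plain condor_status and condor_status -master (or -schedd) could be scripted. The following is only a rough sketch, assuming Python 3.7+ is available on a machine that can query the pool and that every ad carries a Machine attribute; the helper names (parse_machines, query_machines, missing_from, report) are made up for illustration and are not part of HTCondor.

```python
import subprocess


def parse_machines(output):
    """Return the set of lower-cased host names, one per non-empty line."""
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def query_machines(extra_args=()):
    """Run condor_status and collect the Machine attribute of each ad.

    extra_args is e.g. ("-master",) or ("-schedd",); with no extra
    arguments condor_status returns the startd (slot) ads.
    """
    cmd = ["condor_status", *extra_args, "-format", "%s\n", "Machine"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_machines(result.stdout)


def missing_from(all_machines, daemon_machines):
    """Machines that have slot ads but no ad for the given daemon type."""
    return sorted(set(all_machines) - set(daemon_machines))


def report():
    """Print the machines whose master ad is absent from the collector."""
    startd_hosts = query_machines()
    master_hosts = query_machines(("-master",))
    for host in missing_from(startd_hosts, master_hosts):
        print("no master ad for", host)
```

Calling report() periodically and logging its output might show whether the same machines drop out each time or whether it really is random. Note that only machines running a startd can be checked for a missing master ad this way, and a submit-only machine that has dropped out of the collector will not show up in either list.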

Could there be something going on that I am missing, or is it possible this is a bug? I have noticed this problem for a little while, and right now I can usually (not always) fix it by running condor_restart -schedd. Everything else seems to be functioning as expected.

Any ideas how to troubleshoot? Thanks,