
[HTCondor-users] Fwd: Problems with HTCondor schedd or collector [tracking of submit machines]



I would like to add/reiterate that there seem to be problems with a couple of things, not just submit machines. If I run condor_status, I can see all condor machines and submit machines. If I run condor_status -master, I get only a very small subset of the machines listed by condor_status. Also, sometimes when I run condor_reconfig, I get an error that the master cannot be located, even though the services are running and condor_status picks up the machine. For example:
C:\Users\odonnellm>condor_reconfig igskbacbwscdrs3
Can't find address for master igskbacbwscdrs3.gs.doi.net
Perhaps you need to query another pool.

So it seems like this is an issue with the collector, but I am not really sure.

thanks,
mike



---------- Forwarded message ----------
From: O'Donnell, Michael <odonnellm@xxxxxxxx>
Date: Fri, May 31, 2013 at 6:37 AM
Subject: Problems with HTCondor schedd or collector [tracking of submit machines]
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael O'Donnell <odonnellm@xxxxxxxx>


I have a Windows pool running mostly HTCondor 7.8.7, and there seems to be a problem with the central manager tracking the submit machines. The schedd service is always running on these machines, but the central manager/collector cannot detect them after some time (there seems to be no pattern with machines or time). I am using a scheduled executable that runs every 30 minutes and tries to fix this problem, but I really need to find a better solution. The executable runs condor_restart -schedd and condor_reconfig, which corrects the problem temporarily, but this is not sustainable.

I posted about this earlier this week (see below), but basically I cannot find any error messages in the log files on the submit machines or the central manager.

Does anyone have any thoughts as to what I can do to figure out what is causing this problem?

thank you for the help,
Mike

May 29:
I am primarily using 7.8.7 on Windows within our HTCondor pool, and I am noticing that condor_status with a daemon option (e.g., -schedd, -master) is not reporting accurately. For example, if I run condor_status, I see all the machines/slots in the pool, but I do not see most of these machines when I run condor_status -master. When I run condor_status -schedd, I do not pick up all the condor submit machines within the pool. However, the schedd service is running on the submit machines, and condor_q on the local machine reports accurately; I can also submit jobs. I do not see any errors in the collector log (on the central manager) or the schedd log (on the submit machines).
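One way to pin down exactly which machines drop out of the daemon listings is to compare the machine names reported by the three queries. The following is only a minimal sketch, assuming Python 3.7 or later is available on a machine that can query the pool and that condor_status is on the PATH. The helper names (parse_machines, query_machines, missing_from) are hypothetical and not part of HTCondor. Note that a machine missing from the -schedd list is only a problem if it is actually configured as a submit machine.

```python
import subprocess


def parse_machines(output):
    """Return the set of lower-cased machine names, one per non-empty line."""
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def query_machines(daemon_flag=None):
    """Run condor_status and collect the Machine attribute of each ad.

    daemon_flag is None for the default (startd) query, or a daemon
    option such as "-master" or "-schedd".
    """
    cmd = ["condor_status"]
    if daemon_flag:
        cmd.append(daemon_flag)
    # Print one Machine value per line for every ad returned.
    cmd += ["-format", "%s\n", "Machine"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_machines(result.stdout)


def missing_from(reference, other):
    """Return the sorted names that are in reference but not in other."""
    return sorted(reference - other)


if __name__ == "__main__":
    startd_machines = query_machines()
    for flag in ("-master", "-schedd"):
        absent = missing_from(startd_machines, query_machines(flag))
        print("Seen by condor_status but not by condor_status %s:" % flag)
        for name in absent:
            print("  " + name)
```

Running this on a schedule and logging the output with a timestamp might help show whether the ads disappear at a regular interval, which could hint at an update or expiry problem rather than a crash.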

Could there be something going on that I am missing, or is it possible that this is a bug? I have noticed this problem for a little while, and right now I can usually (not always) fix it by running condor_restart -schedd. Everything else seems to be functioning as expected.

Any ideas how to troubleshoot? Thanks,
mike