[HTCondor-users] Problems with HTCondor schedd or collector [tracking of submit machines]
- Date: Fri, 31 May 2013 06:37:56 -0600
- From: "O'Donnell, Michael" <odonnellm@xxxxxxxx>
- Subject: [HTCondor-users] Problems with HTCondor schedd or collector [tracking of submit machines]
I have a Windows pool running mostly HTCondor 7.8.7, and there seems to be a problem with the central manager tracking the submit machines. The schedd service is always running on these machines, but after some time the central manager/collector can no longer detect them (I see no pattern in which machines are affected or when). I am using a scheduled executable that runs every 30 minutes to try to fix this problem, but I really need to find a better solution. The executable runs condor_restart -schedd and condor_reconfig, which corrects the problem temporarily, but this is not sustainable.
I posted about this earlier this week (see below), but basically I cannot find any error messages in the log files on the submit machines or the central manager.
Does anyone have any thoughts as to what I can do to figure out what is causing this problem?
Thank you for the help,
I am primarily using 7.8.7 on Windows within our HTCondor pool, and I am noticing that condor_status with a daemon option (e.g., -schedd, -master) is not reporting accurately. For example, if I run condor_status, I see all the machines/slots in the pool, but I do not see most of these machines when I run condor_status -master. When I run condor_status -schedd, I do not pick up all the submit machines within the pool. However, the schedd service is running on the submit machines, and condor_q on the local machine reports accurately; I can also submit jobs. I do not see any errors in the collector log (on the central manager) or the schedd log (on the submit machines).
Could there be something going on that I am missing, or is it possible this is a bug? I have noticed this problem for a little while, and right now I can usually (not always) fix it by running condor_restart -schedd. Everything else seems to be functioning as expected.
Any ideas how to troubleshoot? Thanks,
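One way to pin down when a submit machine drops out of the collector's view is to periodically compare the schedd names the collector reports against a list of machines that are expected to be there. The following is only a rough sketch, not a tested tool: the function names are made up for illustration, it assumes condor_status is on the PATH, and it assumes each schedd ad's Name attribute matches the host name you list (this can differ depending on configuration).

```python
import subprocess


def parse_names(output):
    """Return the set of non-empty, lower-cased names, one per output line."""
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def missing_schedds(expected, advertised):
    """Return the expected submit machines that the collector is not listing."""
    advertised_lower = {name.lower() for name in advertised}
    return sorted(h for h in expected if h.lower() not in advertised_lower)


def query_collector_schedds():
    """Ask the collector for the Name attribute of every schedd ad.

    Assumption: the Name attribute is the host name; adjust the comparison
    if your pool advertises schedds under a different naming scheme.
    """
    result = subprocess.run(
        ["condor_status", "-schedd", "-format", "%s\n", "Name"],
        capture_output=True, text=True, check=True,
    )
    return parse_names(result.stdout)


def report(expected):
    """Print every expected submit machine that is absent from the collector."""
    for host in missing_schedds(expected, query_collector_schedds()):
        print("not advertised:", host)
```

Calling report() with the pool's own list of submit hosts from the existing 30-minute scheduled task, and logging the output with a timestamp, would at least show which machines disappear and when, which could then be lined up against the collector and schedd logs.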