[HTCondor-users] Problems with HTCondor schedd or collector [tracking of submit machines]

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

I have a Windows pool running mostly HTCondor 7.8.7, and the central manager seems to have a problem tracking the submit machines. The schedd service is always running on these machines, but after some time the central manager/collector can no longer detect them (I see no pattern in which machines are affected or when). As a workaround, I run a scheduled executable every 30 minutes that tries to fix this problem, but I really need to find a better solution. The executable runs condor_restart -schedd and condor_reconfig, which corrects the problem temporarily, but this is not sustainable.

I posted about this earlier this week (see below), but basically I cannot find any error messages in the log files on the submit machines or the central manager.

Does anyone have any thoughts as to what I can do to figure out what is causing this problem?

thank you for the help,

Mike

May 29:

I am primarily using 7.8.7 on Windows within our HTCondor pool, and I am noticing that condor_status with the daemon options (e.g., -schedd, -master) is not reporting accurately. For example, if I run condor_status, I see all the machines/slots in the pool, but I do not see most of these machines when I run condor_status -master. When I run condor_status -schedd, I do not pick up all the submit machines within the pool. However, the schedd service is running on the submit machines and condor_q on the local machine reports accurately; I can also submit jobs. I do not see any errors in the collector log (on the central manager) or the schedd log (on the submit machines).
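
To make the discrepancy easier to track over time, the comparison described above could be scripted. The following is only a minimal sketch, not something I have running in the pool: the host names in EXPECTED_SUBMIT_MACHINES are hypothetical placeholders, and it assumes condor_status is on the PATH and that the schedd ads carry the usual Machine attribute.

```python
import subprocess


def parse_machines(output):
    """Return the set of lower-cased, non-empty host names, one per output line."""
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def missing_machines(expected, reported):
    """Return the expected hosts (sorted, lower-cased) absent from the reported set."""
    expected_set = {name.strip().lower() for name in expected if name.strip()}
    return sorted(expected_set - reported)


def query_collector(daemon_flag):
    """Ask the collector for the Machine attribute of every ad of one daemon type."""
    # The format string is passed as a literal backslash-n; condor_status expands it.
    result = subprocess.run(
        ["condor_status", daemon_flag, "-format", "%s\\n", "Machine"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_machines(result.stdout)


if __name__ == "__main__":
    # Hypothetical host names; replace with the pool's real submit machines.
    EXPECTED_SUBMIT_MACHINES = ["submit1.example.com", "submit2.example.com"]
    reported = query_collector("-schedd")
    for host in missing_machines(EXPECTED_SUBMIT_MACHINES, reported):
        print("schedd ad missing from collector:", host)
```

Run on a schedule with its output logged along with a timestamp, a check like this might show whether the schedd ads disappear at regular intervals or after particular events.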

Could there be something going on that I am missing, or is it possible this is a bug? I have noticed this problem for a little while, and right now I can usually (not always) fix it by running condor_restart -schedd. Everything else seems to be functioning as expected.

Any ideas how to troubleshoot? Thanks,

mike
