Re: [Condor-users] When condor_restart from central manager worker node doesn't join pool or not rapidly
- Date: Thu, 29 Apr 2010 16:34:48 -0400
- From: Ian Chesal <ian.chesal@xxxxxxxxx>
- Subject: Re: [Condor-users] When condor_restart from central manager worker node doesn't join pool or not rapidly
On Thu, Apr 29, 2010 at 6:00 AM, michele pierri <pierm4ci@xxxxxxxx> wrote:
When I run condor_restart from the central manager, this happens:
1) Even after many minutes, condor_status -any lists only the worker node's DaemonMaster entry, not its machine (Job) entry.
2) After about ten to twenty minutes, condor_status shows both the machine (Job) entry and the DaemonMaster entry for the worker node.
If I run condor_restart on the worker node itself, it rejoins the pool and condor_status shows it within a few seconds.
What might the problem be?
Thanks in advance.
Tricky. Here are a couple of thoughts:
Is hostname <-> IP resolution slow for this machine? If it is, Condor could be spending a long time verifying that the remote admin command isn't spoofed before it accepts and acts on the request. It's a long shot, but not impossible.
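One quick way to check the resolution theory is to time forward and reverse lookups from each machine for the other's hostname. The sketch below is only an approximation: it measures what the local system resolver does via Python's `socket` module, which is not necessarily the exact path Condor takes when it authorizes a command. The function name `time_name_resolution` is hypothetical.

```python
import socket
import time


def time_name_resolution(hostname):
    """Time a forward (name -> IP) and a reverse (IP -> name) lookup.

    Returns a dict with the resolved address, the name returned by the
    reverse lookup, and the elapsed seconds for each step. A failed step
    leaves its result as None but still records how long it took.
    """
    result = {"address": None, "reverse_name": None,
              "forward_seconds": None, "reverse_seconds": None}

    start = time.monotonic()
    try:
        result["address"] = socket.gethostbyname(hostname)
    except OSError:
        pass  # forward lookup failed; address stays None
    result["forward_seconds"] = time.monotonic() - start

    if result["address"] is not None:
        start = time.monotonic()
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
            result["reverse_name"] = socket.gethostbyaddr(result["address"])[0]
        except OSError:
            pass  # no PTR record, or the resolver gave up
        result["reverse_seconds"] = time.monotonic() - start

    return result


if __name__ == "__main__":
    import sys
    # Pass the hostnames to check on the command line, e.g. the worker
    # node's name when run on the central manager, and vice versa.
    for name in sys.argv[1:] or ["localhost"]:
        print(name, time_name_resolution(name))
```

If either lookup takes several seconds, or the reverse lookup returns a different name than the one the pool is configured with, that would be worth fixing before digging further.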
If you tail the MasterLog file on the remote machine while issuing condor_restart from the central manager, can you see _where_ Condor is hanging? Compare the log output and the timestamps on the log messages with what you get when you run condor_restart locally. If you can, consider posting the MasterLog snippet for the restart in both cases. It should show where in the restart chain Condor is slowing down.
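To make the timestamp comparison less tedious, a small script can flag every jump between consecutive log entries that exceeds some threshold. This is a sketch under stated assumptions: it assumes each entry starts with a date token and a time token such as `04/29/10 16:34:48`, so check a few lines of your own MasterLog and adjust `time_format` if yours differs. The function name `find_log_gaps` and the 30-second default are arbitrary choices, not anything Condor defines.

```python
from datetime import datetime


def find_log_gaps(lines, threshold_seconds=30.0,
                  time_format="%m/%d/%y %H:%M:%S"):
    """Return (gap_seconds, previous_line, line) for every pair of
    consecutive timestamped lines more than threshold_seconds apart.

    Assumes each entry begins with "<date> <time>" in time_format;
    lines that don't parse (continuations, other formats) are skipped.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], time_format)
        except ValueError:
            continue  # not a timestamped entry under this format
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap > threshold_seconds:
                gaps.append((gap, prev_line.rstrip("\n"), line.rstrip("\n")))
        prev_time, prev_line = stamp, line
    return gaps
```

Run it once on the MasterLog captured during a remote restart and once on a local restart, e.g. `with open("MasterLog") as f: print(find_log_gaps(f, 10))`; the entries that bracket a long gap in the remote case but not the local one are the part of the log worth posting.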