
Re: [Condor-users] When condor_restart from central manager worker node doesn't join pool or not rapidly



On Thu, Apr 29, 2010 at 6:00 AM, michele pierri <pierm4ci@xxxxxxxx> wrote:
Hi,
When I run condor_restart from the central manager, this happens:
1) Even after many minutes, condor_status -any shows only the DaemonMaster of the worker node in the list, but not the job machine.
2) After about ten to twenty minutes, condor_status returns both the job machine and the DaemonMaster of the worker node.

If I run condor_restart from the worker node, it joins the pool and condor_status shows it after a few seconds.

Thanks in advance.

What may be the problem?


Tricky. Here are a couple of thoughts:

Is hostname <-> IP resolution taking a long time for this machine? If it is, Condor could be taking a long time to verify that remote admin commands aren't being spoofed before it accepts and acts on the request. That seems like a long shot, but it's not impossible.
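
One rough way to check the resolution theory is to time a forward and a reverse lookup for the worker's hostname. The sketch below uses only Python's standard socket module; the helper names (timed, check_resolution) are made up for illustration, and it measures the system resolver, not whatever Condor does internally.

```python
import socket
import time


def timed(fn, arg):
    """Call fn(arg); return (result, seconds). Result is None if the lookup fails."""
    start = time.perf_counter()
    try:
        result = fn(arg)
    except OSError:  # socket.gaierror and socket.herror both derive from OSError
        result = None
    return result, time.perf_counter() - start


def check_resolution(hostname):
    """Time a forward lookup and, if it succeeds, the matching reverse lookup."""
    ip, forward_s = timed(socket.gethostbyname, hostname)
    reverse, reverse_s = None, 0.0
    if ip is not None:
        reverse, reverse_s = timed(socket.gethostbyaddr, ip)
    # gethostbyaddr returns (primary_name, aliases, addresses)
    name = reverse[0] if reverse else None
    return {"ip": ip, "forward_s": forward_s, "name": name, "reverse_s": reverse_s}
```

Calling check_resolution with the worker's hostname on the central manager, and with the central manager's hostname on the worker, should show whether either direction takes seconds instead of milliseconds.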

If you tail the MasterLog file on the remote machine while issuing the condor_restart remotely, can you see _where_ Condor is hanging? Compare the log output and the time stamps on the log messages with what you get when you run condor_restart locally. If you can, consider posting the MasterLog snippet for the restart in both cases. It should show you where, in the restart chain, Condor is slowing down.
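
If eyeballing the time stamps gets tedious, a small script can list the largest gaps between consecutive log lines. This is only a sketch: it assumes each line starts with a "MM/DD HH:MM:SS " prefix, which may not match your Condor version's log format, so adjust the pattern if needed. The function name largest_gaps is hypothetical.

```python
import re
from datetime import datetime

# Assumed line prefix: "MM/DD HH:MM:SS ". Change this if your logs differ.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) ")


def largest_gaps(lines, top=3):
    """Return up to `top` tuples (gap_seconds, line_before, line_after), largest first."""
    prev = None
    gaps = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a recognizable time stamp
        # The prefix carries no year; a fixed leap year is supplied so 02/29 parses.
        now = datetime.strptime("2000/" + match.group(1), "%Y/%m/%d %H:%M:%S")
        text = line.rstrip()
        if prev is not None:
            gaps.append(((now - prev[0]).total_seconds(), prev[1], text))
        prev = (now, text)
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Feeding it the lines of the MasterLog from the remote restart and then from the local restart should make it easy to see which step precedes the long pause. It does not handle logs that cross a year boundary.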

- Ian