Re: [Condor-users] When condor_restart from central manager worker node doesn't join pool or not rapidly

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Thu, Apr 29, 2010 at 6:00 AM, michele pierri <pierm4ci@xxxxxxxx> wrote:

Hi,
When I run condor_restart from the central manager, I see the following:
1) Even after many minutes, condor_status -any shows only the DaemonMaster of the worker node in the list, but not the job machine.
2) After about ten to twenty minutes, condor_status returns both the job machine and the DaemonMaster of the worker node.

If I run condor_restart on the worker node itself, it joins the pool and condor_status shows it after a few seconds.

What might the problem be?

Thanks in advance.

Tricky. Here are a couple of thoughts:

Is hostname <-> IP resolution taking a long time for this machine? If it is, Condor could be taking a long time to verify that remote admin commands aren't being spoofed before it accepts and acts on the request. That seems like a long shot, but it's not impossible.
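One rough way to check the resolution theory is to time a forward and a reverse lookup from the worker node (and again from the central manager). The sketch below uses only the Python standard library; the function name time_lookup is made up for illustration, and it measures the system resolver, not Condor's own code path, so treat the numbers as a hint only.

```python
import socket
import time


def time_lookup(host):
    """Time a forward (name -> IP) and reverse (IP -> name) lookup.

    Returns (ip, reverse_name, forward_seconds, reverse_seconds).
    reverse_name is None if the reverse lookup fails.
    """
    start = time.perf_counter()
    ip = socket.gethostbyname(host)          # forward lookup
    forward_seconds = time.perf_counter() - start

    start = time.perf_counter()
    try:
        reverse_name = socket.gethostbyaddr(ip)[0]   # reverse lookup
    except OSError:
        # No reverse record (or resolver error); still report the time spent.
        reverse_name = None
    reverse_seconds = time.perf_counter() - start

    return ip, reverse_name, forward_seconds, reverse_seconds
```

Call it with the central manager's hostname on the worker, and with the worker's hostname on the central manager. If either direction takes seconds rather than milliseconds, or the reverse name does not match the name you started with, name resolution is worth a closer look.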

If you tail the MasterLog file on the remote machine while issuing the condor_restart remotely, can you see _where_ Condor is hanging? Compare the log output and the time stamps on the log messages with what you get when you run condor_restart locally. If you can, consider posting the MasterLog snippet for the restart in both cases. It should show you where in the restart chain Condor is slowing down.
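If eyeballing the time stamps is tedious, a small script can point at the biggest pauses in a log snippet. This is only a sketch: it assumes each log line starts with a "month/day hour:minute:second" stamp such as "4/29 06:00:01" (adjust the pattern if your MasterLog looks different), it assumes the year since the stamp does not carry one, and the names parse_stamp and largest_gaps are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "M/D HH:MM:SS " (an assumption, check your own log).
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def parse_stamp(line, year=2010):
    """Return a datetime for the line's leading stamp, or None if absent."""
    match = STAMP.match(line)
    if not match:
        return None
    month, day, hour, minute, second = map(int, match.groups())
    return datetime(year, month, day, hour, minute, second)


def largest_gaps(lines, top=3):
    """Return the `top` largest pauses as (gap, line_before, line_after)."""
    stamped = []
    for line in lines:
        when = parse_stamp(line)
        if when is not None:             # skip lines without a stamp
            stamped.append((when, line.rstrip("\n")))

    gaps = []
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gaps.append((t1 - t0, before, after))

    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Run it once over the snippet from the remote restart and once over the snippet from the local restart, for example with largest_gaps(open(path).readlines()) where path points at your saved snippet. The line just before the largest gap in the remote case is the step to look at first.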

- Ian
