
Re: [Condor-users] Matching to not responding machines

    This sounds more like the machines are starting the Condor service before the firewall has started, so Condor is not picking up the required IP address. If these are Windows machines, you can set the Condor service to a delayed start (startup type "Automatic (Delayed Start)") in the Services control panel.


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Hermann Fuchs
Sent: 27 March 2012 13:36
To: condor-users
Subject: [Condor-users] Matching to not responding machines


In our grid (all condor 7.6.6 machines) we sometimes have clients which never communicate or stop communicating with the master.
They still seem to be recognized by condor, e.g. they appear in the output of condor_status.

I think this happens when a machine reboots and the network is not yet ready. If condor was running on this machine before, the condor master is somehow tricked into believing the machine is still responding. It is not a network problem: the machine can be reached over the network. After restarting condor on the execute node, it works flawlessly again.

Is there a way for condor to determine that a machine is not responding and either restart the corresponding node or kick it out?
As the machines are reachable over the network (they can be pinged, logged into, etc.), I have no idea how to identify such a non-responding machine.

One of the problems with such non-responding machines is that they can block negotiation completely.
The negotiator assigns a job to the machine -> the request cannot be transmitted -> the negotiator assigns the job to the same machine again, and so on.

Attached you will find excerpts from the negotiator log on our master server as well as the master log from the non-responding execute node.
For security reasons I changed the hostnames and IP addresses.

To summarize, I would need the following (preferably a solution to all of them):
a) How can I prevent such non-responding machines in the first place?
Perhaps each node could check communication with the master and restart condor if the check fails?

b) Can the condor master discover such machines and either kick them out of the index or restart them?

c) How can I keep a non-responding machine from tying up the negotiator?
E.g. try a different machine after three failures...
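
To make the ideas in a) and c) concrete, here is a minimal sketch in Python. It is not Condor code and does not use any Condor API: the names `can_reach` and `FailureTracker` are hypothetical, and the limit of three is taken from the example in c). `can_reach` only tests whether a TCP connection can be opened, so it may well miss the failure described above, where the machine is reachable but the condor daemons are not communicating. `FailureTracker` shows only the "give up after three consecutive failures" bookkeeping, which could be used on a node to decide when to restart condor, or on a master-side script to decide which machines to skip.

```python
import socket
from collections import defaultdict


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Assumption: a successful TCP connect is only a rough proxy for
    "the daemon on the other side is responding".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureTracker:
    """Count consecutive failures per key and flag keys that hit a limit."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    def record(self, key, ok):
        """Record one attempt; a success resets the counter for that key."""
        if ok:
            self._failures.pop(key, None)
        else:
            self._failures[key] += 1

    def is_blocked(self, key):
        """True once the key has failed max_failures times in a row."""
        return self._failures.get(key, 0) >= self.max_failures

    def usable(self, keys):
        """Return the keys that have not reached the failure limit."""
        return [k for k in keys if not self.is_blocked(k)]


if __name__ == "__main__":
    # Hypothetical machine names, for illustration only.
    tracker = FailureTracker(max_failures=3)
    for _ in range(3):
        tracker.record("node07", ok=False)
    print(tracker.usable(["node07", "node08"]))  # ['node08']
```

On an execute node, the same tracker could be fed with the result of a periodic check against the master, and condor restarted once `is_blocked` returns True. Which check and which restart command to use depends on the installation.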

If you need further information, please let me know. Any help would be greatly appreciated.

DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx