
Re: [Condor-users] Matching to not responding machines



Hi

All machines are Linux-based: Ubuntu 11.10 running Condor 7.6.6, installed
from the Debian 6 package supplied by the Condor project team.
No firewall is started by default.

Could there be a similar effect on Linux machines? The machines seem to
stop responding mostly after restarts.
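
If the cause really is Condor starting before the network is ready, one thing that could be tried on the execute nodes is to delay the Condor start until the master can actually be reached. The following is only a rough sketch of that idea, not something taken from the Condor documentation: the function names are made up for illustration, and it assumes the collector on the master listens on Condor's default port 9618.

```python
import socket
import time


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def wait_for_port(host, port, max_wait=120.0, interval=5.0):
    """Poll host:port until it is reachable or max_wait seconds have passed.

    Returns True as soon as a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if port_reachable(host, port):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A start script on the node could call something like `wait_for_port("master.example.org", 9618)` (the host name is a placeholder) and start Condor only once it returns True. This checks only that the port accepts TCP connections, not that Condor itself is healthy.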

Cheers,
Hermann

On Tue, 2012-03-27 at 13:43 +0000, Wilding, Kevan A wrote:
> Hi,
>     This sounds more like machines starting the Condor service before the firewall has started, and therefore not picking up the required IP address. If these are Windows machines, you can change the Condor processes to a delayed start in the control panel.
> 
> Best
> Kevan
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Hermann Fuchs
> Sent: 27 March 2012 13:36
> To: condor-users
> Subject: [Condor-users] Matching to not responding machines
> 
> Hello
> 
> In our grid (all condor 7.6.6 machines) we sometimes have clients which do not communicate or stop communicating with the master. 
> They somehow seem to be recognized by condor e.g. they appear when condor_status is called.
> 
> I think this is happening when a machine reboots and the network is not yet ready. If condor was running on this machine before, the condor master is somehow tricked into believing this machine is still responding. It is not a network problem: the machine can be reached over the network. After restarting condor on the execute node it works flawlessly again.
> 
> Is there a way for condor to determine whether a machine is not responding, and to restart the corresponding node or kick it out?
> As the machines are reachable over the network (they can be pinged, logged in to, etc.), I have no idea how to identify such a non-responding machine.
> 
> One of the problems with such non-responding machines is that they are able to block negotiation completely.
> The negotiator assigns a job to it -> the request cannot be transmitted -> the negotiator assigns the job to this machine again, and so on.
> 
> Attached you will find excerpts from the negotiator log on our master server as well as the master log from the not responding execute node.
> For security reasons I changed the hostname and IP addresses.
> 
> To summarize, I would need the following (preferably a solution to all of them):
> a) How could I prevent such non responding machines in the first place?
> Perhaps each node could check communication with the master and restart condor otherwise?
> 
> b) Can the condor master discover such machines and kick them out of the index e.g. restart them?
> 
> c) How can I keep a non-responding machine from tying up the negotiator?
> E.g. try a different machine after three failures...
> 
> If you need further information, please let me know. Any help would be greatly appreciated.
> 
> Cheers,
> Hermann
> --
> -------------
> DI Hermann Fuchs
> Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology Department of Radiation Oncology Medical University Vienna Währinger Gürtel 18-20
> A-1090 Wien
> 
> Tel.  + 43 / 1 / 40 400 7271
> Mail. hermann.fuchs@xxxxxxxxxxxxxxxx

-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx