
Re: [Condor-users] Matching to not responding machines



Hi Rob and Steffen

Tweaking the init script looks like a promising approach, and as a first
step I'll implement something like that.
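
A rough sketch of what such a start-up check could look like in place of a
fixed sleep: poll until the central manager accepts a TCP connection and
only then start condor. This is an assumption on my part, not something
taken from the condor documentation. The function name wait_for_tcp is made
up for illustration; 9618 is the default collector port.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=60.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the network path to the peer is usable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, timed out or name not resolvable yet: retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical use from an init wrapper, before condor_master is started:
#   wait_for_tcp("central-manager.example.org", 9618)
```

Unlike a fixed sleep 10, the wait ends as soon as the network is usable, and
the caller can decide what to do if the timeout expires.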

However, I would also like to implement some kind of failure detection for
the running grid, as network problems will and do occur.
Is there a query that is only answered when a machine is actually
communicating?
condor_status seems to be misleading: machines that have stopped
communicating remain listed there in some cases (e.g. the case mentioned
earlier).
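
One possible approach, sketched below under assumptions, is to ask each
startd directly instead of trusting the collector. The sketch assumes that
"condor_status -direct <hostname>" contacts the daemon on that host and
either exits non-zero or prints nothing on stdout when the host cannot be
reached; I have not verified this behaviour on 7.6.6. All helper names are
hypothetical.

```python
import subprocess


def machines_from_status(text):
    """Return the unique machine names from one-name-per-line output,
    e.g. the output of: condor_status -format "%s\\n" Machine"""
    return sorted({line.strip() for line in text.splitlines() if line.strip()})


def probe_direct(machine, timeout=20):
    """Assumed probe: query the startd on `machine`, bypassing the collector."""
    try:
        out = subprocess.check_output(
            ["condor_status", "-direct", machine],
            stderr=subprocess.DEVNULL, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Non-zero exit, hang, or missing binary all count as "no answer".
        return False
    return bool(out.strip())


def unresponsive(machines, probe=probe_direct):
    """Return the machines for which the probe got no answer."""
    return [m for m in machines if not probe(m)]
```

The probe is passed in as a parameter so that a different check can be
swapped in if -direct turns out to behave differently.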

match_list_length seems like a good way to minimize the problems caused
by such machines. Thank you for this tip.
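
For reference, a small sketch of how the matching submit description lines
could be generated. It relies on match_list_length = N making condor insert
the job attributes LastMatchName0 ... LastMatchName(N-1), as described in
man condor_submit. The helper names are hypothetical.

```python
def avoid_recent_matches(n):
    """Build a requirements clause that rejects the last n matched machines."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # =!= still evaluates to a boolean when LastMatchNameN is not yet defined.
    clauses = ["(TARGET.Name =!= LastMatchName%d)" % i for i in range(n)]
    return " && ".join(clauses)


def submit_lines(n):
    """Return the two submit description lines for a match list of length n."""
    return ["match_list_length = %d" % n,
            "requirements = " + avoid_recent_matches(n)]


if __name__ == "__main__":
    print("\n".join(submit_lines(3)))
```

On multi-slot machines Name identifies a single slot (e.g. slot1@host), so
this avoids recently matched slots rather than whole hosts.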

Best regards,
Hermann

On Wed, 2012-03-28 at 10:24 +0200, Rob de Graaf wrote:
> Hi Hermann,
> 
> If the problem occurs only on system reboot and not on condor restarts, 
> the boot sequence is where I'd look first. Have you tried simply adding 
> a sleep 10 to the condor init script? This is a bit of a hack but may 
> help if condor is somehow being started before all dependencies are up 
> and running, similar to a delayed start on Windows.
> 
> > c) How to keep a non responding machine from tying up the negotiator.
> > E.g. try a different machine after three failures...
> 
> You might be able to use match_list_length (man condor_submit). Jobs 
> will keep a list of machines they were recently matched with, so you can 
> set up job requirements to avoid them.
> 
> Regards,
> 
> Rob
> 
> On 03/28/2012 09:25 AM, Hermann Fuchs wrote:
> > Hi
> >
> > Any ideas how to solve this issue?
> >
> > Best regards,
> > Hermann
> > On Tue, 2012-03-27 at 16:08 +0200, Hermann Fuchs wrote:
> >> Hi
> >>
> >> All machines are Linux-based: Ubuntu 11.10 running condor 7.6.6 from the
> >> Debian 6 package supplied by the condor project team.
> >> By default no firewall is started.
> >>
> >> Could there be a similar effect on Linux machines? This non-responding
> >> behaviour seems to appear mostly after restarts.
> >>
> >> Cheers,
> >> Hermann
> >>
> >> On Tue, 2012-03-27 at 13:43 +0000, Wilding, Kevan A wrote:
> >>> Hi,
> >>> This sounds more like the machines are starting the condor service before the firewall has started, and are therefore not picking up the required IP address. If these are Windows machines, you can change the Condor processes to a delayed start in the control panel.
> >>>
> >>> Best
> >>> Kevan
> >>>
> >>> -----Original Message-----
> >>> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Hermann Fuchs
> >>> Sent: 27 March 2012 13:36
> >>> To: condor-users
> >>> Subject: [Condor-users] Matching to not responding machines
> >>>
> >>> Hello
> >>>
> >>> In our grid (all condor 7.6.6 machines) we sometimes have clients which do not communicate or stop communicating with the master.
> >>> They still seem to be recognized by condor, e.g. they appear when condor_status is called.
> >>>
> >>> I think this is happening when a machine reboots and the network is not yet ready. If condor was running on this machine before, the condor master is somehow tricked into believing this machine is still responding. It is not a network problem: the machine can be reached over the network. After restarting condor on the execute node it works flawlessly again.
> >>>
> >>> Is there a way for condor to determine if a machine is not responding and restart the corresponding node or kick it out?
> >>> As the machines are reachable over the network (able to be pinged, logged in etc.) I have no idea how to identify such a not responding machine.
> >>>
> >>> One of the problems with such non-responding machines is that they are able to block negotiation completely.
> >>> The negotiator assigns a job to it -> the request cannot be transmitted -> the negotiator assigns the job to this machine again ...
> >>>
> >>> Attached you will find excerpts from the negotiator log on our master server as well as the master log from the not responding execute node.
> >>> For security reasons I changed the hostname and IP addresses.
> >>>
> >>> To summarize, I would need the following (preferably a solution to all of them):
> >>> a) How could I prevent such non responding machines in the first place?
> >>> Perhaps each node could check communication with the master and restart condor otherwise?
> >>>
> >>> b) Can the condor master discover such machines and kick them out of the index e.g. restart them?
> >>>
> >>> c) How to keep a non responding machine from tying up the negotiator.
> >>> E.g. try a different machine after three failures...
> >>>
> >>> If you need further information, please let me know. Any help would be greatly appreciated.
> >>>
> >>> Cheers,
> >>> Hermann
> >>> --
> >>> -------------
> >>> DI Hermann Fuchs
> >>> Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
> >>> Department of Radiation Oncology
> >>> Medical University Vienna
> >>> Währinger Gürtel 18-20
> >>> A-1090 Wien
> >>>
> >>> Tel.  + 43 / 1 / 40 400 7271
> >>> Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
> >>> _______________________________________________
> >>> Condor-users mailing list
> >>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >>> subject: Unsubscribe
> >>> You can also unsubscribe by visiting
> >>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>>
> >>> The archives can be found at:
> >>> https://lists.cs.wisc.edu/archive/condor-users/
> >>
> >

-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx