[Condor-users] Matching to non-responding machines



Hello

In our grid (all machines running Condor 7.6.6) we sometimes have clients
which never start communicating, or stop communicating, with the central
manager. They nevertheless still seem to be recognized by Condor, e.g. they
appear in the output of condor_status.

I think this happens when a machine reboots and Condor starts before the
network is ready. If Condor was running on the machine before the reboot,
the central manager is somehow tricked into believing the machine is still
responding. It is not a general network problem: the machine itself can be
reached over the network. After restarting Condor on the execute node
everything works flawlessly again.
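
If the cause is really Condor starting before the network is up, I could
imagine delaying the Condor start until the collector port of the central
manager answers. Below is a rough, untested sketch of such a wrapper (the
hostname, port and timeout are placeholders for our setup; the init script
would call it before starting condor_master):

#!/usr/bin/env python
# Sketch only: block until the central manager's collector port answers,
# so that the init script can delay starting condor_master until the
# network is actually up.  Hostname, port and timeout are placeholders.
import socket
import sys
import time

CENTRAL_MANAGER = "condor-master.example.org"  # placeholder hostname
COLLECTOR_PORT = 9618                          # default collector port
GIVE_UP_AFTER = 300                            # seconds

deadline = time.time() + GIVE_UP_AFTER
while time.time() < deadline:
    try:
        socket.create_connection((CENTRAL_MANAGER, COLLECTOR_PORT),
                                 timeout=5).close()
        sys.exit(0)   # network is up, safe to start condor
    except socket.error:
        time.sleep(10)  # not ready yet, retry
sys.exit(1)  # network never came up, let the init script decide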

Is there a way for Condor to determine that a machine is not responding,
and then to restart the corresponding node or kick it out?
As the machines are reachable over the network (they can be pinged,
logged into, etc.), I have no reliable way to identify such a
non-responding machine.
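
The only idea I have so far is to compare the collector's view with a
direct query of each startd: a machine that the collector still advertises
but whose startd refuses a direct query would be a candidate. A rough,
untested sketch to run on the central manager (it assumes the condor
command line tools are in the PATH and that condor_status exits non-zero
when the startd cannot be reached):

#!/usr/bin/env python
# Sketch only: list machines that the collector advertises but whose
# startd does not answer a direct query.
import subprocess

def collector_machines():
    # machine names currently advertised by the collector
    out = subprocess.check_output(
        ["condor_status", "-format", "%s\n", "Machine"])
    return sorted(set(out.decode().split()))

def startd_responds(machine):
    # ask the startd itself instead of the collector
    with open("/dev/null", "w") as devnull:
        return subprocess.call(["condor_status", "-direct", machine],
                               stdout=devnull, stderr=devnull) == 0

for machine in collector_machines():
    if not startd_responds(machine):
        print("not responding: %s" % machine)

Would something like this be reliable enough to drive an automatic
condor_restart -name <machine>?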

One of the problems with such non-responding machines is that they can
block negotiation completely: the negotiator assigns a job to the machine
-> the request cannot be transmitted -> the negotiator assigns the job to
the same machine again, and so on.

Below you will find excerpts from the negotiator log on our master server
as well as the master log from the non-responding execute node.
For security reasons I changed hostnames and IP addresses.

To summarize, I would need the following (preferably solutions to all of
them):
a) How could I prevent such non-responding machines in the first place?
Perhaps each node could check its communication with the central manager
and restart Condor otherwise? (See the sketch after this list.)

b) Can the condor master discover such machines and kick them out of the
pool, e.g. restart them?

c) How can I keep a non-responding machine from tying up the negotiator?
E.g. try a different machine after three failed attempts...
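
For a), the best I could come up with is the following untested cron-job
sketch: it checks whether our own startd ad advertises a loopback address
(the symptom visible in the logs below) and restarts the local daemons in
that case. The init script path is a guess for our Debian install:

#!/usr/bin/env python
# Sketch only: cron watchdog for question a).  If our own startd ad
# advertises a loopback address, restart the local condor daemons.
import socket
import subprocess

# gethostname() must match the Machine attribute (usually the FQDN)
HOST = socket.gethostname()

try:
    # MyAddress of our ads as the collector sees them
    out = subprocess.check_output(
        ["condor_status", "-format", "%s\n", "MyAddress",
         "-constraint", 'Machine == "%s"' % HOST])
except subprocess.CalledProcessError:
    out = b""  # collector unreachable; try again on the next cron run

if b"127.0.0.1" in out:
    # daemons bound to loopback before the network was up -> restart
    subprocess.call(["/etc/init.d/condor", "restart"])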

If you need further information, please let me know.
Any help would be greatly appreciated.

Cheers,
Hermann
-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx

--- MasterLog from the non-responding execute node ---

03/26/12 14:53:55 Setting maximum accepts per cycle 4.
03/26/12 14:53:55 ******************************************************
03/26/12 14:53:55 ** condor_master (CONDOR_MASTER) STARTING UP
03/26/12 14:53:55 ** /usr/sbin/condor_master
03/26/12 14:53:55 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/26/12 14:53:55 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/26/12 14:53:55 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
03/26/12 14:53:55 ** $CondorPlatform: x86_64_deb_6.0-updated $
03/26/12 14:53:55 ** PID = 1125
03/26/12 14:53:55 ** Log last touched 3/26 14:52:05
03/26/12 14:53:55 ******************************************************
03/26/12 14:53:55 Using config source: /etc/condor/condor_config
03/26/12 14:53:55 Using local config sources: 
03/26/12 14:53:55    /etc/condor/condor_config.local
03/26/12 14:53:55 SharedPortEndpoint: creating DAEMON_SOCKET_DIR=/var/lock/condor/daemon_sock
03/26/12 14:53:55 SharedPortEndpoint: waiting for connections to named socket 1125_1bf3
03/26/12 14:53:55 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/26/12 14:53:55 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
03/26/12 14:53:55 DaemonCore: private command socket at <127.0.0.1:0?sock=1125_1bf3>
03/26/12 14:53:55 Setting maximum accepts per cycle 4.
03/26/12 14:53:55 Started DaemonCore process "/usr/lib/condor/libexec/condor_shared_port", pid and pgroup = 1143
03/26/12 14:53:55 Waiting for /var/lock/condor/shared_port_ad to appear.
03/26/12 14:53:56 Found /var/lock/condor/shared_port_ad.
03/26/12 14:53:56 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 1278
03/26/12 14:54:00 attempt to connect to <123.123.123.123:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20 to go).

03/26/12 15:53:55 Preen pid is 2293

--- NegotiatorLog from the central manager ---

03/27/12 13:57:27 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd at <127.0.0.1:9618>.
03/27/12 13:57:27 IO: Failed to read packet header
03/27/12 13:57:27 SECMAN: no classad from server, failing
03/27/12 13:57:27 ERROR: SECMAN:2007:Failed to end classad message.
03/27/12 13:57:27       Failed to initiate socket to send MATCH_INFO to slot1@HELLRAISER
03/27/12 13:58:27 ---------- Started Negotiation Cycle ----------
03/27/12 13:58:27 Phase 1:  Obtaining ads from collector ...
03/27/12 13:58:27   Getting all public ads ...
03/27/12 13:58:27   Sorting 58 ads ...
03/27/12 13:58:27   Getting startd private ads ...
03/27/12 13:58:27 Got ads: 58 public and 40 private
03/27/12 13:58:27 Public ads include 1 submitter, 40 startd
03/27/12 13:58:27 Phase 2:  Performing accounting ...
03/27/12 13:58:27 Phase 3:  Sorting submitter ads by priority ...
03/27/12 13:58:27 Phase 4.1:  Negotiating with schedds ...
03/27/12 13:58:27   Negotiating with auser@xxxxxxxxxxxxxxxxxxxx at <123.123.123.123:9618?sock=22323_ad9f_3>
03/27/12 13:58:27 0 seconds so far
03/27/12 13:58:27     Request 01364.00000:
03/27/12 13:58:27       Matched 1364.0 auser@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <123.123.123.123:9618?sock=22323_ad9f_3> preempting none <127.0.0.1:9618?noUDP&sock=1125_1bf3_2> slot1@HELLRAISER
03/27/12 13:58:27       Successfully matched with slot1@HELLRAISER
03/27/12 13:58:27     Got NO_MORE_JOBS;  done negotiating
03/27/12 13:58:27  negotiateWithGroup resources used scheddAds length 0 
03/27/12 13:58:27 ---------- Finished Negotiation Cycle ----------
03/27/12 13:58:27 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd at <127.0.0.1:9618>.
03/27/12 13:58:27 IO: Failed to read packet header
03/27/12 13:58:27 SECMAN: no classad from server, failing
03/27/12 13:58:27 ERROR: SECMAN:2007:Failed to end classad message.