
Re: [Condor-users] Flocking - jobs matched but not started



On Fri, 2005-08-19 at 07:33 -0500, Zachary Miller wrote:
> > This is a test setup involving just 3 systems so I can control what is
> > happening a fair bit. I have modified the execution client to use
> > FULLDEBUG for the STARTD, and set SEC_DEBUG_PRINT_KEYS to true (the log
> > file kept mentioning it defaulting to false so I thought I'd change it
> > to see what else it showed). The client startlog shows:
> 
> actually, what would be more useful is to turn on D_SECURITY and D_FULLDEBUG,
> or simply D_ALL.
> 
Okay, I have some output with D_ALL, and without the local schedd
crashing. The StartLog shows:

=======================================================
8/19 16:56:17 (fd:6) SECMAN: startCommand succeeded.
8/19 16:56:17 (fd:6) SEND [1000] <141.163.60.7:32771>
<141.163.66.135:9618>
8/19 16:56:17 (fd:6) SEND [1000] <141.163.60.7:32771>
<141.163.66.135:9618>
8/19 16:56:17 (fd:6) SEND [1000] <141.163.60.7:32771>
<141.163.66.135:9618>
8/19 16:56:17 (fd:6) SEND [225] <141.163.60.7:32771>
<141.163.66.135:9618>
8/19 16:56:17 (fd:5) Sent update to 1 collector(s)
8/19 16:56:17 (fd:5) In cancel_timer(), id=7
8/19 16:56:17 (fd:5) DaemonCore Timeout() Complete, returning 114
8/19 16:56:50 (fd:5) RECV 587 bytes at <141.163.60.7:32770> from
<141.163.66.135:32956>
8/19 16:56:50 (fd:5)    Full msg [587 bytes]
8/19 16:56:50 (fd:5) DC_AUTHENTICATE: received UDP packet from
<141.163.66.135:32956>.
8/19 16:56:50 (fd:5) DC_AUTHENTICATE: received DC_AUTHENTICATE from
<141.163.66.135:32956>
8/19 16:56:50 (fd:5) DC_AUTHENTICATE: received following ClassAd:
MyType = "(unknown type)"
TargetType = "(unknown type)"
OutgoingNegotiation = "PREFERRED"
Subsystem = "NEGOTIATOR"
ParentUniqueID = "ltsp:6875:1124463378"
ServerPid = 6877
SessionDuration = "8640000"
AuthCommand = 440
Enact = "YES"
AuthMethodsList = "FS,KERBEROS,GSI"
AuthMethods = "FS"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "NO"
Encryption = "NO"
Integrity = "NO"
UseSession = "YES"
Sid = "ws-60-7:231:1124466143:0"
ValidCommands = "440"
RemoteVersion = "$CondorVersion: 6.7.6 Mar 15 2005 $"
ServerCommandSock = "<141.163.66.135:37576>"
Command = 440
8/19 16:56:50 (fd:5) DC_AUTHENTICATE: attempt to open invalid session
ws-60-7:231:1124466143:0, failing.
8/19 16:56:50 (fd:6) LOWPORT undefined
8/19 16:56:50 (fd:6) SEND [33] <141.163.60.7:32771>
<141.163.66.135:37576>
8/19 16:56:50 (fd:5) DC_AUTHENTICATE: sent DC_INVALIDATE
ws-60-7:231:1124466143:0 to <141.163.66.135:37576>.
8/19 16:56:50 (fd:5) In DaemonCore Timeout()
8/19 16:56:50 (fd:5)
8/19 16:56:50 (fd:5) DaemonCore--> Timers
8/19 16:56:51 (fd:5) DaemonCore--> ~~~~~~
8/19 16:56:51 (fd:5) DaemonCore--> id = 0, when = 1124467091, period =
120, handler_descrip=<check_parent>
8/19 16:56:51 (fd:5) DaemonCore--> id = 3, when = 1124467205, period =
240, handler_descrip=<self_monitor>
8/19 16:56:51 (fd:5) DaemonCore--> id = 6, when = 1124467264, period =
300, handler_descrip=<eval_and_update_all>
8/19 16:56:51 (fd:5) DaemonCore--> id = 1, when = 1124467266, period =
300, handler_descrip=<check_session_cache>
8/19 16:56:51 (fd:5) DaemonCore--> id = 5, when = 1124468141, period =
1170, handler_descrip=<DaemonCore::SendAliveToParent>
8/19 16:56:51 (fd:5) DaemonCore--> id = 2, when = 1124468767, period =
1801, handler_descrip=<handle_cookie_refresh>
8/19 16:56:51 (fd:5) DaemonCore--> id = 4, when = 1124495726, period =
0, handler_descrip=<DaemonCore::ReInit()>
8/19 16:56:51 (fd:5)
8/19 16:56:51 (fd:5) DaemonCore Timeout() Complete, returning 80
8/19 16:56:51 (fd:6) ACCEPT src=<141.163.60.7:32770> fd=5
dst=<141.163.60.56:45064>
8/19 16:56:51 (fd:6) DC_AUTHENTICATE: received DC_AUTHENTICATE from
<141.163.60.56:45064>
8/19 16:56:51 (fd:6) DC_AUTHENTICATE: received following ClassAd:
MyType = "(unknown type)"
TargetType = "(unknown type)"
OutgoingNegotiation = "PREFERRED"
Subsystem = "SCHEDD"
ParentUniqueID = "ws-60-56:18672:1124462523"
ServerPid = 19425
SessionDuration = "8640000"
Enact = "YES"
AuthMethodsList = "FS,KERBEROS,GSI"
AuthMethods = "FS"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "NO"
Encryption = "NO"
Integrity = "NO"
UseSession = "YES"
Sid = "ws-60-7:231:1124466143:1"
ValidCommands = "403,404,427,435,436,441,442,443,444,446,466"
RemoteVersion = "$CondorVersion: 6.6.10 Jun 13 2005 $"
ServerCommandSock = "<141.163.60.56:44957>"
Command = 442
ServerTime = 1124467010
8/19 16:56:51 (fd:6) DC_AUTHENTICATE: attempt to open invalid session
ws-60-7:231:1124466143:1, failing.
8/19 16:56:52 (fd:7) LOWPORT undefined
8/19 16:56:52 (fd:7) SEND [33] <141.163.60.7:32771>
<141.163.60.56:44957>
8/19 16:56:52 (fd:6) DC_AUTHENTICATE: sent DC_INVALIDATE
ws-60-7:231:1124466143:1 to <141.163.60.56:44957>.
8/19 16:56:52 (fd:6) CLOSE <141.163.60.7:32770> fd=5
8/19 16:56:52 (fd:5) In DaemonCore Timeout()
8/19 16:56:52 (fd:5)
8/19 16:56:52 (fd:5) DaemonCore--> Timers
8/19 16:56:52 (fd:5) DaemonCore--> ~~~~~~
=======================================================
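For anyone skimming a log like the one above, the lines that matter are the
"attempt to open invalid session" failures. Below is a minimal Python sketch
(not part of Condor) that pulls them out of a StartLog. The function name
invalid_sessions is made up for illustration. It assumes that every log record
starts with an "M/D HH:MM:SS" stamp and that any other line is a wrapped
continuation, which matches this excerpt but may not hold for every log.

```python
import re

# A record starts with a "M/D HH:MM:SS " stamp; any other line is treated as a
# continuation of the previous record (assumption based on the excerpt above).
STAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")
# The session ID is the token between "invalid session" and ", failing".
INVALID = re.compile(r"attempt to open invalid session\s+(\S+), failing")


def invalid_sessions(lines):
    """Return (timestamp, session_id) pairs for rejected session attempts."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if STAMP.match(line) or not records:
            records.append(line)                  # start of a new record
        else:
            records[-1] += " " + line.strip()     # re-join a wrapped line
    hits = []
    for rec in records:
        match = INVALID.search(rec)
        if match:
            stamp = " ".join(rec.split()[:2])     # "8/19 16:56:50"
            hits.append((stamp, match.group(1)))
    return hits
```

Typical use would be invalid_sessions(open("StartLog")), where the path is
whatever your startd log is called locally. On the excerpt above it should
report the two session IDs ending in :0 and :1.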

If you need more info then let me know.

To clarify which IP address and name belongs to which machine:

   141.163.60.56 (ws-60-56.dhcp.plymouth.ac.uk):
      The 'personal Condor' Linux box used to submit and execute jobs.
      It flocks to 141.163.66.135. To force flocking, I have stopped
      the startd on this machine.

   141.163.66.135 (ltsp.csd.plymouth.ac.uk):
      The remote Condor/LTSP server. It runs all the daemons except
      startd. Remote clients connect to this server. The clients
      run LTSP and mount /opt/condor, which contains the Condor
      executables, from this server over NFS.

   141.163.60.7 (ws-60-7.dhcp.plymouth.ac.uk):
      The Condor client. This runs a hardened LTSP and uses NFS to
      mount Condor from the LTSP server (above). It runs only the
      master and startd daemons locally.

So, I am submitting jobs on the local machine (60.56). They are flocked
to the remote server (66.135), which passes them to a client in its
pool (60.7).


John.

-- 
---------------------------------------------------------------
John Horne, University of Plymouth, UK  Tel: +44 (0)1752 233914
E-mail: John.Horne@xxxxxxxxxxxxxx       Fax: +44 (0)1752 233839