
[Condor-users] Flocking - jobs matched but not started



Hello,

I have managed to get flocking working to the point that a local job is
picked up by a remote Condor server (collector?) and the job is matched
to an available client. However, that is as far as it gets: the job
doesn't actually run. All the systems involved are running Linux. The
local system is using Condor 6.6.10; the remote system and the execution
client are running Condor 6.7.6.

This is a test setup involving just 3 systems, so I can control what is
happening a fair bit. I have modified the execution client to use
D_FULLDEBUG for the STARTD, and set SEC_DEBUG_PRINT_KEYS to true (the log
file kept mentioning it defaulting to false, so I thought I'd change it
to see what else it showed). The client's StartLog shows:

===========================================================
8/19 12:04:15 match_info called
8/19 12:04:15 Received match <141.163.60.7:32770>#1124448740#7
8/19 12:04:15 Started match timer (19) for 120 seconds.
8/19 12:04:15 State change: match notification protocol successful
8/19 12:04:15 Changing state: Unclaimed -> Matched
8/19 12:04:15 DC_AUTHENTICATE: attempt to open invalid session ws-60-7:231:1124383840:1, failing.
8/19 12:04:15 DC_AUTHENTICATE: attempt to open invalid session ws-60-7:231:1124383840:1, failing.
8/19 12:04:19 Trying to update collector <141.163.66.135:9618>
8/19 12:04:19 Attempting to send update via UDP to collector ltsp.csd.plymouth.ac.uk <141.163.66.135:9618>
8/19 12:04:19 Sent update to 1 collector(s)
8/19 12:04:27 Getting monitoring info for pid 231
8/19 12:06:15 Canceled match timer (19)
8/19 12:06:15 State change: match timed out
8/19 12:06:15 Changing state: Matched -> Owner
8/19 12:06:15 State change: IS_OWNER is false
8/19 12:06:15 Changing state: Owner -> Unclaimed
8/19 12:06:19 Trying to update collector <141.163.66.135:9618>
8/19 12:06:19 Attempting to send update via UDP to collector ltsp.csd.plymouth.ac.uk <141.163.66.135:9618>
8/19 12:06:19 Sent update to 1 collector(s)
===========================================================
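
To make the timing in the log explicit, here is a rough Python sketch that
pulls the "Changing state" lines out of a StartLog excerpt and works out how
long the machine sat in the Matched state. The function names, the regular
expression and the year are my own assumptions for illustration (the log
lines carry no year); only the three log lines are taken from the output
above.

```python
import re
from datetime import datetime

# State-change lines copied from the StartLog excerpt above.
LOG = """\
8/19 12:04:15 Changing state: Unclaimed -> Matched
8/19 12:06:15 Changing state: Matched -> Owner
8/19 12:06:15 Changing state: Owner -> Unclaimed
"""

# Assumption: the log timestamps have no year, so one is supplied here
# purely so that the timestamps can be parsed and subtracted.
ASSUMED_YEAR = 2005

# Hypothetical pattern matching "M/D HH:MM:SS Changing state: A -> B".
STATE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) Changing state: (\w+) -> (\w+)$")


def state_changes(text):
    """Return (timestamp, old_state, new_state) tuples found in the text."""
    changes = []
    for line in text.splitlines():
        m = STATE_RE.match(line.strip())
        if m:
            when = datetime.strptime(
                f"{ASSUMED_YEAR} {m.group(1)}", "%Y %m/%d %H:%M:%S"
            )
            changes.append((when, m.group(2), m.group(3)))
    return changes


def seconds_in_state(changes, state):
    """Seconds between entering `state` and leaving it, or None if not seen."""
    entered = None
    for when, old, new in changes:
        if new == state:
            entered = when
        elif old == state and entered is not None:
            return (when - entered).total_seconds()
    return None


changes = state_changes(LOG)
print(seconds_in_state(changes, "Matched"))  # prints 120.0
```

The 120 seconds it reports is the same as the match timer started at
12:04:15, i.e. the machine stayed in Matched for the whole timer and was
never claimed.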

I am now stuck. The job seems to have been accepted, and condor_status
shows the execution client as 'Matched', but the job doesn't start.

Anyone any ideas as to where I go from here? I'll try setting the startd
to D_ALL to see if it offers anything more.


Thanks,

John.

-- 
---------------------------------------------------------------
John Horne, University of Plymouth, UK  Tel: +44 (0)1752 233914
E-mail: John.Horne@xxxxxxxxxxxxxx       Fax: +44 (0)1752 233839