[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Match timed out - jobs don't run?



On Thu, Apr 17, 2014 at 11:16:46AM +0200, Steffen Grunewald wrote:
> On Wed, Apr 16, 2014 at 05:21:20PM +0200, Steffen Grunewald wrote:
> > I'm running into problems with a freshly setup pool - as I rewrote
> > all the config, there might be an ugly typo hiding somewhere.

I have reinstalled the master, and a single slave node.
As soon as I release a single job, the compute node's startd writes this to the log:

14-04-25_14:10:02 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
14-04-25_14:10:02 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0011s
14-04-25_14:10:02 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:55062>, access level DAEMON
14-04-25_14:10:02 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:55062>
14-04-25_14:10:22 condor_read(): timeout reading 5 bytes from <10.150.94.40:55062>.
14-04-25_14:10:22 IO: Failed to read packet header
14-04-25_14:10:22 Can't read ClaimId
14-04-25_14:10:22 Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.001s, payload: 0.000s)
14-04-25_14:10:22 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:35503>, access level NEGOTIATOR
14-04-25_14:10:22 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:35503>
14-04-25_14:10:22 slot1: match_info called
14-04-25_14:10:22 slot1: Received match <10.150.94.11:59101>#1398427233#1#...
14-04-25_14:10:22 slot1: State change: match notification protocol successful
14-04-25_14:10:22 slot1: Changing state: Unclaimed -> Matched
14-04-25_14:10:22 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.001s, payload: 0.000s)
14-04-25_14:20:23 slot1: State change: match timed out
14-04-25_14:20:23 slot1: Changing state: Matched -> Owner
14-04-25_14:20:23 slot1: State change: IS_OWNER is false
14-04-25_14:20:23 slot1: Changing state: Owner -> Unclaimed
14-04-25_14:21:02 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:54788>, access level NEGOTIATOR
14-04-25_14:21:02 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:54788>
14-04-25_14:21:02 slot1: match_info called
14-04-25_14:21:02 slot1: Received match <10.150.94.11:59101>#1398427233#3#...
14-04-25_14:21:02 slot1: State change: match notification protocol successful
14-04-25_14:21:02 slot1: Changing state: Unclaimed -> Matched
14-04-25_14:21:02 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
14-04-25_14:21:02 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:35589>, access level DAEMON
14-04-25_14:21:02 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:35589>

This corresponds to a single line in the MatchLog on the master:

14-04-25_14:10:02       Matched 1.0 steffeng@xxxxxxxxxxxx <10.150.94.40:41503> preempting none <10.150.94.11:59101> slot1@xxxxxxxxxxxxxxxxxx
(at 14:21:02, the game starts again)

and the ScheddLog shows

14-04-25_14:10:02 (pid:3880) Using negotiation protocol: NEGOTIATE
14-04-25_14:10:02 (pid:3880) Negotiating for owner: steffeng@xxxxxxxxxxxx
14-04-25_14:10:02 (pid:3880) AutoCluster:config() significant attributes changed to
_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory
14-04-25_14:10:02 (pid:3880) Checking consistency running and runnable jobs
14-04-25_14:10:02 (pid:3880) Tables are consistent
14-04-25_14:10:02 (pid:3880) Rebuilt prioritized runnable job list in 0.000s.
14-04-25_14:10:02 (pid:3880) Return from HandleReq <negotiate> (handler: 0.001s, sec: 0.000s, payload: 0.000s)
14-04-25_14:10:02 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-04-25_14:10:02 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0008s
14-04-25_14:10:02 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-04-25_14:10:02 (pid:3880) Lost priority - 1 jobs matched
14-04-25_14:10:02 (pid:3880) Finished negotiating for steffeng in local pool: 1 matched, 0 rejected
14-04-25_14:10:02 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0001s
14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0001s
14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0001s
14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0002s
14-04-25_14:10:07 (pid:3880) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
14-04-25_14:10:07 (pid:3880) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-04-25_14:10:07 (pid:3880) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-04-25_14:10:07 (pid:3880) Sent ad to central manager for steffeng@xxxxxxxxxxxx
14-04-25_14:10:07 (pid:3880) Sent ad to 1 collectors for steffeng@xxxxxxxxxxxx
14-04-25_14:10:22 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> (7)
14-04-25_14:10:22 (pid:3880) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxx <10.150.94.11:59101> for steffeng 1.0
14-04-25_14:10:22 (pid:3880) Match record (slot1@xxxxxxxxxxxxxxxxxx <10.150.94.11:59101> for steffeng, 1.0) deleted
14-04-25_14:10:22 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> 0.0001s
...
14-04-25_14:19:57 (pid:3880) Return from HandleReq <receive_startd_update> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
14-04-25_14:20:08 (pid:3880) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
14-04-25_14:20:08 (pid:3880) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-04-25_14:20:08 (pid:3880) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-04-25_14:20:08 (pid:3880) Sent ad to central manager for steffeng@xxxxxxxxxxxx
14-04-25_14:20:08 (pid:3880) Sent ad to 1 collectors for steffeng@xxxxxxxxxxxx
14-04-25_14:21:02 (pid:3880) Calling Handler <<Negotiator Command>> (6)
14-04-25_14:21:02 (pid:3880) Activity on stashed negotiator socket: <10.150.94.40:49377>
14-04-25_14:21:02 (pid:3880) Using negotiation protocol: NEGOTIATE
14-04-25_14:21:02 (pid:3880) Negotiating for owner: steffeng@xxxxxxxxxxxx
(etc.)

So why does the match time out after 10 minutes? I have not the faintest idea. How can I find out?
Any suggestions how to debug this?

- S

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}