[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Match timed out - jobs don't run?



On Fri, Apr 25, 2014 at 02:31:19PM +0200, Steffen Grunewald wrote:
> On Thu, Apr 17, 2014 at 11:16:46AM +0200, Steffen Grunewald wrote:
> > On Wed, Apr 16, 2014 at 05:21:20PM +0200, Steffen Grunewald wrote:
> > > I'm running into problems with a freshly setup pool - as I rewrote
> > > all the config, there might be an ugly typo hiding somewhere.

Found in the StarterLog on the schedd machine:

14-04-25_15:42:21 failed to create classad; bad expr = 'Error occurred during initialization of VM
'

Um. - ???

> 
> I have reinstalled the master, and a single slave node.
> As soon as I release a single job, the compute node's startd writes this to the log:
> 
> 14-04-25_14:10:02 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
> 14-04-25_14:10:02 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0011s
> 14-04-25_14:10:02 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:55062>, access level DAEMON
> 14-04-25_14:10:02 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:55062>
> 14-04-25_14:10:22 condor_read(): timeout reading 5 bytes from <10.150.94.40:55062>.
> 14-04-25_14:10:22 IO: Failed to read packet header
> 14-04-25_14:10:22 Can't read ClaimId
> 14-04-25_14:10:22 Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.001s, payload: 0.000s)
> 14-04-25_14:10:22 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:35503>, access level NEGOTIATOR
> 14-04-25_14:10:22 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:35503>
> 14-04-25_14:10:22 slot1: match_info called
> 14-04-25_14:10:22 slot1: Received match <10.150.94.11:59101>#1398427233#1#...
> 14-04-25_14:10:22 slot1: State change: match notification protocol successful
> 14-04-25_14:10:22 slot1: Changing state: Unclaimed -> Matched
> 14-04-25_14:10:22 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.001s, payload: 0.000s)
> 14-04-25_14:20:23 slot1: State change: match timed out
> 14-04-25_14:20:23 slot1: Changing state: Matched -> Owner
> 14-04-25_14:20:23 slot1: State change: IS_OWNER is false
> 14-04-25_14:20:23 slot1: Changing state: Owner -> Unclaimed
> 14-04-25_14:21:02 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:54788>, access level NEGOTIATOR
> 14-04-25_14:21:02 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:54788>
> 14-04-25_14:21:02 slot1: match_info called
> 14-04-25_14:21:02 slot1: Received match <10.150.94.11:59101>#1398427233#3#...
> 14-04-25_14:21:02 slot1: State change: match notification protocol successful
> 14-04-25_14:21:02 slot1: Changing state: Unclaimed -> Matched
> 14-04-25_14:21:02 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
> 14-04-25_14:21:02 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:35589>, access level DAEMON
> 14-04-25_14:21:02 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:35589>
> 
> This corresponds to a single line in the MatchLog on the master:
> 
> 14-04-25_14:10:02       Matched 1.0 steffeng@xxxxxxxxxxxx <10.150.94.40:41503> preempting none <10.150.94.11:59101> slot1@xxxxxxxxxxxxxxxxxx
> (at 14:21:02, the game starts again)
> 
> and the ScheddLog shows
> 
> 14-04-25_14:10:02 (pid:3880) Using negotiation protocol: NEGOTIATE
> 14-04-25_14:10:02 (pid:3880) Negotiating for owner: steffeng@xxxxxxxxxxxx
> 14-04-25_14:10:02 (pid:3880) AutoCluster:config() significant attributes changed to
> _condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory
> 14-04-25_14:10:02 (pid:3880) Checking consistency running and runnable jobs
> 14-04-25_14:10:02 (pid:3880) Tables are consistent
> 14-04-25_14:10:02 (pid:3880) Rebuilt prioritized runnable job list in 0.000s.
> 14-04-25_14:10:02 (pid:3880) Return from HandleReq <negotiate> (handler: 0.001s, sec: 0.000s, payload: 0.000s)
> 14-04-25_14:10:02 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
> 14-04-25_14:10:02 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0008s
> 14-04-25_14:10:02 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
> 14-04-25_14:10:02 (pid:3880) Lost priority - 1 jobs matched
> 14-04-25_14:10:02 (pid:3880) Finished negotiating for steffeng in local pool: 1 matched, 0 rejected
> 14-04-25_14:10:02 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0001s
> 14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
> 14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0001s
> 14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
> 14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0001s
> 14-04-25_14:10:02 (pid:3880) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (7)
> 14-04-25_14:10:02 (pid:3880) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0002s
> 14-04-25_14:10:07 (pid:3880) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 14-04-25_14:10:07 (pid:3880) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 14-04-25_14:10:07 (pid:3880) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 14-04-25_14:10:07 (pid:3880) Sent ad to central manager for steffeng@xxxxxxxxxxxx
> 14-04-25_14:10:07 (pid:3880) Sent ad to 1 collectors for steffeng@xxxxxxxxxxxx
> 14-04-25_14:10:22 (pid:3880) Calling Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> (7)
> 14-04-25_14:10:22 (pid:3880) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxx <10.150.94.11:59101> for steffeng 1.0
> 14-04-25_14:10:22 (pid:3880) Match record (slot1@xxxxxxxxxxxxxxxxxx <10.150.94.11:59101> for steffeng, 1.0) deleted
> 14-04-25_14:10:22 (pid:3880) Return from Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> 0.0001s
> ...
> 14-04-25_14:19:57 (pid:3880) Return from HandleReq <receive_startd_update> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
> 14-04-25_14:20:08 (pid:3880) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 14-04-25_14:20:08 (pid:3880) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 14-04-25_14:20:08 (pid:3880) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 14-04-25_14:20:08 (pid:3880) Sent ad to central manager for steffeng@xxxxxxxxxxxx
> 14-04-25_14:20:08 (pid:3880) Sent ad to 1 collectors for steffeng@xxxxxxxxxxxx
> 14-04-25_14:21:02 (pid:3880) Calling Handler <<Negotiator Command>> (6)
> 14-04-25_14:21:02 (pid:3880) Activity on stashed negotiator socket: <10.150.94.40:49377>
> 14-04-25_14:21:02 (pid:3880) Using negotiation protocol: NEGOTIATE
> 14-04-25_14:21:02 (pid:3880) Negotiating for owner: steffeng@xxxxxxxxxxxx
> (etc.)
> 
> So why does the match time out after 10 minutes? I have not the faintest idea. How can I find out?
> Any suggestions how to debug this?
> 
> - S
> 
> -- 
> Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
> MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
> http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}