[HTCondor-users] Windows many execute nodes not taking jobs



Hi HTCondor community

I’m running into a problem where a number of Windows 7 execute nodes are not accepting jobs. I have a pool of about 50 machines (with 4 or 8 nodes each), and about half of them happily accept jobs submitted from a Linux submit node. The other half do not. The HTCondor version is 8.2.1 (Jun 27 2014, BuildID: 256063).

These machines were all built from the same image and in all cases are running on their D: drives.

Any advice would be welcome! I tried restarting the service, restarting the machines, and restarting HTCondor on the central manager, all to no avail.

Thanks!
Mike Fienen
USGS Wisconsin Water Science Center
Middleton, WI USA


The cluster log on the submit machine is full of messages like this:
024 (1823.046.000) 10/09 08:07:50 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxx, rescheduling job

Then going to BLHBLAH34.gs.doi.net, I find this in the StartLog:
10/09/14 08:09:07 slot1: match_info called
10/09/14 08:09:07 slot1: Received match <xxx.xxx.xxx.134:64106>#1412826697#975#...
10/09/14 08:09:07 slot1: State change: match notification protocol successful
10/09/14 08:09:07 slot1: Changing state: Unclaimed -> Matched
10/09/14 08:09:07 slot1_1: New machine resource of type -1 allocated
10/09/14 08:09:07 slot1: Changing state: Matched -> Unclaimed
10/09/14 08:09:07 Setting up slot pairings
10/09/14 08:09:07 slot1_1: Request accepted.
10/09/14 08:09:07 slot1_1: Remote owner is mnfienen@xxxxxxxxxxxxxxxxxxxxxxxx
10/09/14 08:09:07 slot1_1: State change: claiming protocol successful
10/09/14 08:09:07 slot1_1: Changing state: Owner -> Claimed
10/09/14 08:09:07 slot1_1: Got activate_claim request from shadow (xxx.xxx.xxx.72)
10/09/14 08:09:07 slot1_1: Remote job ID is 1823.40
10/09/14 08:09:07 slot1_1: Got universe "VANILLA" (5) from request classad
10/09/14 08:09:07 slot1_1: State change: claim-activation protocol successful
10/09/14 08:09:07 slot1_1: Changing activity: Idle -> Busy
10/09/14 08:09:08 condor_read() failed: recv(fd=712) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:49722>.
10/09/14 08:09:08 IO: Failed to read packet header
10/09/14 08:09:08 Starter pid 3732 exited with status -1073740940
10/09/14 08:09:08 slot1_1: State change: starter exited
10/09/14 08:09:08 slot1_1: Changing activity: Busy -> Idle
10/09/14 08:09:08 Aborting CA_LOCATE_STARTER
10/09/14 08:09:08 ClaimId (<xxx.xxx.xxx.134:64106>#1412826697#975#40425bd1402e06a6391f4fdec6b771e1e7daa7b2) and GlobalJobId (BLAHBLAHM000.er.usgs.gov#1823.40#1412854596 ) not found
10/09/14 08:09:08 slot1_1: State change: received RELEASE_CLAIM command
10/09/14 08:09:08 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
10/09/14 08:09:08 slot1_1: State change: No preempting claim, returning to owner
10/09/14 08:09:08 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
10/09/14 08:09:08 slot1_1: State change: IS_OWNER is false
10/09/14 08:09:08 slot1_1: Changing state: Owner -> Unclaimed
10/09/14 08:09:08 slot1_1: Changing state: Unclaimed -> Delete
10/09/14 08:09:08 slot1_1: Resource no longer needed, deleting
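
One detail in the StartLog above may help with the diagnosis: the starter's exit status is printed as a signed 32-bit integer. On Windows, large negative exit statuses are often NTSTATUS codes, which are easier to look up in hexadecimal. The following is a minimal sketch, not part of HTCondor; the helper names to_ntstatus_hex and starter_exits are hypothetical, and the regular expression assumes the StartLog line format shown above.

```python
import re

# Matches the StartLog line format seen above (assumed, not an HTCondor API).
STARTER_EXIT = re.compile(r"Starter pid (\d+) exited with status (-?\d+)")


def to_ntstatus_hex(status: int) -> str:
    """Reinterpret a signed 32-bit exit status as an unsigned hex code."""
    return f"0x{status & 0xFFFFFFFF:08X}"


def starter_exits(lines):
    """Yield (pid, signed_status, hex_status) for each starter-exit line."""
    for line in lines:
        m = STARTER_EXIT.search(line)
        if m:
            pid, status = int(m.group(1)), int(m.group(2))
            yield pid, status, to_ntstatus_hex(status)


# The starter-exit line copied from the StartLog excerpt above.
sample = ["10/09/14 08:09:08 Starter pid 3732 exited with status -1073740940"]
for pid, status, code in starter_exits(sample):
    print(pid, status, code)  # 3732 -1073740940 0xC0000374
```

Read this way, -1073740940 is 0xC0000374, which Windows documents as STATUS_HEAP_CORRUPTION, and errno 10054 in the condor_read() line is WSAECONNRESET (connection reset by peer). Together these suggest the condor_starter process itself is crashing right after claim activation, rather than the job being rejected by policy. That does not identify the root cause, but it narrows where to look on the affected machines.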