[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Jobs on Windows Pool are being preempted for no obvious reason



> maybe there is a firewall preventing your execute machines from
> connecting to your submit machine?  To test my wild guess, here is
> something to try: lets say your submit machine is my.submit.com (doing
> a condor_status -schedd will show all your submit machine names), can
> you login to an execute machine that kicked off the job lik .246 and
> run:
>
> condor_ping -type schedd -name my.submit.com read
>
> This command will say "read...succeeded" or "read...failed" depending
> upon if it successfully could contact the schedd on your submit
> machine.  If it says "failed", then we know what is happening, and
> you'll need to fix your firewall/network issue.

It's quite difficult here to run commands on remote machines apart from HTCondor...is there something I can look for in the Condor log files? Maybe with some debugging turned on?

>> So finally, my question: how can I examine the details of why HTC is
>> doing this machine switching? I've poked around in various log files
>> but don't see anything obvious. Or, what condor_status or condor_q
>> commands would reveal the motive for the switching?

> I would want to see the StartLog on a machine like .246 from the time
> a job starts until it leaves.  You will see in the log it saying the
> slot going to Claimed->Busy, then you will want to see the messages
> around where it changes away from Claimed->Busy...

FYI, here's the latest example:

000 (114.000.000) 11/11 13:42:00 Job submitted from host: <1.2.3.189:9603>
...
001 (114.000.000) 11/11 13:42:16 Job executing on host: <1.2.3.222:9676>
...
006 (114.000.000) 11/11 13:42:24 Image size of job updated: 1000
    1  -  MemoryUsage of job (MB)
    120  -  ResidentSetSize of job (KB)
...
006 (114.000.000) 11/11 13:47:24 Image size of job updated: 130844
    128  -  MemoryUsage of job (MB)
    130844  -  ResidentSetSize of job (KB)
...
010 (114.000.000) 11/11 13:54:39 Job was suspended.
    Number of processes actually suspended: 3
...
011 (114.000.000) 11/11 13:55:39 Job was unsuspended.
...
010 (114.000.000) 11/11 14:06:25 Job was suspended.
    Number of processes actually suspended: 3
...
011 (114.000.000) 11/11 14:07:29 Job was unsuspended.
...
001 (114.000.000) 11/11 14:14:31 Job executing on host: <1.2.3.248:9612>
...

Here are the relevant lines from the StartLog of machine .222:

11/11/13 13:42:05 slot1: Request accepted.
11/11/13 13:42:10 slot1: Remote owner is rfinch@xxxxxxxxxxxx
11/11/13 13:42:10 slot1: State change: claiming protocol successful
11/11/13 13:42:10 slot1: Changing state: Unclaimed -> Claimed
11/11/13 13:42:10 slot1: match_info called
11/11/13 13:42:10 slot1: Got activate_claim request from shadow (1.2.3.189)
11/11/13 13:42:10 slot1: Remote job ID is 114.0
11/11/13 13:42:10 slot1: Got universe "VANILLA" (5) from request classad
11/11/13 13:42:10 slot1: State change: claim-activation protocol successful
11/11/13 13:42:10 slot1: Changing activity: Idle -> Busy
11/11/13 13:51:40 slot2: match_info called
11/11/13 13:51:40 slot2: Received match <1.2.3.222:9676>#1384203855#2#...
11/11/13 13:51:40 slot2: State change: match notification protocol successful
11/11/13 13:51:40 slot2: Changing state: Unclaimed -> Matched
11/11/13 13:51:40 slot2: Request accepted.
11/11/13 13:51:44 slot2: Remote owner is rfinch@xxxxxxxxxxxx
11/11/13 13:51:44 slot2: State change: claiming protocol successful
11/11/13 13:51:44 slot2: Changing state: Matched -> Claimed
11/11/13 13:51:44 slot3: Request accepted.
11/11/13 13:51:49 slot3: Remote owner is rfinch@xxxxxxxxxxxx
11/11/13 13:51:49 slot3: State change: claiming protocol successful
11/11/13 13:51:49 slot3: Changing state: Unclaimed -> Claimed
11/11/13 13:51:49 slot4: Request accepted.
11/11/13 13:51:53 slot4: Remote owner is rfinch@xxxxxxxxxxxx
11/11/13 13:51:53 slot4: State change: claiming protocol successful
11/11/13 13:51:53 slot4: Changing state: Unclaimed -> Claimed
11/11/13 13:51:53 slot2: Got activate_claim request from shadow (1.2.3.189)
11/11/13 13:51:53 slot2: Remote job ID is 115.0
11/11/13 13:51:53 slot2: Got universe "VANILLA" (5) from request classad
11/11/13 13:51:53 slot2: State change: claim-activation protocol successful
11/11/13 13:51:53 slot2: Changing activity: Idle -> Busy
11/11/13 13:51:53 slot3: match_info called
11/11/13 13:51:53 slot4: match_info called
11/11/13 13:51:55 slot3: Got activate_claim request from shadow (1.2.3.189)
11/11/13 13:51:55 slot3: Remote job ID is 115.1
11/11/13 13:51:55 slot3: Got universe "VANILLA" (5) from request classad
11/11/13 13:51:55 slot3: State change: claim-activation protocol successful
11/11/13 13:51:55 slot3: Changing activity: Idle -> Busy
11/11/13 13:52:00 slot4: Got activate_claim request from shadow (1.2.3.189)
11/11/13 13:52:00 slot4: Remote job ID is 115.2
11/11/13 13:52:00 slot4: Got universe "VANILLA" (5) from request classad
11/11/13 13:52:00 slot4: State change: claim-activation protocol successful
11/11/13 13:52:00 slot4: Changing activity: Idle -> Busy
11/11/13 13:54:38 slot1: State change: SUSPEND is TRUE
11/11/13 13:54:38 slot1: Changing activity: Busy -> Suspended
11/11/13 13:54:38 slot2: State change: SUSPEND is TRUE
11/11/13 13:54:38 slot2: Changing activity: Busy -> Suspended
11/11/13 13:54:38 slot3: State change: SUSPEND is TRUE
11/11/13 13:54:38 slot3: Changing activity: Busy -> Suspended
11/11/13 13:54:38 slot4: State change: SUSPEND is TRUE
11/11/13 13:54:38 slot4: Changing activity: Busy -> Suspended
11/11/13 13:55:39 slot2: State change: CONTINUE is TRUE
11/11/13 13:55:39 slot2: Changing activity: Suspended -> Busy
11/11/13 13:55:39 slot3: State change: CONTINUE is TRUE
11/11/13 13:55:39 slot3: Changing activity: Suspended -> Busy
11/11/13 13:55:39 slot4: State change: CONTINUE is TRUE
11/11/13 13:55:39 slot4: Changing activity: Suspended -> Busy
11/11/13 13:55:39 slot1: State change: CONTINUE is TRUE
11/11/13 13:55:39 slot1: Changing activity: Suspended -> Busy
11/11/13 14:03:33 attempt to connect to <1.2.3.189:9603> failed: connect errno = 10048.  Will keep trying for 97 total seconds (97 to go).

11/11/13 14:06:24 slot2: State change: SUSPEND is TRUE
11/11/13 14:06:24 slot2: Changing activity: Busy -> Suspended
11/11/13 14:06:24 slot3: State change: SUSPEND is TRUE
11/11/13 14:06:24 slot3: Changing activity: Busy -> Suspended
11/11/13 14:06:24 slot4: State change: SUSPEND is TRUE
11/11/13 14:06:24 slot4: Changing activity: Busy -> Suspended
11/11/13 14:06:24 slot1: State change: SUSPEND is TRUE
11/11/13 14:06:24 slot1: Changing activity: Busy -> Suspended
11/11/13 14:07:29 slot4: State change: CONTINUE is TRUE
11/11/13 14:07:29 slot4: Changing activity: Suspended -> Busy
11/11/13 14:07:29 slot1: State change: CONTINUE is TRUE
11/11/13 14:07:29 slot1: Changing activity: Suspended -> Busy
11/11/13 14:07:29 slot2: State change: CONTINUE is TRUE
11/11/13 14:07:29 slot2: Changing activity: Suspended -> Busy
11/11/13 14:07:29 slot3: State change: CONTINUE is TRUE
11/11/13 14:07:29 slot3: Changing activity: Suspended -> Busy
11/11/13 14:13:51 slot1: State change: claim no longer recognized by the schedd - removing claim
11/11/13 14:13:51 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/11/13 14:14:21 slot1: starter (pid 4092) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.
11/11/13 14:14:21 Starter pid 4092 exited with status 0
11/11/13 14:14:21 slot1: State change: starter exited
11/11/13 14:14:21 slot1: State change: No preempting claim, returning to owner
11/11/13 14:14:21 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/11/13 14:14:21 slot1: State change: IS_OWNER is false
11/11/13 14:14:21 slot1: Changing state: Owner -> Unclaimed