[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Schedd sending startd deactivate_claim_forciblycommand?



Ian,

In the log output you provided, I don't see any transition to retirement or preemption due to any competing claim. All I see is a job running (Claimed/Busy) and then DEACTIVATE_CLAIM_FORCIBLY causing the claim to be closed. Do you believe this is true in general or did you just post a log snippet that is not representative of the problem you are seeing?

--Dan

Ian Chesal wrote:

I'm encountering a mysterious situation with our cluster. Jobs are
starting, entering the retirement phase because of machine rank based on
queue times, and then not being allowed the full MaxJobRetirementTime to
finish -- instead the scheduler from which all the jobs orginated is
sending this "command 404 (DEACTIVATE_CLAIM_FORCIBLY)" to the startd
running the job and it's being killed off.

On my startd machine I have the following log output (see below).
Following the vm2 starter you can see it gets a claim, starts running
it, moves it to retirement/preemption because of rank and then gets a
forcible deactivate command from the schedd machine (137.57.142.51). Why
is the schedd calling this command on the startd? Why is in not retiring
the job and allowing it to continue through the MaxJobRetirementTime
(which is set to 2 days)?

We are a little baffled at how we got into this mess. With our RANK
based on queue time we're seeing jobs just trashing in our system.

Thanks in advance for any insight you can offer!

Ian

11/2 14:16:20 DaemonCore: Command received via UDP from host
<137.57.176.9:35511>
11/2 14:16:20 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/2 14:16:20 vm2: match_info called
11/2 14:16:20 vm2: Received match <137.57.176.186:2147>#1099203682#159
11/2 14:16:20 vm2: State change: match notification protocol successful
11/2 14:16:20 vm2: Changing state: Unclaimed -> Matched
11/2 14:16:20 DaemonCore: Command received via TCP from host
<137.57.142.51:3349>
11/2 14:16:20 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
11/2 14:16:20 vm2: Request accepted.
11/2 14:16:20 vm2: Remote owner is gquan@xxxxxxxxxx
11/2 14:16:20 vm2: State change: claiming protocol successful
11/2 14:16:20 vm2: Changing state: Matched -> Claimed
11/2 14:16:27 DaemonCore: Command received via TCP from host
<137.57.142.51:3365>
11/2 14:16:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/2 14:16:27 vm2: Got activate_claim request from shadow
(<137.57.142.51:3365>)
11/2 14:16:27 vm2: Remote job ID is 3.3
11/2 14:16:27 vm2: Got universe "VANILLA" (5) from request classad
11/2 14:16:27 vm2: State change: claim-activation protocol successful
11/2 14:16:27 vm2: Changing activity: Idle -> Busy
11/2 14:21:23 DaemonCore: Command received via TCP from host
<137.57.176.186:2309>
11/2 14:21:23 DaemonCore: received command 448 (GIVE_STATE), calling
handler (command_give_state)
11/2 14:32:03 DaemonCore: Command received via TCP from host
<137.57.142.51:3699>
11/2 14:32:03 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/2 14:32:03 vm1: Called deactivate_claim_forcibly()
11/2 14:32:03 DaemonCore: Command received via UDP from host
<137.57.142.51:3702>
11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:03 vm1: State change: received RELEASE_CLAIM command
11/2 14:32:03 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
11/2 14:32:03 DaemonCore: Command received via UDP from host
<137.57.142.51:3703>
11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:03 vm1: Got RELEASE_CLAIM while in Preempting state,
ignoring.
11/2 14:32:04 DaemonCore: Command received via UDP from host
<137.57.176.186:2333>
11/2 14:32:04 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/2 14:32:04 Starter pid 1896 exited with status 0
11/2 14:32:13 vm1: State change: starter exited
11/2 14:32:13 vm1: State change: No preempting claim, returning to owner
11/2 14:32:13 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/2 14:32:13 vm1: State change: IS_OWNER is false
11/2 14:32:13 vm1: Changing state: Owner -> Unclaimed
11/2 14:32:47 DaemonCore: Command received via TCP from host
<137.57.142.51:3726>
11/2 14:32:47 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/2 14:32:47 vm2: Called deactivate_claim_forcibly()
11/2 14:32:47 DaemonCore: Command received via UDP from host
<137.57.176.186:2343>
11/2 14:32:47 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/2 14:32:47 Starter pid 1024 exited with status 0
11/2 14:32:54 vm2: State change: starter exited
11/2 14:32:58 vm2: Changing activity: Busy -> Idle
11/2 14:32:58 DaemonCore: Command received via UDP from host
<137.57.142.51:3729>
11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:58 vm2: State change: received RELEASE_CLAIM command
11/2 14:32:58 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/2 14:32:58 vm2: State change: No preempting claim, returning to owner
11/2 14:32:58 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/2 14:32:58 vm2: State change: IS_OWNER is false
11/2 14:32:58 vm2: Changing state: Owner -> Unclaimed
11/2 14:32:58 DaemonCore: Command received via UDP from host
<137.57.142.51:3730>
11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:58 Error: can't find resource with ClaimId
(<137.57.176.186:2147>#1099203682#159)


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users