[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Condor-users] Schedd sending startddeactivate_claim_forciblycommand?



Just to add to Ian's previous message. On the submitter side, we find in
the scheduler log the following:

11/2 14:36:01 timed out requesting claim from <137.57.176.182:3865>
11/2 14:36:01 Sent RELEASE_CLAIM to startd on <137.57.176.182:3865>
11/2 14:36:01 timed out requesting claim from <137.57.176.180:4557>
11/2 14:36:01 Sent RELEASE_CLAIM to startd on <137.57.176.180:4557>
11/2 14:36:01 timed out requesting claim from <137.57.176.185:1407>
11/2 14:36:01 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>

The scheduler also seems to be getting a lot of VACATE_SERVICE commands:

11/2 14:34:18 DaemonCore: Command received via TCP from host
<137.57.176.182:1460>
11/2 14:34:18 DaemonCore: received command 443 (VACATE_SERVICE), calling
handler (vacate_service)
11/2 14:34:18 Got VACATE_SERVICE from <137.57.176.182:1460>

What are these? Could someone figure out what's going on from these
commands?

Thanks
Jimmy


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: Tuesday, November 02, 2004 3:07 PM
To: Condor-Users Mail List
Subject: [Condor-users] Schedd sending startd
deactivate_claim_forciblycommand?

I'm encountering a mysterious situation with our cluster. Jobs are
starting, entering the retirement phase because of machine rank based on
queue times, and then not being allowed the full MaxJobRetirementTime to
finish -- instead the scheduler from which all the jobs orginated is
sending this "command 404 (DEACTIVATE_CLAIM_FORCIBLY)" to the startd
running the job and it's being killed off.
 
On my startd machine I have the following log output (see below).
Following the vm2 starter you can see it gets a claim, starts running
it, moves it to retirement/preemption because of rank and then gets a
forcible deactivate command from the schedd machine (137.57.142.51). Why
is the schedd calling this command on the startd? Why is in not retiring
the job and allowing it to continue through the MaxJobRetirementTime
(which is set to 2 days)?
 
We are a little baffled at how we got into this mess. With our RANK
based on queue time we're seeing jobs just trashing in our system.
 
Thanks in advance for any insight you can offer!
 
Ian
 
11/2 14:16:20 DaemonCore: Command received via UDP from host
<137.57.176.9:35511>
11/2 14:16:20 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
11/2 14:16:20 vm2: match_info called
11/2 14:16:20 vm2: Received match <137.57.176.186:2147>#1099203682#159
11/2 14:16:20 vm2: State change: match notification protocol successful
11/2 14:16:20 vm2: Changing state: Unclaimed -> Matched
11/2 14:16:20 DaemonCore: Command received via TCP from host
<137.57.142.51:3349>
11/2 14:16:20 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
11/2 14:16:20 vm2: Request accepted.
11/2 14:16:20 vm2: Remote owner is gquan@xxxxxxxxxx
11/2 14:16:20 vm2: State change: claiming protocol successful
11/2 14:16:20 vm2: Changing state: Matched -> Claimed
11/2 14:16:27 DaemonCore: Command received via TCP from host
<137.57.142.51:3365>
11/2 14:16:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
11/2 14:16:27 vm2: Got activate_claim request from shadow
(<137.57.142.51:3365>)
11/2 14:16:27 vm2: Remote job ID is 3.3
11/2 14:16:27 vm2: Got universe "VANILLA" (5) from request classad
11/2 14:16:27 vm2: State change: claim-activation protocol successful
11/2 14:16:27 vm2: Changing activity: Idle -> Busy
11/2 14:21:23 DaemonCore: Command received via TCP from host
<137.57.176.186:2309>
11/2 14:21:23 DaemonCore: received command 448 (GIVE_STATE), calling
handler (command_give_state)
11/2 14:32:03 DaemonCore: Command received via TCP from host
<137.57.142.51:3699>
11/2 14:32:03 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/2 14:32:03 vm1: Called deactivate_claim_forcibly()
11/2 14:32:03 DaemonCore: Command received via UDP from host
<137.57.142.51:3702>
11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:03 vm1: State change: received RELEASE_CLAIM command
11/2 14:32:03 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
11/2 14:32:03 DaemonCore: Command received via UDP from host
<137.57.142.51:3703>
11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:03 vm1: Got RELEASE_CLAIM while in Preempting state,
ignoring.
11/2 14:32:04 DaemonCore: Command received via UDP from host
<137.57.176.186:2333>
11/2 14:32:04 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/2 14:32:04 Starter pid 1896 exited with status 0
11/2 14:32:13 vm1: State change: starter exited
11/2 14:32:13 vm1: State change: No preempting claim, returning to owner
11/2 14:32:13 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/2 14:32:13 vm1: State change: IS_OWNER is false
11/2 14:32:13 vm1: Changing state: Owner -> Unclaimed
11/2 14:32:47 DaemonCore: Command received via TCP from host
<137.57.142.51:3726>
11/2 14:32:47 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/2 14:32:47 vm2: Called deactivate_claim_forcibly()
11/2 14:32:47 DaemonCore: Command received via UDP from host
<137.57.176.186:2343>
11/2 14:32:47 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/2 14:32:47 Starter pid 1024 exited with status 0
11/2 14:32:54 vm2: State change: starter exited
11/2 14:32:58 vm2: Changing activity: Busy -> Idle
11/2 14:32:58 DaemonCore: Command received via UDP from host
<137.57.142.51:3729>
11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:58 vm2: State change: received RELEASE_CLAIM command
11/2 14:32:58 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
11/2 14:32:58 vm2: State change: No preempting claim, returning to owner
11/2 14:32:58 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
11/2 14:32:58 vm2: State change: IS_OWNER is false
11/2 14:32:58 vm2: Changing state: Owner -> Unclaimed
11/2 14:32:58 DaemonCore: Command received via UDP from host
<137.57.142.51:3730>
11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
11/2 14:32:58 Error: can't find resource with ClaimId
(<137.57.176.186:2147>#1099203682#159)


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users