
[HTCondor-users] Jobs matching to claimed slots



We've been seeing some odd behavior at a customer site (I don't have
direct access to the environment, so log availability is spotty): it
appears that jobs are being matched to a slot that is already running
a job. The incoming job tries to start, but the claim is rejected:

07/30/15 09:10:28 (25441.71) (23116): Request to run on slot6@ip-0ABC6265 <10.188.98.101:49288> was REFUSED

There is no corresponding entry for the refusal in the StartLog or
StarterLog.slotX, but in every case, about a minute later the job on
that slot exits:

(StarterLog.slot6)
07/30/15 09:11:49 (pid:4432) Process exited, pid=704, status=10
07/30/15 09:11:49 (pid:4432) Got SIGQUIT.  Performing fast shutdown.
07/30/15 09:11:49 (pid:4432) ShutdownFast all jobs.
07/30/15 09:11:49 (pid:4432) **** condor_starter (condor_STARTER) pid 4432 EXITING WITH STATUS 0

(StartLog)
07/30/15 09:11:49 slot6: Called deactivate_claim_forcibly()
07/30/15 09:11:49 slot6: Changing state and activity: Claimed/Busy -> Preempting/Vacating
07/30/15 09:11:49 Starter pid 4432 exited with status 0
07/30/15 09:11:49 slot6: State change: starter exited
07/30/15 09:11:49 slot6: State change: No preempting claim, returning to owner
07/30/15 09:11:49 slot6: Changing state and activity: Preempting/Vacating -> Owner/Idle
07/30/15 09:11:49 slot6: State change: IS_OWNER is false
07/30/15 09:11:49 slot6: Changing state: Owner -> Unclaimed
07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) for 444 (ACTIVATE_CLAIM)
07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) -- perhaps this claim was already removed?
07/30/15 09:11:49 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)
07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
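
To put a number on "about a minute", the timestamps in the excerpts can be
compared with a short script. This is only a sketch in Python, not something
run at the site: the function names are made up for illustration, the sample
lines are the excerpts above with the mail wrapping undone, and it assumes the
clocks on the two machines agree closely enough for the difference to mean
anything.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the excerpts, e.g. "07/30/15 09:10:28".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")


def parse_line(line):
    """Return (datetime, message), or None if the line has no leading timestamp."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return stamp, match.group(2)


def gap_seconds(refusal_lines, startd_lines):
    """Seconds from the first REFUSED line to the next forced deactivation.

    Returns None if either event is missing.
    """
    refused = None
    for line in refusal_lines:
        parsed = parse_line(line)
        if parsed is not None and "was REFUSED" in parsed[1]:
            refused = parsed[0]
            break
    if refused is None:
        return None
    for line in startd_lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if "deactivate_claim_forcibly" in message and stamp >= refused:
            return (stamp - refused).total_seconds()
    return None


# Sample input: lines copied from the excerpts above, unwrapped.
refusal_log = [
    "07/30/15 09:10:28 (25441.71) (23116): Request to run on "
    "slot6@ip-0ABC6265 <10.188.98.101:49288> was REFUSED",
]
start_log = [
    "07/30/15 09:11:49 slot6: Called deactivate_claim_forcibly()",
    "07/30/15 09:11:49 Starter pid 4432 exited with status 0",
]

gap = gap_seconds(refusal_log, start_log)
print(gap)  # 81.0 for the excerpts above
```

For this one occurrence the gap is 81 seconds. Running the same comparison
over every refusal would show whether the delay is constant, which might hint
at a timeout rather than a coincidence.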


NEGOTIATOR_CONSIDER_PREEMPTION is False on the central manager, and
PREEMPT and PREEMPTION_REQUIREMENTS are False on the execute nodes.
RANK is not set. All daemons are running HTCondor 8.2.7. The execute
nodes are Windows Server 2008.
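
Since I can't log in to check, one way to confirm those settings is to have
someone on site dump the configuration of each machine as "NAME = value"
text and compare it against the list above. The following is a rough Python
sketch of that comparison. The helper names and the sample text are
hypothetical (the sample is written to match the description above, not
copied from the site), and the parsing only handles simple one-line
assignments.

```python
def parse_config(text):
    """Parse simple 'NAME = value' lines into a dict keyed by upper-cased name."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not an assignment
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()
    return settings


def unexpected_settings(settings, expected):
    """Return {name: actual value} for every knob that differs from expected.

    An expected value of None means the knob should not be set at all.
    Values are compared case-insensitively.
    """
    problems = {}
    for name, want in expected.items():
        got = settings.get(name.upper())
        if want is None:
            if got is not None:
                problems[name] = got
        elif got is None or got.lower() != want.lower():
            problems[name] = got
    return problems


# Settings as described above.
CENTRAL_MANAGER_EXPECTED = {"NEGOTIATOR_CONSIDER_PREEMPTION": "False"}
EXECUTE_EXPECTED = {
    "PREEMPT": "False",
    "PREEMPTION_REQUIREMENTS": "False",
    "RANK": None,
}

# Hypothetical dump from an execute node, for illustration only.
execute_sample = """
# preemption knobs
PREEMPT = False
PREEMPTION_REQUIREMENTS = False
"""

print(unexpected_settings(parse_config(execute_sample), EXECUTE_EXPECTED))
```

An empty dict means the node matches the description; anything else names
the knob that differs and the value actually found.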


Thanks,
BC

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing