
Re: [HTCondor-users] Jobs matching to claimed slots



Does the negotiator send out match info?

Without that, if the schedd does not claim the slot by the time the next negotiation cycle occurs, a second schedd can get a claim to the same startd.

Even with preemption disabled in the negotiator, the startd treats that second claim as a preemption.
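
To make the suspected race concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not HTCondor code: the function name `negotiate`, the two schedds "A" and "B", and the `claim_delay` parameter are all made up for illustration, and it assumes the negotiator sees the slot as available until the first schedd has finished claiming it.

```python
def negotiate(cycles, claim_delay):
    """Toy model of one slot and two schedds, "A" and "B" (hypothetical).

    In each negotiation cycle the negotiator hands the slot to the next
    schedd in line, as long as the slot does not yet look claimed.
    Schedd A is matched in cycle 0 and its claim only becomes visible
    `claim_delay` cycles later. Returns the schedds that were given a
    match for the same slot.
    """
    candidates = ["A", "B"]
    matched = []
    claimed_at = None  # cycle at which A's claim becomes visible
    for cycle in range(cycles):
        if claimed_at is not None and cycle >= claimed_at:
            break  # slot now shows as claimed, so no further matches
        if len(matched) < len(candidates):
            matched.append(candidates[len(matched)])
            if claimed_at is None:
                claimed_at = cycle + claim_delay
    return matched


# A claims before the next cycle: only A is matched.
print(negotiate(cycles=3, claim_delay=1))
# A is still busy when the next cycle runs: B is matched to the same slot.
print(negotiate(cycles=3, claim_delay=2))
```

In this model the double match appears only when the first schedd takes longer than one negotiation cycle to claim, which is why a busy schedd is the interesting case.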

In CMS land, this happens to about 0.2% of slots - or about 200 cores at any given time.

At least, this is what I think is happening.  I haven't had the time to dig deeply enough to confirm it.

If your schedds are not very busy, then you might have to look for a different explanation.
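
One way to check how often this is happening would be to count the REFUSED messages per slot. The following Python sketch assumes each message sits on a single line in the log, in the same shape as the REFUSED line quoted below (which looks like a shadow log entry, though that is a guess); the names `REFUSED_RE` and `count_refusals` are made up for this example.

```python
import re
from collections import Counter

# Assumed line shape, taken from the REFUSED message quoted in this thread:
# "MM/DD/YY HH:MM:SS (cluster.proc) (pid): Request to run on <slot> <addr> was REFUSED"
REFUSED_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \(\d+\): Request to run on "
    r"(?P<slot>\S+) <(?P<addr>[^>]+)> was REFUSED"
)


def count_refusals(lines):
    """Return a Counter mapping slot name -> number of REFUSED requests."""
    counts = Counter()
    for line in lines:
        match = REFUSED_RE.match(line.strip())
        if match:
            counts[match.group("slot")] += 1
    return counts


sample = [
    "07/30/15 09:10:28 (25441.71) (23116): Request to run on "
    "slot6@ip-0ABC6265 <10.188.98.101:49288> was REFUSED",
    "07/30/15 09:10:29 some unrelated log line",
]
print(count_refusals(sample))
```

If the refusals cluster at times when the schedds are busiest, that would be consistent with the race described above; if they are spread evenly, it points to a different explanation.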

Brian


> On Aug 7, 2015, at 3:43 PM, Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
> 
> We've been seeing some interesting behavior at a customer site (I
> don't have direct access to the environment, so log availability is
> spotty) where it appears that jobs are matched to a slot that is
> currently running a job. The incoming job tries to start, but the
> claim is rejected:
> 
> 07/30/15 09:10:28 (25441.71) (23116): Request to run on slot6@ip-0ABC6265 <10.188.98.101:49288> was REFUSED
> 
> There is no corresponding entry in the StartLog or StarterLog.slotX,
> but in every case, about a minute later the job exits:
> 
> (StarterLog.slot6)
> 07/30/15 09:11:49 (pid:4432) Process exited, pid=704, status=10
> 07/30/15 09:11:49 (pid:4432) Got SIGQUIT.  Performing fast shutdown.
> 07/30/15 09:11:49 (pid:4432) ShutdownFast all jobs.
> 07/30/15 09:11:49 (pid:4432) **** condor_starter (condor_STARTER) pid 4432 EXITING WITH STATUS 0
> 
> (StartLog)
> 07/30/15 09:11:49 slot6: Called deactivate_claim_forcibly()
> 07/30/15 09:11:49 slot6: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> 07/30/15 09:11:49 Starter pid 4432 exited with status 0
> 07/30/15 09:11:49 slot6: State change: starter exited
> 07/30/15 09:11:49 slot6: State change: No preempting claim, returning to owner
> 07/30/15 09:11:49 slot6: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 07/30/15 09:11:49 slot6: State change: IS_OWNER is false
> 07/30/15 09:11:49 slot6: Changing state: Owner -> Unclaimed
> 07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) for 444 (ACTIVATE_CLAIM)
> 07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) -- perhaps this claim was already removed?
> 07/30/15 09:11:49 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)
> 07/30/15 09:11:49 Error: can't find resource with ClaimId (<10.188.98.101:49288>#1438223073#165#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
> 
> 
> NEGOTIATOR_CONSIDER_PREEMPTION is False on the central manager and
> PREEMPT and PREEMPTION_REQUIREMENTS are False on the execute nodes.
> RANK is not set. HTCondor 8.2.7 on all daemons. The execute nodes are
> Windows Server 2008.
> 
> 
> Thanks,
> BC
> 
> -- 
> Ben Cotton
> main: 888.292.5320
> 
> Cycle Computing
> Better Answers. Faster.
> 
> http://www.cyclecomputing.com
> twitter: @cyclecomputing