
Re: [HTCondor-users] Jobs matching to claimed slots



On Fri, Aug 7, 2015 at 5:52 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

> Does the negotiator send out match info?
> Without that, if the schedd does not claim the slot by the time the next negotiation cycle occurs, a second schedd can get a claim to the same startd.
>
Yeah, we have NEGOTIATOR_INFORM_STARTD set to true. From what I've
been able to see, it's not two schedds racing to the startd. The
existing job has been running for an hour or more when the interloper
comes in.

> Even with preemption disabled in the negotiator, the startd thinks it is a preemption.
>
My completely unfounded guess is that it's a bug in the startd that
occasionally gets triggered if the RANK value isn't explicitly set. Of
course, that doesn't explain why the job ends up on a claimed slot to
begin with. Now that I'm back from vacation, I'll try pushing for more
log info.
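
For anyone who wants to test that guess, here is a minimal sketch of
the two settings discussed above. NEGOTIATOR_INFORM_STARTD and RANK are
real HTCondor configuration macros, but the constant RANK value is only
an assumption picked for illustration, and nothing here is confirmed to
fix the problem.

```
# Negotiator side: tell the startd about each match (already set at our site)
NEGOTIATOR_INFORM_STARTD = True

# Startd side: set RANK explicitly instead of leaving it unset.
# Assumption: a constant, so the startd prefers no job over another.
RANK = 0
```

If the interlopers stop appearing with RANK set explicitly, that would
point toward the startd rather than the negotiator.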


> In CMS land, this happens to about 0.2% of slots - or about 200 cores at any given time.
>
My sense is that we're seeing this at a higher rate, but in clumps.
Perhaps due to scheduler load or some other contributing factor?


Thanks,
BC

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing