
Re: [Condor-users] 6.8.6 -> 7.0.5 on Windows taking a long time to vacate jobs



> I'm in the process of moving Windows machines from 6.8.6 to
> 7.0.5 and I was noting issues with thrashing and my startd
> RANK policy on machines running 7.0.5. It appears that 7.0.5
> takes a very long time to preempt running jobs when a higher
> startd RANK job comes along. I can switch in 6.8.6 for 7.0.5
> and the same jobs take only a minute to preempt but when I
> move to 7.0.5 the jobs take >10 minutes to preempt.
>
> In the time it takes to preempt jobs on 7.0.5-based machines
> the waiting jobs give up their claim.
>
> I tried increasing REQUEST_CLAIM_TIMEOUT from 900 to 1200
> seconds but it didn't make a difference. It's not desirable
> for my preemption policy to push that number too much higher.
>
> Has something changed from 6.8.6 to 7.0.5 in the way Condor
> is killing jobs when they're preempted? I'm wondering why
> this operation takes so much longer in 7.0.5 than it did in
> 6.8.6. These are plain vanilla universe jobs. So no checkpointing.
>
> Actually, if I change REQUEST_CLAIM_TIMEOUT and do a
> 'condor_reconfig -full -all' does it apply to newly spawned
> shadows or do I have to restart Condor completely on my
> schedulers for this to take effect?

Digging around a bit it could be related to:

http://www.cs.wisc.edu/condor/manual/v7.0/3_6Security.html#sec:RunAsNobody

I have:

SLOT1_USER=ALTERA\cndrusr1
SLOT2_USER=ALTERA\cndrusr2
SLOT3_USER=ALTERA\cndrusr3
SLOT4_USER=ALTERA\cndrusr4
SLOT5_USER=ALTERA\cndrusr5
SLOT6_USER=ALTERA\cndrusr6
SLOT7_USER=ALTERA\cndrusr7
SLOT8_USER=ALTERA\cndrusr8

But I set:

DEDICATED_EXECUTE_ACCOUNT_REGEXP = cndrusr[0-9]+

Should I have included the domain in that regexp?

DEDICATED_EXECUTE_ACCOUNT_REGEXP = ALTERA\\cndrusr[0-9]+
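One rough way to reason about the two patterns is to try them against the domain-qualified account names outside of Condor. The sketch below uses Python's re module purely as a stand-in. Whether Condor anchors the regexp against the whole account name or only searches within it, and whether the name it compares includes the domain, are both assumptions here and not something this test can confirm. The helper name matches() is hypothetical.

```python
import re

# Account names as written in the SLOTn_USER settings (domain-qualified).
accounts = [rf"ALTERA\cndrusr{n}" for n in range(1, 9)]

# The two candidate values for DEDICATED_EXECUTE_ACCOUNT_REGEXP.
# Assumption: the config value reaches the regex engine unchanged.
bare = r"cndrusr[0-9]+"
qualified = r"ALTERA\\cndrusr[0-9]+"  # \\ matches one literal backslash


def matches(pattern, name, anchored):
    """Return True if pattern matches name.

    anchored=True requires the whole name to match (full match);
    anchored=False accepts a match anywhere in the name (search).
    Which of these Condor uses is NOT known here; both are shown.
    """
    fn = re.fullmatch if anchored else re.search
    return fn(pattern, name) is not None


for label, pattern in (("bare", bare), ("qualified", qualified)):
    for anchored in (False, True):
        ok = all(matches(pattern, a, anchored) for a in accounts)
        mode = "anchored" if anchored else "unanchored"
        print(f"{label:9s} {mode:10s} matches all accounts: {ok}")
```

If the match is unanchored, both patterns cover the domain-qualified names. If it is anchored against the full ALTERA\cndrusrN string, only the domain-qualified pattern does.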

My machines report that USE_PROCD is undefined (and it is undefined in
my configs), but condor_procd is starting up anyway. Does that mean
condor_startd is using condor_procd to track and kill processes on my
Windows machines? Could that be my problem, that the procd is doing
this work?

- Ian
