
[Condor-users] 6.8.6 -> 7.0.5 on Windows taking a long time to vacate jobs



I'm in the process of moving Windows machines from 6.8.6 to 7.0.5, and I've
been noticing issues with thrashing and my startd RANK policy on machines
running 7.0.5. It appears that 7.0.5 takes a very long time to preempt
running jobs when a job with a higher startd RANK comes along. If I swap
6.8.6 back in for 7.0.5, the same jobs take only a minute to preempt, but
on 7.0.5 the jobs take more than 10 minutes to preempt.

In the time it takes to preempt jobs on 7.0.5-based machines the waiting
jobs give up their claim.

I tried increasing REQUEST_CLAIM_TIMEOUT from 900 to 1200 seconds, but it
didn't make a difference. It's not desirable for my preemption policy to
push that number much higher.
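
For reference, the change described above amounts to a one-line setting
along these lines on the scheduler machines. This is only a sketch: the
values are the ones quoted above, and which config file the line lives in
depends on the installation.

```
# Seconds to wait for a claim request to be answered before giving up.
# Previously 900; raised to 1200 for this test.
REQUEST_CLAIM_TIMEOUT = 1200
```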

Has something changed from 6.8.6 to 7.0.5 in the way Condor kills
jobs when they're preempted? I'm wondering why this operation takes so
much longer in 7.0.5 than it did in 6.8.6. These are plain vanilla
universe jobs, so there is no checkpointing.

Actually, if I change REQUEST_CLAIM_TIMEOUT and do a 'condor_reconfig
-full -all', does it apply to newly spawned shadows, or do I have to
restart Condor completely on my schedulers for this to take effect?

- Ian
