Re: [HTCondor-users] condor_defrag only some machines?



Mike,

I also turned off preemption in my pools. In my first two weeks with HTCondor, jobs were sloshing back and forth because I didn't really understand the rank expressions: half the jobs would get preempted at 75% complete, then the other half would reach 75% complete and get preempted by the first half, leading to zero goodput.

However, one pool in particular continues to struggle with the large-job starvation issue. They're managing it manually at the moment, since it's a small group of users ("Fry! Pizza goin' out! COME ON!!"), but I've put some thought into the issue and have come up with a few ideas, one of which I'm hoping I can present at this year's HTCondor Week.

One thing you may consider is setting aside certain machines which outright reject non-whole-machine jobs to keep them available for the large ones. You could set a machine requirement that the job must request at least a certain number of CPUs, for example, to be allowed to match to the machine. You could apply that requirement on a schedule so the machine would go into that mode overnight or on the weekends, and thus would let the small jobs drain out peacefully without eviction. You might even have them go into that mode depending on the state of the job queue, for that matter.
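
As a rough sketch of that idea (untested, and the names LARGE_JOB_MIN_CPUS, IS_WEEKEND, IS_OVERNIGHT and LARGE_ONLY are made up for illustration, as are the 8-core threshold and the 8pm-6am window), the startd config on the reserved machines might look something like this. It assumes START is already defined earlier in the config, and it relies on the ClockDay and ClockMin machine attributes and the job's RequestCpus:

```
# Sketch only: threshold and hours are placeholders, not from a real pool.
LARGE_JOB_MIN_CPUS = 8

# ClockDay: 0 = Sunday ... 6 = Saturday; ClockMin: minutes since midnight.
IS_WEEKEND   = (ClockDay == 0 || ClockDay == 6)
IS_OVERNIGHT = (ClockMin >= 20*60 || ClockMin < 6*60)
LARGE_ONLY   = ($(IS_WEEKEND) || $(IS_OVERNIGHT))

# Outside the window, keep the existing policy; inside it, only start
# jobs that request at least LARGE_JOB_MIN_CPUS cores.
START = ($(START)) && ( !$(LARGE_ONLY) || (TARGET.RequestCpus >= $(LARGE_JOB_MIN_CPUS)) )
```

Since START only affects new matches, small jobs already running when the window opens are left alone and finish on their own, provided preemption is off.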

	-Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Di Domenico
Sent: Friday, January 13, 2017 8:48 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_defrag only some machines?

i've not.  we turned off all preemption some time ago because it was wreaking havoc with our users.  condor was fine, but the users were getting very displeased to see their jobs preempted like 2 mins before the job was to finish.  i'm sure there's probably some tweaking and user training that might correct this, but i'm not sure i can stomach that again