
Re: [HTCondor-users] condor_defrag only some machines?



i've been running with defrag for a few days now and seem to have two problems:

1. i'm seeing job evictions.  i set the max vacate and max job
retirement parameters to 24hrs, but i'm still seeing evictions after
10mins (the defrag cycle time).  is there some other setting that will
prevent defrag from evicting a job?  we have no preemption or eviction
requirements and try very hard not to kill jobs;
preempt/suspend/etc are all set to false.
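
for reference, this is roughly the relevant config (just a sketch; i'm
assuming the knobs i mean are the standard MaxJobRetirementTime /
MachineMaxVacateTime startd settings, with values in seconds):

# startd side: let running jobs retire for up to 24 hours, never
# preempt or suspend on our own
MaxJobRetirementTime = 86400
MachineMaxVacateTime = 86400
PREEMPT = False
WANT_SUSPEND = False
SUSPEND = False

# defrag daemon: 10 minute cycle, which is also the default
DEFRAG_INTERVAL = 600

my understanding is that defrag's default (graceful) drain is supposed
to honor MaxJobRetirementTime, which is why the evictions after 10
minutes surprise me.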

2. defrag seems to be trying to drain nodes for which there are no
matching jobs in the queue.  for instance, there are jobs in the pool
currently that require 16 cores and >100GB of memory, but it's trying
to defrag nodes that have at most 8 cores and less than 100GB of
memory.

i suspect this is because the defrag daemon is running with its
defaults (DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput, and
the stock DEFRAG_REQUIREMENTS), which don't take into account which
nodes actually need to be defragged.  is there some other setting that
will only defrag a node if it's capable of running a job in the queue,
and otherwise just leave it alone?
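
something along these lines is what i'm imagining (just a sketch; the
16-core / 100GB numbers come from the example above, TotalCpus and
TotalMemory are the usual machine ad attributes with memory in MB, and
the first two terms are roughly what the stock DEFRAG_REQUIREMENTS
already contains):

# only consider draining partitionable machines big enough for the
# large jobs; smaller nodes are left alone
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && \
                      TotalCpus >= 16 && TotalMemory >= 102400

of course that only filters on static machine properties; it still
doesn't look at what's actually sitting in the queue.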



On Fri, Jan 13, 2017 at 10:53 AM, Michael Pelletier
<Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
> Mike,
>
> I also turned off preemption in my pools - I had a situation in the first two weeks of my HTCondor existence where jobs were sloshing back and forth because I didn't really understand the rank expressions - half the jobs would get preempted at 75% complete, then the other half of the jobs would get to 75% complete and then get preempted by the first half, leading to zero goodput.
>
> However, one pool in particular continues to struggle with the large-job starvation issue. They're managing it manually at the moment, since it's a small group of users ("Fry! Pizza goin' out! COME ON!!"), but I've put some thought into the issue and have come up with a few ideas, one of which I'm hoping I can present at this year's HTCondor Week.
>
> One thing you may consider is setting aside certain machines which outright reject non-whole-machine jobs to keep them available for the large ones. You could set a machine requirement that the job must request at least a certain number of CPUs, for example, to be allowed to match to the machine. You could apply that requirement on a schedule so the machine would go into that mode overnight or on the weekends, and thus would let the small jobs drain out peacefully without eviction. You might even have them go into that mode depending on the state of the job queue, for that matter.
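>
> A minimal sketch of that idea, assuming the standard startd START expression and the ClockDay attribute (the 16-core threshold and the weekend window are purely illustrative):
>
> # big-jobs-only mode: only accept jobs asking for at least 16 cores
> BIG_JOB = (TARGET.RequestCpus >= 16)
> # enable that mode on weekends (ClockDay: 0 = Sunday, 6 = Saturday)
> WEEKEND = (ClockDay == 0 || ClockDay == 6)
> START = $(START) && ( $(BIG_JOB) || !($(WEEKEND)) )
>
> Since START only affects new matches, the small jobs already running finish normally and simply aren't replaced, so the machine empties out without evicting anyone.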
>
>         -Michael Pelletier.
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Di Domenico
> Sent: Friday, January 13, 2017 8:48 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] condor_defrag only some machines?
>
> i've not.  we turned off all preemption some time ago because it was wreaking havoc with our users.  condor was fine, but the users were getting very displeased to see their jobs preempted like 2 mins before the job was due to finish.  i'm sure some tweaking and user training might correct this, but i'm not sure i can stomach that again.
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/