
Re: [HTCondor-users] startd job count limit to limit the damage of black holes



Hello Michael,

Thank you for the reply, but I don't see how it helps. Before adding individual new machines to the cluster, I set NUM_SLOTS = 1, which for my purposes has the same effect as your suggestion. It doesn't stop the machine from rapidly draining the queue if the jobs are failing immediately (the drain is slower than with more slots, but still rapid).

-Wayne


Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Betts, Wayne
> Sent: Monday, August 28, 2017 4:37 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] startd job count limit to limit the damage of
> black holes
>
>
> START = (TotalJobsStarted < 2) # where TotalJobsStarted is the missing
> piece that I've yet to find, so am seeking your help.

You can make the startd lie to the negotiator about how many CPU cores the machine has via NUM_CPUS in the configuration, or configure the unproven system with a single static whole-machine slot instead of a partitionable slot or collection of static slots.

With that approach, you wouldn't need to alter the START expression at all.
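
As a rough sketch of those two options, the startd's local configuration on the unproven machine might look something like the following. The specific values are illustrative assumptions, not taken from any particular pool, and the two options are alternatives rather than settings meant to be combined:

```
# Option 1 (assumed value): advertise a single core regardless of the
# real hardware, so the default slot layout offers only one core's worth
# of slots to the negotiator.
NUM_CPUS = 1

# Option 2 (assumed layout): one static, non-partitionable slot that
# owns all of the machine's resources.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%,disk=100%,swap=100%
SLOT_TYPE_1_PARTITIONABLE = FALSE
```

Either way the machine runs at most one job at a time. Neither setting caps the total number of jobs the machine will start over its lifetime.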


-Michael Pelletier.

