
Re: [HTCondor-users] Excluding execute nodes after multiple job failures



Hi Duncan,

Does this HTCondor wiki article help?

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles
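
In case it saves a click, here is a rough sketch of one submit-file approach along those lines. It is written from memory, so please check it against the article and the condor_submit manual page before relying on it. The idea is to have each job record the machines it was recently matched to and refuse to match them again. The history length of 5 and the choice to exclude the two previous machines are illustrative assumptions, not recommendations:

```
# Record the Machine attribute of each slot this job is matched to.
# HTCondor stores the history in the job ad as MachineAttrMachine0,
# MachineAttrMachine1, ... with lower numbers being more recent.
job_machine_attrs = Machine
job_machine_attrs_history_length = 5

# Refuse to match the machines recorded from earlier attempts.
# =!= treats an undefined history entry as "not equal", so the
# first attempt is unrestricted. Which indices to exclude is an
# assumption here; verify against the wiki article.
requirements = (TARGET.Machine =!= MachineAttrMachine1) && \
               (TARGET.Machine =!= MachineAttrMachine2)
```

Note that this only steers an individual job away from machines it has already tried. It does not remove a bad node from the pool for other jobs, which would need to be handled on the execute-node or admin side.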

Jason

On Tue, Feb 6, 2018 at 3:38 PM, Duncan Meacher <duncan.meacher@xxxxxxx> wrote:
> Hi all,
>
> I'm just wondering if there is any way of excluding nodes from the pool of
> available nodes if a certain number of submitted jobs have failed on the
> node within a given time. This is something I've experienced a few times,
> either due to a node missing some packages or some other issue with the node.
> In these cases, jobs submitted to the offending node fail and are then
> immediately re-submitted to the same node. This can easily result in a
> large number of jobs being marked as failed after using all their retries.
>
> Thanks, Duncan
>
> --
> ==========================
>
> Duncan Meacher, PhD
> Postdoctoral Researcher
> Institute for Gravitation and the Cosmos
> Department of Physics
> Pennsylvania State University
> 104 Davey Lab #040
> University Park, PA 16802
> Tel: +1 814 865 3243
> ==========================