Re: [HTCondor-users] Excluding execute nodes after multiple job failures
- Date: Tue, 6 Feb 2018 15:41:50 -0600
- From: Jason Patton <jpatton@xxxxxxxxxxx>
Hi Duncan,
Does this HTCondor wiki article help?
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles
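In case it saves a click, here is a rough sketch of the kind of approach I understand that page to describe: have each job record the machines it recently matched to, then add a requirement that steers it away from them on a retry. The helper name avoid_recent_machines and its parameters are made up for illustration. The submit commands job_machine_attrs and job_machine_attrs_history_length, and the MachineAttrMachine<N> attribute numbering, are from memory and should be checked against the article and the manual for your HTCondor version.

```python
def avoid_recent_machines(avoid=2, history_length=5):
    """Build submit-description lines that keep a job off its recent machines.

    Hypothetical helper. Assumes HTCondor stores past matches in job
    attributes named MachineAttrMachine<N>, and that entries 1..avoid are
    the ones worth excluding (verify the numbering for your version).
    """
    if avoid < 1:
        raise ValueError("avoid must be at least 1")
    if history_length <= avoid:
        raise ValueError("history_length must be larger than avoid")

    # One clause per remembered machine. =!= is the ClassAd "is not" operator,
    # which stays true while the history attribute is still undefined.
    clauses = [
        f"(TARGET.Machine =!= MachineAttrMachine{i})"
        for i in range(1, avoid + 1)
    ]

    return [
        "job_machine_attrs = Machine",
        f"job_machine_attrs_history_length = {history_length}",
        "requirements = " + " && ".join(clauses),
    ]


if __name__ == "__main__":
    # Print a fragment to paste into a submit description file.
    print("\n".join(avoid_recent_machines()))
```

Note that this only keeps an individual job from going back to machines it has already tried. It does not remove a bad node from the pool for other jobs.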
Jason
On Tue, Feb 6, 2018 at 3:38 PM, Duncan Meacher <duncan.meacher@xxxxxxx> wrote:
> Hi all,
>
> I'm wondering whether there is a way to exclude a node from the pool of
> available nodes once a certain number of submitted jobs have failed on it
> within a given time. I've run into this a few times, caused either by a
> node missing some packages or by some other problem with the node.
> In these cases, jobs submitted to the offending node fail and are then
> immediately resubmitted to the same node. This can easily result in a
> large number of jobs being marked as failed after using up all their retries.
>
> Thanks, Duncan
>
> --
> ==========================
>
> Duncan Meacher, PhD
> Postdoctoral Researcher
> Institute for Gravitation and the Cosmos
> Department of Physics
> Pennsylvania State University
> 104 Davey Lab #040
> University Park, PA 16802
> Tel: +1 814 865 3243
> ==========================