[HTCondor-users] Excluding execute nodes after multiple job failures



Hi all,

I'm wondering whether there is any way to exclude a node from the pool of available nodes once a certain number of submitted jobs have failed on it within a given time. I've run into this a few times, either because a node was missing some packages or because of some other problem with the node. In these cases, jobs sent to the offending node fail and are then immediately resubmitted to the same node. This can easily result in a large number of jobs being marked as failed after using up all their retries.

Thanks, Duncan

--
==========================

Duncan Meacher, PhD
Postdoctoral Researcher
Institute for Gravitation and the Cosmos
Department of Physics
Pennsylvania State University
104 Davey Lab #040
University Park, PA 16802
Tel: +1 814 865 3243
==========================