
[HTCondor-users] How to handle a node where jobs are failing



Hi All,

I need help with the following issue.

A PC/node was newly connected to our pool and had a high rank for many tasks (many of them were assigned to this node even though other nodes were free). However, the node was misconfigured by mistake, so none of the tasks could run there. As a result, almost all tasks were repeatedly sent to that node and failed again and again.

My question: is there a configuration option that can, e.g., disconnect a node from the pool after N consecutive job failures there, or make the pool assign a task to a different node once it has failed on the first one? My idea is to decrease the node's rank for the tasks: set a ClassAd attribute every time a task fails and use that value in the RANK expression. But maybe there is a better way that I do not know about.
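
To make the idea concrete, here is a minimal toy sketch in Python of the behaviour I have in mind. It is not HTCondor code and does not use any HTCondor API; the class name, the penalty value, the failure threshold and the node names are all made up for illustration. Each consecutive failure lowers a node's effective rank, and after N consecutive failures the node is skipped entirely.

```python
from collections import defaultdict


class FailureAwareRanker:
    """Toy model (hypothetical, not an HTCondor API) of failure-based ranking."""

    def __init__(self, penalty=10.0, max_consecutive_failures=3):
        self.penalty = penalty  # rank points subtracted per consecutive failure
        self.max_consecutive_failures = max_consecutive_failures  # the "N" above
        self.failures = defaultdict(int)  # node name -> consecutive failure count

    def record(self, node, succeeded):
        # A success resets the streak; a failure extends it.
        if succeeded:
            self.failures[node] = 0
        else:
            self.failures[node] += 1

    def is_excluded(self, node):
        # The node is treated as "disconnected" after N consecutive failures.
        return self.failures[node] >= self.max_consecutive_failures

    def effective_rank(self, node, base_rank):
        # The base rank is reduced by a fixed penalty for each consecutive failure.
        return base_rank - self.penalty * self.failures[node]

    def pick(self, base_ranks):
        # Choose the non-excluded node with the highest effective rank.
        candidates = {
            node: self.effective_rank(node, rank)
            for node, rank in base_ranks.items()
            if not self.is_excluded(node)
        }
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


if __name__ == "__main__":
    ranker = FailureAwareRanker(penalty=10.0, max_consecutive_failures=3)
    base_ranks = {"new-node": 25.0, "old-node": 10.0}  # hypothetical nodes
    for attempt in range(4):
        chosen = ranker.pick(base_ranks)
        print(attempt, chosen)
        # Pretend every job on the misconfigured node fails.
        ranker.record(chosen, succeeded=(chosen != "new-node"))
```

With these made-up numbers, the misconfigured node keeps its advantage after one failure, loses it after the second, and is excluded after the third, so the remaining tasks go to the working node.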

Thank you in advance!

Masaj