
Re: [HTCondor-users] Faulty node and idle state



Hi Mark,

Thanks a lot for the answer and the suggestions. Of course, the faulty node was removed as soon as I was able to diagnose the origin of the problem, and I made the necessary corrections to bring it back into the pool properly, which resolved all of my idle jobs.

My question was more about preventing this kind of behaviour, where a node keeps claiming that it can take the job and starts it many times, but never delivers anything other than an idle state. I was thinking of a configuration on the scheduler side so that, whenever it sees a job remaining idle on a node, it asks for at least one more execution attempt on another appropriate machine, i.e. it performs two trials on two different machines before leaving the job idle; something like the sketch below is what I have in mind.
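Even an approximation in the submit description file would help. A rough sketch of what I mean (assuming HTCondor's documented job_machine_attrs mechanism; the history length of 2 matches the two trials above):

# Record the Machine attribute of the last two machines the job started on
job_machine_attrs = Machine
job_machine_attrs_history_length = 2
# Refuse to match machines already tried; MachineAttrMachine0/1 are
# undefined before the first start, so =!= still evaluates to True then
requirements = (TARGET.Machine =!= MachineAttrMachine0) && (TARGET.Machine =!= MachineAttrMachine1)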

The pool contains a bit more than 500 slots, on approximately 50 different machines.

Best regards,
Xavier


On 27/09/2021 22:48, Mark Coatsworth wrote:
Hi Xavier,

It sounds like you've stumbled on the "HTCondor Black Hole" problem,
which has come up several times before. I'm not sure if we have a
clear solution to it. I think this largely depends on your cluster
size and configuration.

One option is to set a START expression on the failing machine:

STARTD.STATISTICS_TO_PUBLISH_LIST = JobDuration JobBusyTime
START = RecentJobBusyTimeAvg is Undefined || RecentJobBusyTimeAvg > $(MIN_JOB_TIME)

And set MIN_JOB_TIME to whatever you consider a reasonable minimum job
time, maybe 60 seconds?
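
So the machine's configuration would look something like this (MIN_JOB_TIME is just a local macro name here, not a built-in knob, and 60 is an arbitrary choice):

# Have the startd publish recent job duration / busy-time statistics
STARTD.STATISTICS_TO_PUBLISH_LIST = JobDuration JobBusyTime
# Local macro (not a built-in knob): shortest runtime we consider healthy, in seconds
MIN_JOB_TIME = 60
# Refuse new jobs once recent jobs have been finishing suspiciously fast
START = RecentJobBusyTimeAvg is Undefined || RecentJobBusyTimeAvg > $(MIN_JOB_TIME)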

Another solution is to use a requirements expression, although this
can be inefficient in larger pools. There's some information and an
example on this page:
https://research.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorHowTo#howto_failing
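
As a trivial version of that approach (a sketch, not the example from that page; "badnode.example.org" is a placeholder hostname), the job's requirements can simply refuse the machine known to misbehave:

# Placeholder hostname: never match against the known-faulty node
requirements = (TARGET.Machine =!= "badnode.example.org")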

Maybe if you can describe your setup I'll think of something else. How
big is your pool? Why are you keeping that node around at all if it's
faulty?

Mark


On Fri, Sep 24, 2021 at 1:59 AM Xavier OUVRARD <xavier.ouvrard@xxxxxxx> wrote:
Dear all,

I encountered a (since solved) problem with a faulty compute node that
had trouble being reached by the scheduler, but was still able to
confirm acceptance of jobs to the central manager, which runs on
another machine.

The job stayed stuck in the idle state, and looking at the scheduler
log, it was resubmitted to the same node over and over for hours.
Hence, I was wondering whether there is a way to avoid this kind of
behaviour in the configuration of the scheduler / central manager,
i.e. for the scheduler to ask the central manager for another node
once a job has been sitting idle for a while without starting, while
the same node keeps responding to the central manager.
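
Something like a scheduler-side safety net is what I imagine (a rough sketch; SYSTEM_PERIODIC_HOLD, JobStatus and NumShadowStarts are documented HTCondor names, but I am unsure this exact expression fits, and the threshold of 3 is arbitrary):

# Hold jobs that are back to idle (JobStatus == 1) after the schedd has
# already tried to start them several times, so they stop cycling
SYSTEM_PERIODIC_HOLD = (JobStatus == 1) && (NumShadowStarts > 3)
SYSTEM_PERIODIC_HOLD_REASON = "Started repeatedly without running; possible faulty node"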

HTCondor version is 8.8.15-1

Best regards,

Xavier





--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison


--
Xavier Ouvrard-Brunet, Ph.D.
(RP Cluster administrator - HSE-RP-CS) @ CERN
Office 892/2A-12, Prévessin-Moëns site
CERN, Esplanade des Particules, 1
CH-1211 Geneva 23
Mobile: +41 75 411 12 01
Tél: +41 22 766 38 92
Personal research page:
www.infos-informatique.net