
Re: [HTCondor-users] Faulty node and idle state



Hi Xavier,

It sounds like you've stumbled on the "HTCondor Black Hole" problem,
which has come up several times before. I'm not sure if we have a
clear solution to it. I think this largely depends on your cluster
size and configuration.

One option is to set a START expression on the failing machine:

STARTD.STATISTICS_TO_PUBLISH_LIST = JobDuration JobBusyTime
START = RecentJobBusyTimeAvg is Undefined || RecentJobBusyTimeAvg > $(MIN_JOB_TIME)

And set MIN_JOB_TIME to whatever you consider a reasonable minimum job
time, maybe 60 seconds?
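
Put together, the startd configuration might look like the sketch below. The
60-second threshold is only the example value suggested above, not a
recommendation, and the comments are my reading of what each line is for.
After editing the configuration, run condor_reconfig on that machine so the
startd picks up the change.

# Example threshold in seconds (assumed value; tune it to your shortest real jobs)
MIN_JOB_TIME = 60

# Publish the job statistics that the START expression refers to
STARTD.STATISTICS_TO_PUBLISH_LIST = JobDuration JobBusyTime

# Accept jobs while no recent average exists yet, or while recent jobs
# have on average run longer than the threshold
START = RecentJobBusyTimeAvg is Undefined || RecentJobBusyTimeAvg > $(MIN_JOB_TIME)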

Another solution is to use a requirements expression, although this
can be inefficient in larger pools. There is some information and an
example here:
https://research.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorHowTo#howto_failing
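
As a rough illustration of the requirements approach (not necessarily the
same example as on that page), a job can exclude a known-bad machine by
name in its submit description. The hostname below is hypothetical:

# Submit description fragment; "badnode.example.com" is a placeholder hostname
requirements = (Machine =!= "badnode.example.com")

This has to be added to every job and kept up to date by hand, which is
part of why it scales poorly.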

Maybe if you can describe your setup I'll think of something else. How
big is your pool? Why are you keeping that node around at all if it's
faulty?

Mark


On Fri, Sep 24, 2021 at 1:59 AM Xavier OUVRARD <xavier.ouvrard@xxxxxxx> wrote:
>
> Dear all,
>
> I encountered a (now solved) problem with a faulty compute node that the
> scheduler had trouble reaching, but that was still able to confirm
> acceptance of the job to the central manager, which is on another
> machine.
>
> The job was stuck in the idle state and, looking at the scheduler log, it
> was resubmitted to the same node again and again for hours. Hence, I was
> wondering whether the scheduler / central manager can be configured to
> avoid this kind of behaviour, i.e. so that the scheduler asks the central
> manager for another node once a job has stayed idle for a while without
> starting and the same node keeps responding to the central manager.
>
> HTCondor version is 8.8.15-1
>
> Best regards,
>
> Xavier



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison