Re: [HTCondor-users] What if an active pool PC disappears from the HTCondor radar?
- Date: Thu, 6 Nov 2014 13:03:21 -0500
- From: Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] What if an active pool PC disappears from the HTCondor radar?
On Wed, Oct 29, 2014 at 9:20 AM, Stub <spamrefuse@xxxxxxxxx> wrote:
> What parameters on the HTCondor master determine how to handle such a case?
The CLASSAD_LIFETIME configuration variable on the collector determines
how long the collector keeps a ClassAd that has not been updated before
forgetting it. For environments where machines are wont to go away
without warning, a smaller value may be beneficial. Just make sure it's
longer than UPDATE_INTERVAL, or the collector will drop machines that
are still alive. If your collector and network can handle the extra
traffic, you might want to set UPDATE_INTERVAL to 120 seconds and
CLASSAD_LIFETIME to 240 seconds.
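As a rough sketch, those two values would look something like this in the
HTCondor configuration. Which local config file each line goes in is an
assumption about your setup; the point is only that the lifetime stays
comfortably above the update interval (here, twice as long), so a single
late or lost update does not make a healthy machine vanish from the pool.

```
# On the execute machines: how often (in seconds) the startd
# sends a fresh ClassAd to the collector.
UPDATE_INTERVAL = 120

# On the central manager: how long (in seconds) the collector keeps
# a ClassAd that has not been refreshed. Keep this > UPDATE_INTERVAL.
CLASSAD_LIFETIME = 240
```

The daemons need to re-read their configuration before the new values
take effect.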
> 1) I have noticed that my HTCondor master seems to wait for a certain amount of time, but then decides to give up on the job and restart it elsewhere.
> 2) I also have noticed that this "wait time until giving up" is added to the HTCondor RUN_TIME value, although the job has not made any progress during that time; the log file then has one "ExecuteEvent" followed immediately by the next "ExecuteEvent", without suspension or checkpointing...... Obviously in that case the value of RUN_TIME gets wrongly too big! Could this be a bug in HTCondor?
The advice above won't help with running jobs, unfortunately.
Interestingly, I see in the manual where a startd can give up on a
scheduler, but I'm not seeing much the other way around. Surely I'm
just missing it.