Re: [HTCondor-users] What if an active pool PC disappears from the HTCondor radar?

On Wed, Oct 29, 2014 at 9:20 AM, Stub <spamrefuse@xxxxxxxxx> wrote:

> What parameters on the HTCondor master determine how to handle such a case?
The CLASSAD_LIFETIME setting on the collector determines how quickly
it forgets about ClassAds. For environments where machines are wont to
go away without warning, a smaller value may be beneficial (just make
sure it's longer than UPDATE_INTERVAL or you'll lose machines you
shouldn't). If your collector and network can handle the extra traffic,
you might want to make UPDATE_INTERVAL 120 seconds and

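To make the "longer than UPDATE_INTERVAL" constraint concrete, here is a
minimal Python sketch. The function name updates_per_lifetime is made up
for illustration and is not part of HTCondor; the 900 and 300 second
figures are what I believe the stock defaults to be, so check your own
configuration before relying on them.

```python
def updates_per_lifetime(update_interval: int, classad_lifetime: int) -> float:
    """Return how many update intervals fit into one ClassAd lifetime.

    Hypothetical helper for illustration only, not an HTCondor API.
    Both arguments are in seconds.
    """
    if update_interval <= 0 or classad_lifetime <= 0:
        raise ValueError("both values must be positive")
    if classad_lifetime <= update_interval:
        # The collector would expire ads of machines that are still alive.
        raise ValueError("CLASSAD_LIFETIME must be longer than UPDATE_INTERVAL")
    return classad_lifetime / update_interval


if __name__ == "__main__":
    # Assumed defaults: UPDATE_INTERVAL = 300, CLASSAD_LIFETIME = 900.
    print(updates_per_lifetime(300, 900))
    # UPDATE_INTERVAL lowered to 120 as suggested above, lifetime unchanged.
    print(updates_per_lifetime(120, 900))
```

A larger ratio leaves more room for lost updates before a live machine is
dropped; a ratio close to 1 forgets dead machines sooner but is less
forgiving of a delayed or lost update.
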
> 1) I have noticed that my HTCondor master seems to wait for a certain amount of time, but then decides to give up on the job and restart it elsewhere.
> 2) I also have noticed that this "wait time until giving up" is added to the HTCondor RUN_TIME value, although the job has not made any progress during that time; the log file then has one "ExecuteEvent" followed immediately by the next "ExecuteEvent", without suspension or checkpointing. Obviously in that case the value of RUN_TIME ends up too large! Could this be a bug in HTCondor?
The advice above won't help with running jobs, unfortunately.
Interestingly, I see in the manual how a startd can give up on a
schedd, but I'm not seeing much about the other direction (a schedd
giving up on a startd). Surely I'm just missing it.


Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

twitter: @cyclecomputing