
Re: [Condor-users] Debugging job restart issues

On Thu, May 1, 2008 at 5:04 PM, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
> Hey Matt,
> Nope. Submits are all brokered by my code and the retirement time for
> all jobs is fixed at 2 weeks in the submit tickets. On the machines it's
> set to 16 weeks. See below for the machine config.


>> Also when did the job enter a retiring state, you didn't
>> include that part of the start log.
> Our jobs get automatically put into the retiring state after 600 seconds
> of execution. This is to ensure the slot is returned and re-negotiated.
> Jobs can only be preempted because of RANK in the first 300 seconds of
> execution, after that they're locked to the machine. In our system:
> So it would have changed to the retiring state on 4/27 16:17:09. And it
> got booted off on 4/28 17:35:17 -- long before any retirement timers
> elapsed. Also worth mentioning that the job was un-preemptable because
> AlteraJobAttributeIsInteractive == True for the job.

Hmm - unless your machine happened to be in Guatemala* I can't think
of an obvious reason.
Any clock funny business? Does the machine have its time controlled by NTP?

* Those dates cover their DST switch
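
For what it's worth, here is a rough sanity check of the timeline quoted above, just to put a number on "long before any retirement timers elapsed". It is only a sketch of the arithmetic: the year is taken from the message date, the job start time is inferred from the 600-second figure, and the variable names are made up for illustration. It does not query Condor at all.

```python
from datetime import datetime, timedelta

# Timestamps from the quoted message; the year 2008 is assumed from the mail date.
retiring_at = datetime(2008, 4, 27, 16, 17, 9)
evicted_at = datetime(2008, 4, 28, 17, 35, 17)

# Jobs are said to enter the retiring state after 600 seconds of execution,
# so the start time is inferred here rather than read from a log.
job_started = retiring_at - timedelta(seconds=600)

# Retirement limits as described: 2 weeks in the submit ticket, 16 weeks on the machine.
job_retirement = timedelta(weeks=2)
machine_retirement = timedelta(weeks=16)

in_retiring_state = evicted_at - retiring_at
total_runtime = evicted_at - job_started

print("time in retiring state:", in_retiring_state)
print("runtime since inferred start:", total_runtime)
print("share of the 2 week limit used: {:.1%}".format(total_runtime / job_retirement))
```

Either way you count it, the job had been running for a little over a day, so neither the 2 week nor the 16 week limit was anywhere close unless the machine's clock jumped.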