
Re: [Condor-users] Debugging job restart issues



Hey Matt, 

> On Tue, Apr 29, 2008 at 8:03 PM, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
> > I'm trying to figure out why jobs are being dropped from starters and
> > restarted elsewhere on one of my farms. The scenario is as follows:
> 
> Did the submit accidentally set the requested retirement time to some
> tiny value? That might cause it.

Nope. Submits are all brokered by my code and the retirement time for
all jobs is fixed at 2 weeks in the submit tickets. On the machines it's
set to 16 weeks. See below for the machine config.
 
> Also when did the job enter a retiring state, you didn't include that
> part of the start log.

Our jobs get automatically put into the retiring state after 600 seconds
of execution. This is to ensure the slot is returned and re-negotiated.
Jobs can only be preempted because of RANK in the first 300 seconds of
execution, after that they're locked to the machine. In our system:

ALTERA_MaxJobRetirementTime = 9676800
ALTERA_EARLY_PREEMPTION_TIME = 300
ALTERA_AUTO_RETIREMENT_TIME = 600
WANT_VACATE 	= ( $(ActivationTimer) > 600 || $(IsPVM) || $(IsVanilla) )
WANT_SUSPEND 	= False
SUSPEND 		= False
CONTINUE 		= True
PREEMPT = ( $(ActivationTimer) > $(ALTERA_AUTO_RETIREMENT_TIME) )
KILL = $(ActivityTimer) > $(MaxVacateTime)
MaxJobRetirementTime = \
( \
   $(ALTERA_MaxJobRetirementTime) * \
	( \
		(Activity != "Idle") && \
		( \
			($(ActivationTimer) > $(ALTERA_EARLY_PREEMPTION_TIME)) || \
			(MY.AlteraJobAttributeIsInteractive =?= TRUE) \
		) \
	) \
)
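
To make the intent of that MaxJobRetirementTime expression easier to check, here is a rough Python sketch of how I read it. This is not Condor code: the function name and its arguments are made up for illustration, and it assumes a true boolean counts as 1 in the multiplication and that an undefined AlteraJobAttributeIsInteractive fails the =?= test.

```python
# Sketch only: mirrors the MaxJobRetirementTime expression from the config.
ALTERA_MAX_JOB_RETIREMENT_TIME = 9676800  # seconds (16 weeks)
ALTERA_EARLY_PREEMPTION_TIME = 300        # seconds


def max_job_retirement_time(activity, activation_timer, is_interactive):
    """Hypothetical helper: retirement time in seconds offered by the machine.

    activity         -- slot activity string, e.g. "Busy" or "Idle"
    activation_timer -- seconds since the job was activated
    is_interactive   -- True, False, or None when the attribute is undefined
    """
    protected = activity != "Idle" and (
        activation_timer > ALTERA_EARLY_PREEMPTION_TIME
        or is_interactive is True  # stands in for =?= TRUE
    )
    # Assumption: TRUE multiplies as 1 and FALSE as 0.
    return ALTERA_MAX_JOB_RETIREMENT_TIME * int(protected)


# Non-interactive job in its first 300 seconds: no retirement protection.
print(max_job_retirement_time("Busy", 100, False))   # 0
# Interactive job: protected from the start.
print(max_job_retirement_time("Busy", 100, True))    # 9676800
# Any busy job past the early preemption window: protected.
print(max_job_retirement_time("Busy", 301, False))   # 9676800
```

If that reading is right, an interactive job should get the full 16 weeks from the moment it starts.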

So it would have changed to the retiring state on 4/27 16:17:09. And it
got booted off on 4/28 17:35:17 -- long before any retirement timers
elapsed. Also worth mentioning that the job was un-preemptable because
AlteraJobAttributeIsInteractive == True for the job.
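
A quick sanity check on those timestamps, as a Python sketch. The year 2008 is assumed from the date of this thread; the two limits are the 2 week value from the submit tickets and ALTERA_MaxJobRetirementTime from the machine config.

```python
from datetime import datetime, timedelta

# Timestamps from the narrative above; the year is an assumption.
retiring_at = datetime(2008, 4, 27, 16, 17, 9)
evicted_at = datetime(2008, 4, 28, 17, 35, 17)

job_limit = timedelta(weeks=2)               # from the submit tickets
machine_limit = timedelta(seconds=9676800)   # ALTERA_MaxJobRetirementTime

elapsed = evicted_at - retiring_at
print(elapsed)                        # 1 day, 1:18:08
print(int(elapsed.total_seconds()))   # 91088
print(elapsed < job_limit)            # True
print(elapsed < machine_limit)        # True
```

So the job had spent roughly 25 hours in the retiring state, well under either limit.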

- Ian

