
Re: [Condor-users] jobs vacating reason



Thanks!  

Although what I've found is that I do want my vanilla jobs to suspend when the load average is high, especially when the load comes from other users.   But I don't want them to be killed, ever.

I changed SUSPEND_VANILLA to True, and I'm hoping that will ensure vanilla jobs always get suspended, not killed, when Condor decides preempting or vacating may be necessary.


On Thu, Dec 9, 2010 at 1:32 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
On 12/09/2010 01:03 PM, Erik Aronesty wrote:
I'm very new to Condor, and although I seem to have gotten it working
(one submit node, 6 compute nodes, 36 slots), and am running jobs, I
have a couple of questions:

1. Where can I look to find out precisely why jobs are vacating and
restarting?

2. For now, I'm using dedicated machines... and thus I don't want
vanilla jobs to "vacate/kill/die" since it just means they get
restarted... usually 90% of the way through them.   (I haven't tried,
yet, compiling with condor libs and running standard universe jobs...
but I'd like the config to be done nicely for them.)  If a job without
checkpointing is preempted, or if the CPU gets busy, I'd like it to
SUSPEND, never vacate.

Here are the relevant config settings I can think of.   I think perhaps
the KILL_VANILLA and VACATE_VANILLA won't do what I expect, and Condor
may use "more drastic measures" anyway (although I'm not sure what "more
drastic" means).

SUSPEND = $(CPUBusy)
WANT_SUSPEND = True
MAXVACATETIME = 20 * $(MINUTE)
VACATE = $(ActivityTimer) > $(MaxSuspendTime)
VACATE_VANILLA = False
WANT_VACATE = True
KILL = $(UWCS_KILL)
KILL_VANILLA = False
PREEMPT = $(UWCS_PREEMPT)
PREEMPT_VANILLA = False

Yet I still see things like this when looking at the queue:

LastVacateTime = 1291916587

and this when grepping the logs...

Changing state and activity: Claimed/Idle -> Preempting/Vacating
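
The LastVacateTime value quoted above looks like a Unix timestamp (seconds since 1970-01-01 UTC), which makes it easier to line up with log entries once converted. A minimal Python sketch, assuming that interpretation; the helper name epoch_to_utc is just for illustration:

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Value copied from the LastVacateTime attribute shown in the queue output.
last_vacate = epoch_to_utc(1291916587)
print(last_vacate.isoformat())  # 2010-12-09T17:43:07+00:00
```

The converted time can then be compared against the timestamps at the start of each log line, keeping in mind that the logs are normally written in the machine's local time zone.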

The StartLog is going to be your best source of information, though you may need to set STARTD_DEBUG = D_FULLDEBUG.
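
As a rough aid for reading the StartLog, here is a small Python sketch that pulls out transitions into the Preempting state. It assumes the message text matches the "Changing state and activity: A/B -> C/D" form quoted earlier in this thread; the function names are hypothetical, and the sample lines stand in for a real log file.

```python
import re

# Matches the transition text shown in the log excerpt in this thread, e.g.
# "Changing state and activity: Claimed/Idle -> Preempting/Vacating"
TRANSITION = re.compile(
    r"Changing state and activity:\s*(\w+)/(\w+)\s*->\s*(\w+)/(\w+)"
)


def find_transitions(lines):
    """Yield (from_state, from_activity, to_state, to_activity) per matching line."""
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            yield match.groups()


def vacate_events(lines):
    """Return only the transitions whose destination state is Preempting."""
    return [t for t in find_transitions(lines) if t[2] == "Preempting"]


# Stand-in for log contents; with a real log, pass an open file object instead.
sample = [
    "Changing state and activity: Claimed/Idle -> Preempting/Vacating",
    "some unrelated log line",
]
for event in vacate_events(sample):
    print(event)
```

With a real log you would pass the open file (for example `with open(path) as f: vacate_events(f)`), where the path to StartLog is site-specific. The lines just before each match are usually the place to look for the policy expression that triggered the change.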

Since your resources are dedicated and you are starting out, you might be interested in starting with a very simple policy that allows everything to run to completion. You can build from it later.

START = TRUE
WANT_SUSPEND = FALSE
WANT_VACATE = FALSE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

The config above should be set on every node running a condor_startd.

Now, there's one other thing to consider: preemption. You can disable it with,

PREEMPTION_REQUIREMENTS = FALSE
RANK = 0

which is configuration for the machine running your condor_negotiator.

You can read about all this in the manual,

 http://www.cs.wisc.edu/condor/manual/v7.5/3_5Policy_Configuration.html

Best,


matt