
[Condor-users] jobs vacating reason



I'm very new to Condor, and although I seem to have gotten it working (one submit node, 6 compute nodes, 36 slots) and am running jobs, I have a couple of questions:

1. Where can I look to find out precisely why jobs are vacating and restarting?

2. For now, I'm using dedicated machines, so I don't want vanilla jobs to "vacate/kill/die", since that just means they get restarted, usually 90% of the way through. (I haven't yet tried compiling with the Condor libs and running standard universe jobs, but I'd like the config to handle them nicely too.) If a job without checkpointing is preempted, or if the CPU gets busy, I'd like it to SUSPEND, never vacate.

Here are the relevant config settings I can think of. I suspect KILL_VANILLA and VACATE_VANILLA won't do what I expect, and that Condor may use "more drastic measures" anyway (although I'm not sure what "more drastic" means).

SUSPEND = $(CPUBusy)
WANT_SUSPEND = True
MAXVACATETIME = 20 * $(MINUTE)
VACATE = $(ActivityTimer) > $(MaxSuspendTime)
VACATE_VANILLA = False
WANT_VACATE = True
KILL = $(UWCS_KILL)
KILL_VANILLA = False
PREEMPT = $(UWCS_PREEMPT)
PREEMPT_VANILLA = False

Yet when looking at the queue I still see things like:

LastVacateTime = 1291916587

and this when grepping the logs...

Changing state and activity: Claimed/Idle -> Preempting/Vacating
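
As a minimal sketch of how those two clues could be decoded, assuming Python 3 is available on the submit node: the helper names below (epoch_to_utc, parse_transition) are hypothetical and not part of Condor. The first turns the LastVacateTime value, assumed to be seconds since the Unix epoch, into a readable UTC time; the second pulls the old and new state/activity out of a log line of the form shown above.

```python
import re
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Render a Unix timestamp (assumed to be what LastVacateTime holds) as ISO 8601 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

# Matches lines such as:
#   Changing state and activity: Claimed/Idle -> Preempting/Vacating
TRANSITION = re.compile(
    r"Changing state and activity: (\S+)/(\S+) -> (\S+)/(\S+)"
)

def parse_transition(line):
    """Return (old_state, old_activity, new_state, new_activity), or None if no match."""
    match = TRANSITION.search(line)
    return match.groups() if match else None

if __name__ == "__main__":
    print(epoch_to_utc(1291916587))  # 2010-12-09T17:43:07+00:00
    print(parse_transition(
        "Changing state and activity: Claimed/Idle -> Preempting/Vacating"
    ))
```

Lining up the decoded vacate time against the timestamps of the matching transition lines in the logs should at least show which machine vacated the job and when, which narrows down where to look next.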