[Condor-users] jobs vacating reason
- Date: Thu, 9 Dec 2010 13:03:48 -0500
- From: Erik Aronesty <erik@xxxxxxx>
- Subject: [Condor-users] jobs vacating reason
I'm very new to Condor, and although I seem to have gotten it working (one submit node, 6 compute nodes, 36 slots) and am running jobs, I have a couple of questions:
1. Where can I look to find out precisely why jobs are vacating and restarting?
2. For now, I'm using dedicated machines, so I don't want vanilla jobs to "vacate/kill/die", since that just means they get restarted, usually when they are 90% of the way through. (I haven't yet tried compiling with the Condor libraries and running standard universe jobs, but I'd like the config to be set up properly for them.) If a job without checkpointing is preempted, or if the CPU gets busy, I'd like it to SUSPEND, never vacate.
Here are the relevant config settings I can think of. I suspect that KILL_VANILLA and VACATE_VANILLA won't do what I expect, and that Condor may use "more drastic measures" anyway (although I'm not sure what "more drastic" means).
SUSPEND = $(CPUBusy)
WANT_SUSPEND = True
MAXVACATETIME = 20 * $(MINUTE)
VACATE = $(ActivityTimer) > $(MaxSuspendTime)
VACATE_VANILLA = False
WANT_VACATE = True
KILL = $(UWCS_KILL)
KILL_VANILLA = False
PREEMPT = $(UWCS_PREEMPT)
Yet I still see attributes like this when looking at the queue:
LastVacateTime = 1291916587
and this when grepping the logs...
Changing state and activity: Claimed/Idle -> Preempting/Vacating
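
To line up the two clues above, it may help to turn the LastVacateTime value into a readable time and to pull the old and new state/activity out of the log line. The following is only a minimal Python sketch: the helper names epoch_to_utc and parse_transition are hypothetical, it assumes LastVacateTime is a Unix epoch timestamp in seconds, and the regular expression only covers the "Changing state and activity" form quoted above, not any other log line format.

```python
import re
from datetime import datetime, timezone

# Matches only the log fragment quoted in this post, e.g.
# "Changing state and activity: Claimed/Idle -> Preempting/Vacating".
# Other transition messages are an assumption and are not handled here.
TRANSITION = re.compile(
    r"Changing state and activity: (\w+)/(\w+) -> (\w+)/(\w+)"
)


def epoch_to_utc(seconds):
    """Format a Unix epoch value (assumed to be in seconds) as UTC."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def parse_transition(line):
    """Return (old_state, old_activity, new_state, new_activity) or None."""
    match = TRANSITION.search(line)
    return match.groups() if match else None


if __name__ == "__main__":
    # Values copied from the queue attribute and log line shown above.
    print(epoch_to_utc(1291916587))
    print(parse_transition(
        "Changing state and activity: Claimed/Idle -> Preempting/Vacating"
    ))
```

Running parse_transition over every line of a log file and comparing the timestamps on the matching lines with the converted LastVacateTime should at least show which transition coincided with the vacate, even though it does not by itself say which policy expression triggered it.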