[Condor-users] Problem with START configuration for allocating whole machine - VMs stuck in owner state after job is removed
- Date: Fri, 20 Feb 2009 14:57:20 -0800
- From: David Brodbeck <brodbd@xxxxxxxxxxxxxxxx>
I'm trying to do something along the lines of what's described here,
to provide a single-slot option for large jobs:
http://nmi.cs.wisc.edu/node/1482
I'm implementing it on Condor 6.8.8, so I've changed the syntax to use
VirtualMachineID instead of SlotID and vm1 instead of Slot1.
However, I'm running into trouble with machines not recovering when
the large jobs finish.
It works in that jobs marked with "+RequiresWholeMachine = True" will
only run in vm1, and the other VMs on each machine are marked as being
in "Owner" status once the job starts. However, when I remove the
RequiresWholeMachine job from the queue, the other VMs get stuck in
the "Owner" state and never return to being unclaimed. Here's my
START condition:
START = ( ( $(CPUIdle) || \
            (State != "Unclaimed" && State != "Owner") ) \
          && (VirtualMachineID == 1 || vm1_RequiresWholeMachine =!= True) \
          && (TARGET.RequiresWholeMachine =!= True || VirtualMachineID == 1) )
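To make the intended behavior of that expression easier to check, here is a
minimal Python sketch of its truth table. It is only a model, not Condor code:
the function names are hypothetical, $(CPUIdle) is treated as a plain boolean
input, and None stands in for a ClassAd attribute that is UNDEFINED.

```python
# Hypothetical model of the START expression above; not Condor code.
# None stands in for a ClassAd attribute that is UNDEFINED.

def is_not_identical(left, right):
    """Model of the ClassAd '=!=' operator.

    It is true unless both sides have the same type and the same value,
    so an UNDEFINED (None) attribute compared with True yields True.
    """
    return not (type(left) is type(right) and left == right)


def start(cpu_idle, state, vm_id, vm1_requires_whole, job_requires_whole):
    """Evaluate the START expression for one VM and one candidate job."""
    # ( $(CPUIdle) || (State != "Unclaimed" && State != "Owner") )
    available = cpu_idle or (state != "Unclaimed" and state != "Owner")
    # (VirtualMachineID == 1 || vm1_RequiresWholeMachine =!= True)
    vm1_allows = vm_id == 1 or is_not_identical(vm1_requires_whole, True)
    # (TARGET.RequiresWholeMachine =!= True || VirtualMachineID == 1)
    job_fits = is_not_identical(job_requires_whole, True) or vm_id == 1
    return available and vm1_allows and job_fits


if __name__ == "__main__":
    # Whole-machine job offered to vm1 and to vm2.
    print(start(True, "Unclaimed", 1, None, True))
    print(start(True, "Unclaimed", 2, None, True))
    # Ordinary job offered to vm2 while vm1 advertises a whole-machine job.
    print(start(True, "Unclaimed", 2, True, None))
    # Ordinary job offered to vm2 once vm1_RequiresWholeMachine is UNDEFINED.
    print(start(True, "Unclaimed", 2, None, None))
```

In this model, vm2 accepts jobs again only when vm1_RequiresWholeMachine no
longer evaluates to True in vm2's own ad, so that attribute's value after the
job is removed is the thing to inspect.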
Can anyone spot what I'm doing wrong that's preventing VMs from
returning to the "Unclaimed" state once the RequiresWholeMachine job
is removed? They seem to stay that way until I run 'condor_reconfig'
to force a reload.
--
David Brodbeck
System Administrator, Linguistics
University of Washington