
[Condor-users] Problem with START configuration for allocating whole machine - VMs stuck in owner state after job is removed



I'm trying to do something along the lines of what's described here, to provide a single-slot option for large jobs:
http://nmi.cs.wisc.edu/node/1482
I'm implementing it on Condor 6.8.8, so I've changed the syntax to use VirtualMachineID instead of SlotID and vm1 instead of Slot1.

However, I'm running into trouble with machines not recovering when the large jobs finish.

It works in that jobs marked with "+RequiresWholeMachine = True" will only run in vm1, and the other VMs on each machine are marked as being in "Owner" status once the job starts. However, when I remove the RequiresWholeMachine job from the queue, the other VMs get stuck in the "Owner" state and never return to being unclaimed. Here's my START condition:

START = ( ( $(CPUIdle) || \
            (State != "Unclaimed" && State != "Owner") ) && \
          (VirtualMachineID == 1 || vm1_RequiresWholeMachine =!= True) && \
          (TARGET.RequiresWholeMachine =!= True || VirtualMachineID == 1) )
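
The START expression can only see vm1_RequiresWholeMachine if the startd copies the job attribute into the machine ad and shares it across VMs. A minimal sketch of the supporting settings, assuming the 6.8-era name STARTD_VM_EXPRS (STARTD_SLOT_EXPRS in later releases) and that the rest of the recipe was carried over unchanged; this is an assumption, not necessarily the exact configuration in use:

# Assumed: copy the job's RequiresWholeMachine attribute into the ad of the VM running it
STARTD_JOB_EXPRS = RequiresWholeMachine
# Assumed: publish each VM's copy to the other VMs as vmN_RequiresWholeMachine
STARTD_VM_EXPRS = RequiresWholeMachine

If vm1_RequiresWholeMachine is not cleared or refreshed in the other VMs' ads after the job leaves vm1, the second clause of START would keep evaluating to False on them.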

Can anyone spot what I'm doing wrong that's preventing VMs from returning to the "Unclaimed" state once the RequiresWholeMachine job is removed? They seem to stay that way until I run 'condor_reconfig' to force a reload.

--

David Brodbeck
System Administrator, Linguistics
University of Washington