
[Condor-users] Back to starting point... making my jobs to suspend for as long as required



Hi all,

I still haven't found the right way to make Condor behave the way I need...

If I've read the documentation for 6.6.10 (vanilla universe on Win32)
correctly, I can (somehow!) submit a job that runs while the machine is free,
is suspended when there is user activity, and resumes later, all without
being preempted or interrupted (which would mean starting again from
scratch) and/or killed...

Is the sentence above right, for a start???

Also, if I understood everything correctly, I should expect the Status column
of condor_q to show only R, meaning my job is either currently running or
suspended, and NOT to go back to I.

The Activity column of condor_status should show either Busy or Suspended,
and not go back to Idle. Is that correct??

What I see here is this: I submit my jobs (in Hold status, releasing them
manually as required), I can see them going into Running with condor_q, and
I can see the machine going to Busy/Suspended.
However, when I check the next day, the machine has gone back to Idle (with
user activity, of course).
I expected the machine to still be in Suspended state, so I conclude that my
jobs have been killed.
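
A minimal sketch of a submit description for this hold-then-release workflow
(an illustration with placeholder file names, not the actual file used here;
it assumes the hold command of condor_submit works as its manual page
describes):

	# Placeholder names (myjob.exe, myjob.log), for illustration only
	universe   = vanilla
	executable = myjob.exe
	log        = myjob.log
	# Queue the job in the Hold state (H in condor_q) instead of Idle
	hold       = True
	queue

A job submitted this way should stay in H until it is released with
condor_release and its job ID.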

These are the relevant lines from the default installation in the
condor_config on the nodes (thank you for taking the time to look at them ;-):

	CONDOR_HOST = {the IP address of my central manager}
	RELEASE_DIR = C:\Condor
	LOCAL_DIR = C:\Condor
	LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
	CONDOR_ADMIN = mdilaj@xxxxxxxxxxxx
	MAIL = $(BIN)/condor_mail.exe
	SMTP_SERVER = {the IP address of my SMTP server}
	UID_DOMAIN = $(FULL_HOSTNAME)
	FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

	HOSTALLOW_ADMINISTRATOR = {the IP address of my central manager}
	HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
	HOSTALLOW_READ = {the IP address of my central manager}
	HOSTALLOW_WRITE = {the IP address of my central manager}
	HOSTALLOW_NEGOTIATOR = $(NEGOTIATOR_HOST)
	HOSTALLOW_NEGOTIATOR_SCHEDD = $(NEGOTIATOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
	HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
	HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
	HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
	HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)

	USE_NFS = False
	USE_AFS = False

	MaxSuspendTime = 100 * $(HOUR)

	WANT_SUSPEND = TRUE
	WANT_VACATE = FALSE
	VACATE = FALSE
	START = $(UWCS_START)
	SUSPEND = $(UWCS_SUSPEND)
	CONTINUE = $(UWCS_CONTINUE)
	PREEMPT = FALSE
	KILL = FALSE
	PERIODIC_CHECKPOINT = $(UWCS_PERIODIC_CHECKPOINT)
	PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
	PREEMPTION_RANK = $(UWCS_PREEMPTION_RANK)
	NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
	NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)

	UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) \
	|| $(IsPVM) || $(IsVanilla) )
	UWCS_WANT_VACATE = ( $(ActivationTimer) > 10 * $(MINUTE) \
	|| $(IsPVM) || $(IsVanilla) )
	UWCS_START = ( (KeyboardIdle > $(StartIdleTime)) \
	&& ( $(CPUIdle) || \
	(State != "Unclaimed" && State != "Owner")) )
	UWCS_SUSPEND = ( $(KeyboardBusy) || \
	( (CpuBusyTime > 2 * $(MINUTE)) \
	&& $(ActivationTimer) > 90 ) )
	UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
	&& (KeyboardIdle > $(ContinueIdleTime)) )
	UWCS_PREEMPT = ( ((Activity == "Suspended") && \
	($(ActivityTimer) > $(MaxSuspendTime))) \
	|| (SUSPEND && (WANT_SUSPEND == False)) )
	UWCS_KILL = $(ActivityTimer) > $(MaxVacateTime) 
	UWCS_PERIODIC_CHECKPOINT = $(LastCkpt) > (3 * $(HOUR))
	UWCS_NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
	UWCS_PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2
	UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize

	NUM_CPUS = 1

All other active (i.e., non-commented) lines in the default config file are
unmodified; I hope they are not breaking anything.

Any suggestions on how to set up a Condor pool that is managed only from my
central manager, accepts job submissions only from the central manager, and
keeps a job "alive" until it finishes some 4 days later (suspending it when
there's user activity and resuming it afterwards) ARE WELCOME ;-)

TIA!
Regards,

Miguel

