[Condor-users] Advice on job suspension vs preemption with partitionable slots



Hi,

We use partitionable slots in our pool because job requirements vary
widely, and that works well. In addition, a subset of our jobs runs
substantially longer than the others. These jobs can't be snapshotted,
so eviction is expensive.

The machines have plenty of memory and could tolerate a dozen of these
jobs sitting idle, so I'd like them to be suspended rather than
evicted. The solution seems to be to over-advertise CPUs and split the
(virtual) resources into two slot types, where a job on a high-priority
slot suspends the job on a low-priority slot, for example as described
here:

https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSuspendJobs

However, I fail to grasp how this can be integrated into a setup with
partitionable slots. Does anybody have an example configuration for a
similar setup?

An alternative solution may be to "downgrade" the default preemption
settings to suspension and disable preemption for precious jobs
completely (or until an insane priority difference is reached).
Something like:

PREEMPT = False
PREEMPTION_REQUIREMENTS = ((RemoteUserPrio > $(INSANE_THRESHOLD)) || NiceUser == True)
WANT_SUSPEND = ( (TARGET.PleasePleaseSuspend =?= TRUE) || (MY.NiceUser == True) )
SUSPEND = ( $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > TARGET.SubmitterUserPrio * 1.2 ) || (MY.NiceUser == True)
CONTINUE = ( $(SUSPEND) =!= True )
MAXSUSPENDTIME = 48 * $(HOUR)

Does this sound sane?

Thanks in advance,

Michael

-- 
Michael Hanke
http://mih.voxindeserto.de