Re: [HTCondor-users] Configuring shared resources



For example "Team Bioinfo" and "Team Data" get priority access to one GPU server each but can use the other team's GPU server if it's not in use. If Team Data want to use their GPU then any of Team Bioinfo's jobs are evicted & requeued on their own GPU. I've read something about this in the past few weeks but can't see it now. Are there any good examples or docs on how to do this?

We call this the "condo" model. The basic idea is that you configure the GPU servers to prefer their owners' jobs, preempting other jobs in order to run them if necessary. Unfortunately, the configuration is totally different depending on whether you're using static or dynamic slots.


For static slots, do something like the following on the EPs:

	# Prefer jobs from team bioinfo.
	RANK = (AcctGroup == "bioinfo") * 1000
	# Or uncomment to prefer jobs from team data.
	# RANK = (AcctGroup == "data") * 1000

	# When it's time to go, it's time to go.
	MAXJOBRETIREMENTTIME = 0

The default value of NEGOTIATOR_PRE_JOB_RANK includes the machine's RANK for the job, so jobs which match either GPU machine will run on the one that won't preempt them, if given the chance.
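
To make the RANK behaviour above a bit more concrete, here is a toy Python model of it. This is only a sketch, not HTCondor code: the machine names and function names are made up for illustration, and it ignores the other terms in NEGOTIATOR_PRE_JOB_RANK and looks only at the machine's RANK.

```python
from typing import List, Optional

# Hypothetical machine names, each mapped to the group its RANK prefers.
MACHINES = {"gpu-bioinfo": "bioinfo", "gpu-data": "data"}


def machine_rank(preferred_group: str, job_group: str) -> int:
    """Mirror RANK = (AcctGroup == "<group>") * 1000 for one machine."""
    return int(job_group == preferred_group) * 1000


def preferred_machine(job_group: str, candidates: List[str]) -> Optional[str]:
    """Pick the candidate machine whose RANK for this job is highest."""
    if not candidates:
        return None
    return max(candidates, key=lambda m: machine_rank(MACHINES[m], job_group))


def should_preempt(machine: str, running_group: str, new_group: str) -> bool:
    """A machine preempts only for a job it ranks strictly higher."""
    owner = MACHINES[machine]
    return machine_rank(owner, new_group) > machine_rank(owner, running_group)
```

In this model a "data" job offered both machines lands on gpu-data, and gpu-data will evict a "bioinfo" job for a "data" job but never the other way around.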


For dynamic slots, the above will work, but only if all of your jobs are the same "size": dynamic slots will be preempted (kicked off if there's a job from the owners' group) only if the new job fits. If group A's jobs need more cores or RAM than group B's, for example, they won't kick group B's jobs off because they can't fit in group B's slots. Otherwise, you can do something like the following:

# Slot type 1 is partitionable.
use FEATURE : PartitionableSlot(1)

# Slot type 2 is partitionable.
use FEATURE : PartitionableSlot(2)
# Let this slot know if it's using resources assigned to a type-1 slot.
SLOT_TYPE_2_BACKFILL = TRUE
# Kick jobs off type-2 slots if they use any resource in use by a type-1 slot.
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
# Don't start a "bioinfo" job on a slot that another "bioinfo" job will preempt.
SLOT_TYPE_2_START = (AcctGroup != "bioinfo")

# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0

If I could define the team memberships via LDAP (lookups in FreeIPA?) or similar it would be even better!

HTCondor doesn't directly support this at the moment, but you can use the AssignAccountingGroup feature to make sure that every job defines the "AcctGroup" attribute:

	use feature:AssignAccountingGroup(map_file_name)

where map_file_name is a file that looks something like the following:

*	user_name1	bioinfo
*	user_name2	data
*	".*"		no_group

Pulling information out of LDAP and converting it into this form is left as an exercise for the reader. ;)
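
For the "converting it into this form" half, a minimal Python sketch might look like the following. It assumes you have already fetched the group memberships (from LDAP or anywhere else) into a plain dict; the function name and the dict layout are hypothetical, and the actual LDAP lookup is not shown.

```python
from typing import Dict, List


def render_map_file(memberships: Dict[str, List[str]],
                    default_group: str = "no_group") -> str:
    """Render {group: [users]} as map-file text in the form shown above."""
    lines = []
    seen = set()
    for group in sorted(memberships):
        for user in sorted(memberships[group]):
            if user in seen:
                # A user listed in several groups keeps the first
                # (alphabetically earliest) group only.
                continue
            seen.add(user)
            lines.append(f"*\t{user}\t{group}")
    # Catch-all entry goes last so it only applies to unlisted users.
    lines.append(f'*\t".*"\t{default_group}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example data standing in for the result of an LDAP query.
    print(render_map_file({"bioinfo": ["user_name1"], "data": ["user_name2"]}), end="")
```

Writing the result to the file named in AssignAccountingGroup (from cron, say) would keep the map in step with the directory.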

-- ToddM