
Re: [HTCondor-users] Configuring shared resources



Thanks Todd, that gets me started and gives me something to Google.
Regarding configuration for slots and groups, is this best done in condor_config or config.d/01-??.config or condor_config.local or somewhere else?


-- Russell

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd L Miller via HTCondor-users
Sent: Tuesday, August 8, 2023 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring shared resources

> For example "Team Bioinfo" and "Team Data" get priority access to one
> GPU server each but can use the other teams GPU server if it's not in
> use. If Team Data want to use their GPU then any of Team Bioinfo's
> jobs are evicted & requeued on their own GPU. I've read something
> about this in the past few weeks but can't see it now. Are there any
> good examples or docs on how to do this?

        We call this the "condo" model.  The basic idea is that you configure the GPU servers to prefer their owners' jobs, preempting other jobs in order to run them if necessary.  Unfortunately, the configuration is totally different depending on whether you're using static or dynamic slots.


For static slots, do something like the following on the EPs:

        # Prefer jobs from team bioinfo.
        RANK = (AcctGroup == "bioinfo") * 1000
        # Or uncomment to prefer jobs from team data.
        # RANK = (AcctGroup == "data") * 1000

        # When it's time to go, it's time to go.
        MAXJOBRETIREMENTTIME = 0

The default value of NEGOTIATOR_PRE_JOB_RANK includes the machine's RANK for the job, so jobs which match either GPU machine will run on the one that won't preempt them, if given the chance.
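
To make the effect of that RANK expression concrete, here is a toy model in Python. It is only an illustration, not HTCondor's negotiator logic: the machine names and both helper functions are hypothetical, and the real NEGOTIATOR_PRE_JOB_RANK takes more than the machine's RANK into account.

```python
def machine_rank(job_group, owner_group):
    """Toy stand-in for RANK = (AcctGroup == "<owner>") * 1000."""
    return 1000 if job_group == owner_group else 0


def preferred_machine(job_group, machines):
    """Pick the machine that ranks this job highest.

    `machines` maps a (hypothetical) machine name to its owning group.
    """
    return max(machines, key=lambda name: machine_rank(job_group, machines[name]))


if __name__ == "__main__":
    # Hypothetical machine names, one GPU server per team.
    machines = {"gpu-bioinfo": "bioinfo", "gpu-data": "data"}
    for group in ("bioinfo", "data"):
        print(group, "->", preferred_machine(group, machines))
```

A job from the owning group gets rank 1000 on its own server and 0 on the other one, which is why it lands on its own server when both are free.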


For dynamic slots, the above will work, but only if all of your jobs are
the same "size": dynamic slots will be preempted (kicked off if there's a
job from the owners' group) only if the new job fits.  If group A's jobs
use fewer cores or RAM than group B's, for example, they won't kick group
B's jobs off because they can't fit in the slot.  Otherwise, you can do
something like the following:

# Slot type 1 is partitionable.
use FEATURE : PartitionableSlot(1)

# Slot type 2 is partitionable.
use FEATURE : PartitionableSlot(2)
# Let this slot know if it's using resources assigned to a type-1 slot.
SLOT_TYPE_2_BACKFILL = TRUE
# Kick jobs off type-2 slots if they use any resource in use by a type-1 slot.
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
# Don't start a "bioinfo" job on a slot that another "bioinfo" job will preempt.
SLOT_TYPE_2_START = (AcctGroup != "bioinfo")

# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0

> If I could define the team memberships via LDAP (lookups in FreeIPA?) or
> similar it would be even better!

        HTCondor doesn't directly support this at the moment, but you can
use the AssignAccountingGroup feature to make sure that every job defines
the "AcctGroup" attribute:

        use feature:AssignAccountingGroup(map_file_name)

where map_file_name is a file that looks something like the following:

*       user_name1      bioinfo
*       user_name2      data
*       ".*"            no_group

Pulling information out of LDAP and converting it into this form is left
as an exercise for the reader. ;)
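
As a starting point for that exercise, here is a minimal Python sketch of the second half only: turning group memberships into text in the layout shown above. It assumes you have already pulled the memberships out of LDAP into a dict by whatever means you prefer. The function name `build_map_file` and the tab-separated layout are assumptions for illustration, not part of HTCondor.

```python
def build_map_file(group_members, default_group="no_group"):
    """Return map-file text: one '* user group' line per user, then a catch-all.

    `group_members` maps an accounting group name to a list of user names.
    If a user appears in more than one group, the first group listed wins.
    """
    lines = []
    seen = set()
    for group, users in group_members.items():
        for user in users:
            if user in seen:
                continue  # keep only the first mapping for each user
            seen.add(user)
            lines.append(f"*\t{user}\t{group}")
    # Catch-all entry, written the same way as in the example above.
    lines.append(f'*\t".*"\t{default_group}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder memberships; in practice these would come from LDAP.
    members = {"bioinfo": ["user_name1"], "data": ["user_name2"]}
    print(build_map_file(members), end="")
```

Writing the result to the file named in AssignAccountingGroup, and regenerating it when memberships change, is still up to you.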

-- ToddM