
Re: [HTCondor-users] Group Quota and dynamic Slot.



We have seen a very similar problem to #1 at Fermilab. It was reported a few months ago, and a fix is in the works.

What Christoph said about enabling a consumption policy will work as long as your pool is not too big. They are also going to refactor the matching code for the case where consumption policies are not in place. (We do not use consumption policies.)
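
For reference, a minimal sketch of what enabling a consumption policy on a partitionable slot looks like on the execute node. The knob names are from the HTCondor manual; the expressions and the claim count below are only illustrative (close to the documented defaults), so adapt them to your pool:

    # Enable a consumption policy on the partitionable slot type
    SLOT_TYPE_1_CONSUMPTION_POLICY = True
    # How much of each resource one match consumes
    CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})
    CONSUMPTION_MEMORY = quantize(target.RequestMemory, {128})
    # Advertise several claim ids so the single partitionable slot
    # can be matched several times within one negotiation cycle
    NUM_CLAIMS = 10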

We don't run preemption, so we have no experience with #2.


Steve



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of "Geonmo Ryu" <geonmo@xxxxxxxxxxx>
Sent: Tuesday, September 4, 2018 11:25:51 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Group Quota and dynamic Slot.
 

Hello, HTCondor users.

 

We are having trouble setting up our HTCondor system.

 

We want to configure HTCondor to apply a group management policy that guarantees minimum slots for each group.

 

In static slot mode, the settings seemed to work.

 

However, they did not work well in dynamic slot mode.

 

 

The problem is as follows.

1. If the job queue is small, the jobs do not run at all.

 

For example, when the job queue is small, condor_q -better-analyze <jobID> produces the output below. It shows that all three machines match the job, yet none of them is available to run it:


    DiskUsage = 1

    ImageSize = 1

    RequestDisk = DiskUsage

    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)

The Requirements expression for job 63.000 reduces to these conditions:

         Slots

Step    Matched  Condition

-----  --------  ---------

[0]           3  HasSingularity == true

[1]           3  TARGET.Arch == "X86_64"

[3]           3  TARGET.OpSys == "LINUX"

[5]           3  TARGET.Disk >= RequestDisk

[7]           3  TARGET.Memory >= RequestMemory

[9]           3  TARGET.HasFileTransfer

No successful match recorded.

Last failed match: Wed Sep  5 13:07:36 2018

Reason for last match failure: no match found 

063.000:  Run analysis summary ignoring user priority.  Of 3 machines,

      0 are rejected by your job's requirements 

      0 reject your job because of their own requirements 

      0 match and are already running your jobs 

      0 match but are serving other users 

      0 are available to run your job
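
In case it helps to diagnose, the state of the partitionable slots while the job sits idle can be inspected with condor_status autoformat; the attribute list here is just a suggestion:

    condor_status -af:h Name PartitionableSlot Cpus Memory

The partitionable slot still advertises enough Cpus and Memory while the match keeps failing, which makes us suspect the matchmaking side rather than exhausted resources.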


 

2. When a group's running jobs consume more than the group's guaranteed number of slots, we want another group to be able to take those slots away from the running jobs. However, no preemption happens.

 

Our settings are as follows.

######## Central Manager ###########

~~~~ Auth. ~~~~

NEGOTIATOR_INTERVAL = 20

TRUST_UID_DOMAIN = TRUE

START = TRUE

SUSPEND = FALSE

PREEMPT = TRUE

KILL = FALSE

REQUIRE_LOCAL_CONFIG_FILE = False

GROUP_NAMES = group_alice, group_cms

GROUP_QUOTA_group_alice = 84

GROUP_QUOTA_group_cms = 84

GROUP_ACCEPT_SURPLUS = true

NEGOTIATOR_CONSIDER_PREEMPTION = True

PREEMPTION_REQUIREMENTS = True

PREEMPTION_REQUIREMENTS = $(PREEMPTION_REQUIREMENTS) && (((SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)) || (SubmitterGroup =?= RemoteGroup))

MAXJOBRETIREMENTTIME = 0

NEGOTIATOR_CONSIDER_EARLY_PREEMPTION = True

NEGOTIATOR_UPDATE_INTERVAL = 60

PREEMPTION_RANK = 2592000 - ifThenElse(isUndefined(TotalJobRuntime),0,TotalJobRuntime)

NEGOTIATOR_POST_JOB_RANK = 1

NEGOTIATOR_PRE_JOB_RANK = 1

PREEMPTION_RANK_STABLE = False

ALLOW_PSLOT_PREEMPTION = True

#DAGMAN_PENDING_REPORT_INTERVAL = 20

DEFRAG_INTERVAL = 60

DEFRAG_UPDATE_INTERVAL = 30

#########################

####### Startd#########

NEGOTIATOR_INTERVAL = 20

TRUST_UID_DOMAIN = TRUE

START = TRUE

SUSPEND = FALSE

PREEMPT = FALSE

KILL = FALSE

REQUIRE_LOCAL_CONFIG_FILE = False

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 =  cpus=100%

SLOT_TYPE_1_PARTITIONABLE = true

SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)

SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage

SINGULARITY_TARGET_DIR = /srv

MOUNT_UNDER_SCRATCH = /tmp, /var/tmp

SINGULARITY_BIND_EXPR=TARGET.SingularityBind

UPDATE_INTERVAL = 10

#MAXJOBRETIREMENTTIME=5

##################
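
Two notes on the configuration above, in case they are relevant. First, as far as we understand, PREEMPT, SUSPEND, and KILL are startd policy expressions, so the PREEMPT = TRUE in the central-manager file should not affect the execute nodes, where PREEMPT = FALSE. Second, if it helps, we can raise the negotiator's log level and share the negotiation cycles; a sketch using the standard debug knobs (the log location comes from the LOG macro):

    # On the central manager:
    NEGOTIATOR_DEBUG = D_FULLDEBUG
    # Then follow the log, e.g.:
    #   tail -f $(condor_config_val LOG)/NegotiatorLog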

 

Has anyone here configured a similar policy?

 

Any comments would be welcome. Thank you.

 

Regards,

 


--------------------------------------------------------------------------------------------------
Geonmo Ryu / 류건모

Korea Institute of Science and Technology Information (KISTI)
Global Science Experimental Data Hub Center (GSDC)
245 Daehak-ro, Yuseong-gu, Daejeon, 305-806, Republic of Korea
Tel :  +82-42-869-1639, +82-10-4337-9423
Mail : geonmo@xxxxxxxxxxx / ry840901@xxxxxxxxx
--------------------------------------------------------------------------------------------------