
[HTCondor-users] DedicatedScheduler limited by group quotas



Good Morning,

We have a problem when submitting parallel jobs.

We are trying to limit parallel jobs with accounting-group quotas, as we already do for vanilla jobs, but when parallel jobs are submitted they seem to be charged to the DedicatedScheduler group/user rather than to the submitting user's group.

We currently have this configuration:
# Setting a MapFile
SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) Groups
CLASSAD_USER_MAPFILE_Groups = /etc/condor/groups.map
# Job Transform using owner and ownergroup
JOB_TRANSFORM_AssignGroup @=end
[
 copy_Owner = "AcctGroupUser";
 copy_AcctGroup = "RequestedAcctGroup";
 eval_set_AcctGroup = userMap("Groups", AcctGroupUser, AcctGroup);
 eval_set_AccountingGroup = join(".", userMap("Groups", AcctGroupUser, AcctGroup), AcctGroupUser);
]
@end
# Prevent Cheating
IMMUTABLE_JOB_ATTRS = $(IMMUTABLE_JOB_ATTRS) AcctGroup AcctGroupUser AccountingGroup
# Require that the user mapped into an accounting group
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) AssignGroup
SUBMIT_REQUIREMENT_AssignGroup = AcctGroup isnt undefined && AccountingGroup isnt undefined
SUBMIT_REQUIREMENT_AssignGroup_REASON = strcat("Could not map '", Owner, "' to an accounting group ", RequestedAcctGroup)
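For reference, the map file named above follows the standard HTCondor map-file layout (method, principal pattern, result). The entries below are hypothetical, just to illustrate the shape of /etc/condor/groups.map:

```
# /etc/condor/groups.map -- hypothetical example entries
# Format: <method> <principal-pattern> <result>; "*" matches any method.
* ligo001 ligousers
* ligo002 ligousers
```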

This config works well for vanilla jobs: they are assigned to their group automatically and are limited by the quota.

But in the case of parallel jobs, the jobs appear to be submitted as the DedicatedScheduler user, so they do not count against the group quota at match time; a few seconds later the usage is charged to the user's group, which then limits that user's normal jobs.
Before:
[root@condor01 ~]# condor_userprio -all
Last Priority Update:  4/20 12:30
Group                                      Effective  Config     Use  Subtree  Effective     Real  Priority    Res  Total Usage       Usage            Last       Time Since Requested
  User Name                                    Quota   Quota Surplus    Quota   Priority Priority    Factor In Use (wghted-hrs)    Start Time      Usage Time     Last Usage Resources
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
ligousers                                       5.40    0.15 no          5.40                 0.67  1000.00      1        69.51  3/19/2021 12:05  4/20/2021 12:30      <now>          1
  ligo001@condor01.###                                                        666.96     0.67  1000.00      1        69.51  3/19/2021 12:05  4/20/2021 12:30      <now>
<none>                                          0.00    0.00 yes        36.00                 0.50  1000.00      2        17.91  2/18/2021 12:36  4/20/2021 12:10    0+00:20          2
  DedicatedScheduler@condor01.###                                             500.00     0.50  1000.00      2         2.53  4/09/2021 10:44  4/20/2021 12:10    0+00:20
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
Number of users: 2                                             ByQuota                                      3        72.04                   4/19/2021 12:30    0+23:59

######################################################################################################################################

After:
[root@condor01 ~]# condor_userprio -all
Last Priority Update:  4/20 12:32
Group                                      Effective  Config     Use  Subtree  Effective     Real  Priority    Res  Total Usage       Usage            Last       Time Since Requested
  User Name                                    Quota   Quota Surplus    Quota   Priority Priority    Factor In Use (wghted-hrs)    Start Time      Usage Time     Last Usage Resources
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
ligousers                                       5.40    0.15 no          5.40                 0.67  1000.00      3        69.58  3/19/2021 12:05  4/20/2021 12:32      <now>          1
  ligo001@condor01.###                                                        668.22     0.67  1000.00      3        69.58  3/19/2021 12:05  4/20/2021 12:32      <now>
<none>                                          0.00    0.00 yes        36.00                 0.50  1000.00      0        17.95  2/18/2021 12:36  4/20/2021 12:31    0+00:00          0
  DedicatedScheduler@condor01.###                                             500.00     0.50  1000.00      2         2.53  4/09/2021 10:44  4/20/2021 12:10    0+00:22
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
Number of users: 2                                             ByQuota                                      5        72.11                   4/19/2021 12:32    0+23:59


And here are the attributes of this parallel job:
[root@condor01 ~]# condor_q -l
AccountingGroup = "ligousers.ligo001"
AcctGroup = "ligousers"
AcctGroupUser = "ligo001"
AllRemoteHosts = "slot1@condor02.###,slot2@condor02.###"
Args = "120"
BufferBlockSize = 32768
BufferSize = 524288
BytesRecvd = 66256.0
BytesSent = 0.0
ClusterId = 352
Cmd = "/bin/sleep"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: x86_64_CentOS7 $"
CondorVersion = "$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 PackageID: 8.8.12-1 $"
CoreSize = 0
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 2
DiskUsage = 35
DiskUsage_RAW = 33
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1618913343
Environment = ""
Err = "/dev/null"
ExecutableSize = 35
ExecutableSize_RAW = 33
ExitBySignal = false
ExitStatus = 0
FileSystemDomain = "condor01.inv.usc.es"
flavour = error
GlobalJobId = "condor01.### #352.0#1618913343"
ImageSize = 35
ImageSize_RAW = 33
In = "/dev/null"
Iwd = "/home2/ligo001/submit_test"
JobBatchName = "ParallelJob02_ligo1"
JobCurrentStartDate = 1618913343
JobLeaseDuration = 2400
JobMemoryLimit = 1.5 * RequestMemory
JobNotification = 0
JobPrio = 0
JobRequiresSandbox = true
JobRunCount = 1
JobStartDate = 1618913343
JobStatus = 2
JobUniverse = 11
JobWallTimeLimit = 28800
KillSig = "SIGTERM"
LastJobLeaseRenewal = 1618913344
LastJobStatus = 1
LastMatchTime = 1618913343
LastSuspensionTime = 0
LeaveJobInQueue = false
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 2
MinHosts = 2
MyType = "Job"
NiceUser = false
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 1
NumJobStarts = 0
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OrigMaxHosts = 2
Out = "/dev/null"
Owner = "ligo001"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = (JobStatus == 2 && RemoveWallTime) || (JobStatus == 2 && RemoveMemory)
ProcId = 0
PublicClaimId = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1122#..."
PublicClaimIds = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1122#...,<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1131#..."
QDate = 1618913343
Rank = 0.0
RemoteHost = "slot1@condor02.###"
RemoteHosts = "slot1@condor02.###,slot2@condor02.###"
RemoteSlotID = 1
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 0.0
RemoveMemory = ifThenElse(ResidentSetSize =!= undefined,ResidentSetSize > JobMemoryLimit * 1024,false)
RemoveWallTime = (time() - EnteredCurrentStatus) > JobWallTimeLimit
RequestCpus = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
RootDir = "/"
Scheduler = "DedicatedScheduler@condor01.###"
ServerTime = 1618913352
ShadowBday = 1618913343
ShouldTransferFiles = "IF_NEEDED"
StartdIpAddr = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>"
StartdPrincipal = "execute-side@matchsession/####"
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInputSizeMB = 0
TransferOut = false
TransferQueued = false
TransferringInput = false
User = "ligo001@condor01.###"
UserLog = "/home2/ligo001/submit_test/log"
WantCheckpoint = false
WantIOProxy = true
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT"
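For completeness, the job was submitted with a parallel-universe submit file along these lines (a sketch reconstructed from the job ad above; the exact file may differ):

```
# Sketch of the submit file, reconstructed from the job ad
universe      = parallel
executable    = /bin/sleep
arguments     = 120
machine_count = 2
input         = /dev/null
output        = /dev/null
error         = /dev/null
log           = /home2/ligo001/submit_test/log
batch_name    = ParallelJob02_ligo1
queue
```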

Is it possible to make parallel jobs respect group quotas the way normal jobs do?

Thanks in advance,
Pau
---------------------------------------------------
Pau Ruiz
Tel. +34 931640488