
Re: [HTCondor-users] DedicatedScheduler limited by group quotas



Hi Greg,

In this case, I want the parallel jobs to be limited by the quota of the group of the user who submitted them. For example, if I submit 10 parallel jobs and the maximum quota of this group is 5 slots, then the jobs shouldn't run and should sit in the idle state; but in my case the jobs aren't stopped and start to run.
At the moment, both job types share the quota, but the parallel jobs aren't limited by the "Effective Quota"; only the serial jobs are.

Or instead: is it possible to submit parallel jobs as the user and not as the DedicatedScheduler, like the serial jobs?

PS: Autoregroup and Surplus were set to False, and the previous example behaved as if it had more than 6 slots running.
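
For reference, by Autoregroup and Surplus I mean the standard negotiator knobs; our settings are roughly the following (a sketch, not a verbatim copy of our config; the 0.15 dynamic quota is the value visible in the condor_userprio output below):

# Group quota settings on the negotiator
GROUP_NAMES = ligousers
GROUP_QUOTA_DYNAMIC_ligousers = 0.15
GROUP_AUTOREGROUP = False
GROUP_ACCEPT_SURPLUS = False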

Thanks,
Pau

Message from Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> on Thu, 22 Apr 2021 at 18:07:


Hi Pau:


The assumption in HTCondor's parallel universe is that the parallel jobs should take priority over the serial jobs, so that we don't get deadlock trying to schedule the multi-node jobs. To do this, the dedicated scheduler (i.e. the part of the schedd that schedules parallel jobs), asks for resources from the negotiator as the DedicatedScheduler user, and then doles them out.
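
For context, execute nodes opt in to the dedicated scheduler through their startd configuration, typically something like the following (hostname taken from your output; this is just the usual recipe from the manual, not necessarily your exact config):

DedicatedScheduler = "DedicatedScheduler@condor01.###"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

That is why the matched resources show up under the single DedicatedScheduler user in condor_userprio rather than under the submitting user.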

In this case, do you have users with both serial and parallel jobs, and want to be able to share quota between the two job types?


-greg

On 4/22/21 5:17 AM, Pau Ruiz Gironés wrote:
Good Morning,

We have a problem related to submitting parallel jobs.

We are trying to limit parallel jobs with user group quotas, as we do for vanilla jobs, but it seems that when the jobs are submitted they are assigned to the DedicatedScheduler group/user and not to the user's own group.

Currently we have this configuration:
# Setting a MapFile
SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) Groups
CLASSAD_USER_MAPFILE_Groups = /etc/condor/groups.map
# Job Transform using owner and ownergroup
JOB_TRANSFORM_AssignGroup @=end
[
  copy_Owner = "AcctGroupUser";
  copy_AcctGroup = "RequestedAcctGroup";
  eval_set_AcctGroup = userMap("Groups", AcctGroupUser, AcctGroup);
  eval_set_AccountingGroup = join(".", userMap("Groups", AcctGroupUser, AcctGroup), AcctGroupUser);
]
@end
# Prevent Cheating
IMMUTABLE_JOB_ATTRS = $(IMMUTABLE_JOB_ATTRS) AcctGroup AcctGroupUser AccountingGroup
# Require that the user mapped into an accounting group
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) AssignGroup
SUBMIT_REQUIREMENT_AssignGroup = AcctGroup isnt undefined && AccountingGroup isnt undefined
SUBMIT_REQUIREMENT_AssignGroup_REASON = strcat("Could not map '", Owner, "' to an accounting group ", RequestedAcctGroup)
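
The groups.map file uses the usual map-file format, with entries along these lines (illustrative; the real file lists more users):

* ligo001   ligousers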

This config works well when submitting vanilla jobs: each job is assigned its group automatically and is limited via the quota.

But in the case of parallel jobs, it seems the jobs are submitted by the DedicatedScheduler user, so they don't follow the quota limit; then, a few seconds later, the usage is assigned to the user's group, which starts limiting the normal jobs.
Before:
[root@condor01 ~]# condor_userprio -all
Last Priority Update:  4/20 12:30
Group                                       Effective    Config     Use   Subtree    Effective      Real  Priority    Res  Total Usage        Usage             Last       Time Since Requested
  User Name                                     Quota     Quota  Surplus    Quota     Priority  Priority    Factor  In Use (wghted-hrs)   Start Time        Usage Time    Last Usage Resources
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
ligousers                                        5.40      0.15      no      5.40                   0.67   1000.00      1        69.51  3/19/2021 12:05  4/20/2021 12:30      <now>          1
  ligo001@condor01.###                                                                  666.96       0.67   1000.00      1        69.51  3/19/2021 12:05  4/20/2021 12:30      <now>
<none>                                           0.00      0.00     yes     36.00                   0.50   1000.00      2        17.91  2/18/2021 12:36  4/20/2021 12:10    0+00:20          2
  DedicatedScheduler@condor01.###                                                       500.00       0.50   1000.00      2         2.53  4/09/2021 10:44  4/20/2021 12:10    0+00:20
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
Number of users: 2                                                      ByQuota                                          3        72.04                   4/19/2021 12:30    0+23:59

######################################################################################################################################

After:
[root@condor01 ~]# condor_userprio -all
Last Priority Update:  4/20 12:32
Group                                       Effective    Config     Use   Subtree    Effective      Real  Priority    Res  Total Usage        Usage             Last       Time Since Requested
  User Name                                     Quota     Quota  Surplus    Quota     Priority  Priority    Factor  In Use (wghted-hrs)   Start Time        Usage Time    Last Usage Resources
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
ligousers                                        5.40      0.15      no      5.40                   0.67   1000.00      3        69.58  3/19/2021 12:05  4/20/2021 12:32      <now>          1
  ligo001@condor01.###                                                                  668.22       0.67   1000.00      3        69.58  3/19/2021 12:05  4/20/2021 12:32      <now>
<none>                                           0.00      0.00     yes     36.00                   0.50   1000.00      0        17.95  2/18/2021 12:36  4/20/2021 12:31    0+00:00          0
  DedicatedScheduler@condor01.###                                                       500.00       0.50   1000.00      2         2.53  4/09/2021 10:44  4/20/2021 12:10    0+00:22
------------------------------------------ --------- --------- ------- --------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ---------- ----------
Number of users: 2                                                      ByQuota                                          5        72.11                   4/19/2021 12:32    0+23:59


And here are the fields of this parallel job:
[root@condor01 ~]# condor_q -l
AccountingGroup = "ligousers.ligo001"
AcctGroup = "ligousers"
AcctGroupUser = "ligo001"
AllRemoteHosts = "slot1@condor02.###,slot2@condor02.###"
Args = "120"
BufferBlockSize = 32768
BufferSize = 524288
BytesRecvd = 66256.0
BytesSent = 0.0
ClusterId = 352
Cmd = "/bin/sleep"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: x86_64_CentOS7 $"
CondorVersion = "$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 PackageID: 8.8.12-1 $"
CoreSize = 0
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 2
DiskUsage = 35
DiskUsage_RAW = 33
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1618913343
Environment = ""
Err = "/dev/null"
ExecutableSize = 35
ExecutableSize_RAW = 33
ExitBySignal = false
ExitStatus = 0
FileSystemDomain = "condor01.inv.usc.es"
flavour = error
GlobalJobId = "condor01.### #352.0#1618913343"
ImageSize = 35
ImageSize_RAW = 33
In = "/dev/null"
Iwd = "/home2/ligo001/submit_test"
JobBatchName = "ParallelJob02_ligo1"
JobCurrentStartDate = 1618913343
JobLeaseDuration = 2400
JobMemoryLimit = 1.5 * RequestMemory
JobNotification = 0
JobPrio = 0
JobRequiresSandbox = true
JobRunCount = 1
JobStartDate = 1618913343
JobStatus = 2
JobUniverse = 11
JobWallTimeLimit = 28800
KillSig = "SIGTERM"
LastJobLeaseRenewal = 1618913344
LastJobStatus = 1
LastMatchTime = 1618913343
LastSuspensionTime = 0
LeaveJobInQueue = false
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 2
MinHosts = 2
MyType = "Job"
NiceUser = false
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 1
NumJobStarts = 0
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OrigMaxHosts = 2
Out = "/dev/null"
Owner = "ligo001"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = (JobStatus == 2 && RemoveWallTime) || (JobStatus == 2 && RemoveMemory)
ProcId = 0
PublicClaimId = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1122#..."
PublicClaimIds = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1122#...,<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>#1614687131#1131#..."
QDate = 1618913343
Rank = 0.0
RemoteHost = "slot1@condor02.###"
RemoteHosts = "slot1@condor02.###,slot2@condor02.###"
RemoteSlotID = 1
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 0.0
RemoveMemory = ifThenElse(ResidentSetSize =!= undefined,ResidentSetSize > JobMemoryLimit * 1024,false)
RemoveWallTime = (time() - EnteredCurrentStatus) > JobWallTimeLimit
RequestCpus = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
RootDir = "/"
Scheduler = "DedicatedScheduler@condor01.###"
ServerTime = 1618913352
ShadowBday = 1618913343
ShouldTransferFiles = "IF_NEEDED"
StartdIpAddr = "<####:9618?addrs=####-9618&noUDP&sock=167134_99d2_3>"
StartdPrincipal = "execute-side@matchsession/####"
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInputSizeMB = 0
TransferOut = false
TransferQueued = false
TransferringInput = false
User = "ligo001@condor01.###"
UserLog = "/home2/ligo001/submit_test/log"
WantCheckpoint = false
WantIOProxy = true
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT"
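
The submit file for this job is essentially the following (simplified, reconstructed from the ClassAd above):

universe                = parallel
executable              = /bin/sleep
arguments               = 120
machine_count           = 2
request_cpus            = 1
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
batch_name              = ParallelJob02_ligo1
input                   = /dev/null
output                  = /dev/null
error                   = /dev/null
log                     = /home2/ligo001/submit_test/log
queue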

Is it possible to make parallel jobs work like normal jobs in this respect?

Thanks in advance,
Pau
---------------------------------------------------
Pau Ruiz
Tel. +34 931640488

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/