
Re: [HTCondor-users] CommittedTime stays at 0



Oops, forgot to mention. This is for:

 

$CondorVersion: 9.0.17 May 27 2023 BuildID: 649540 PackageID: 9.0.17-3 $

$CondorPlatform: x86_64_Rocky8 $

 

Martin

 

From: Beaumont, Martin
Sent: September 12, 2023 12:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: CommittedTime stays at 0

 

Hi all,

 

Quick question: is it normal for the job ClassAd attribute CommittedTime to stay at 0, even after the job completes?

 

After a few quick tests:

It stays at 0 with parallel and vanilla jobs while using dynamic partitionable slots.

It stays at 0 with parallel jobs without dynamic partitionable slots.

“condor_history -long” only shows a value greater than 0 for serial jobs without dynamic partitionable slots.

 

I’m trying to find the best way to put a time limit on long jobs: put them on Hold temporarily so that higher-priority queued jobs get their chance, then release the long jobs back into the queue.

Keep in mind I have parallel and serial jobs running simultaneously on all execute nodes. So normal pre-empting across all slots doesn’t work.

Also, for MPI jobs, “save points” are the responsibility of the R&D software/wrapper/user to handle. The working dir and apps are all on NFS.

 

So far, I came up with this configuration (timings to be confirmed):

 

---------------------------------------------

# Prioritization using 2 groups

NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

GROUP_NAMES                  =   low_priority, high_priority

GROUP_QUOTA_low_priority     =   1

GROUP_QUOTA_high_priority    =   1000000

 

# Force submitters to use the prioritization groups

SUBMIT_REQUIREMENT_NAMES = accountinggroup

SUBMIT_REQUIREMENT_accountinggroup = IfThenElse( AccountingGroup Isnt UNDEFINED, IfThenElse( stringListMember( AcctGroup, "low_priority, high_priority"), TRUE, FALSE), FALSE)

SUBMIT_REQUIREMENT_accountinggroup_REASON = "accounting_group must be one of: low_priority, high_priority"

 

# Put jobs on Hold if running longer than 2 weeks

#SYSTEM_PERIODIC_HOLD = ( RemoteWallClockTime - CumulativeSuspensionTime ) > 1209600

SYSTEM_PERIODIC_HOLD = ( RemoteUserCpu / RequestCpus ) > 1209600

#SYSTEM_PERIODIC_HOLD = ( CommittedTime - CommittedSuspensionTime ) > 1209600

 

# Release Held jobs after 10 minutes on Hold, as long as they have run fewer than 5 times

SYSTEM_PERIODIC_RELEASE = (JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600 )

 

# Finally, remove jobs that have been put in Run state 5 times

SYSTEM_PERIODIC_REMOVE = (JobRunCount == 5)

---------------------------------------------
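Since the timings are still to be confirmed, here is a minimal sketch in plain Python (not HTCondor configuration) that just converts the two second counts used above into readable durations. The constant and function names are only illustrative.

```python
from datetime import timedelta

# Values copied from the configuration above.
HOLD_LIMIT_SECONDS = 1209600    # threshold used in SYSTEM_PERIODIC_HOLD
RELEASE_DELAY_SECONDS = 600     # delay used in SYSTEM_PERIODIC_RELEASE


def as_timedelta(seconds: int) -> timedelta:
    """Convert a raw second count into a timedelta for easy reading."""
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    print(as_timedelta(HOLD_LIMIT_SECONDS))     # 14 days, 0:00:00
    print(as_timedelta(RELEASE_DELAY_SECONDS))  # 0:10:00
```

So 1209600 seconds is exactly 2 weeks and 600 seconds is 10 minutes, matching the comments in the config.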

 

I can’t use RemoteWallClockTime since it accumulates and does not reset during the Hold/Release process.

I can’t subtract CommittedSuspensionTime since, like CommittedTime, it stays at 0.

The Cumulative* ClassAd attributes don’t seem to update during job execution.

I don’t understand how to use AllowedJobDuration, since it doesn’t show up in my job ClassAds by default. I’d like to manage this from my side (config file), not the user’s (submit file).

The best workaround I found was to use RemoteUserCpu and RequestCpus, but that will miss a bugged job that just sits there without using CPU time.
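
To illustrate that blind spot, here is a rough sketch in plain Python that mimics the arithmetic of the ( RemoteUserCpu / RequestCpus ) > 1209600 expression. It is only an approximation for illustration: it is not evaluated by HTCondor, it does not model ClassAd UNDEFINED semantics, and the function name and sample values are made up.

```python
TWO_WEEKS = 14 * 24 * 3600  # 1209600 seconds, the limit used in the config


def cpu_based_hold(remote_user_cpu: float, request_cpus: int,
                   limit: int = TWO_WEEKS) -> bool:
    """Mimic ( RemoteUserCpu / RequestCpus ) > limit in plain Python."""
    return (remote_user_cpu / request_cpus) > limit


if __name__ == "__main__":
    fifteen_days = 15 * 24 * 3600

    # An 8-core job that has used 15 days of CPU time per core goes on hold.
    print(cpu_based_hold(remote_user_cpu=8 * fifteen_days, request_cpus=8))

    # A hung job that has used no CPU time never crosses the limit,
    # no matter how long it has been sitting in a slot.
    print(cpu_based_hold(remote_user_cpu=0.0, request_cpus=8))
```

The first call evaluates to True and the second to False, which is exactly the case I’m worried about: the idle job is never held.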

 

Any suggestions?

 

Thanks!

 

Martin