
Re: [HTCondor-users] CommitedTime stays at 0




Hi Martin,

Re the below :

** As for the best way to put a time limit on long jobs:

I would suggest adding "allowed_execute_duration" to your job submit description file. From the manual (the value is in seconds):
"allowed_execute_duration = <integer>
 The longest time for which a job may be executing. Jobs which exceed this duration will go on hold. This time does not include file-transfer time. Jobs which self-checkpoint have this long to write out each checkpoint. This attribute is intended to help minimize the time wasted by jobs which may erroneously run forever."
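
As a minimal sketch of how this might look in a submit file (the executable name is a placeholder, and the two-week limit is only borrowed from the configuration quoted further down):

```
# Hypothetical submit description; long_job.sh is a placeholder name
executable = long_job.sh

# Put the job on hold if a single execution runs longer than
# 14 days (14 * 24 * 3600 = 1209600 seconds)
allowed_execute_duration = 1209600

queue
```

Since the limit applies to each execution attempt, a job that is held and later released should get a fresh allowance on its next run.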

However, allowed_execute_duration was added in HTCondor v9.7.0, so I guess it will not help if you need to keep running v9.0.x for some reason, which is pretty far behind the times at this point. 

If you must keep running v9.0.x, and you control the configuration of your execution points (startds, i.e. the worker nodes), you could add to the HTCondor configuration of your execution points one of the following which will limit the runtime of any job that lands there:
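
The snippet itself did not come through in this copy of the message. As a rough sketch only, and not necessarily the configuration meant above, one common pattern combines PREEMPT with WANT_HOLD on the execution point. The macro names MAX_JOB_RUNTIME and JOB_RAN_TOO_LONG are made up for illustration, the two-week value is a placeholder, and the sketch assumes WANT_SUSPEND evaluates to False on the slot and that the slot advertises TotalJobRunTime:

```
# Hypothetical macro: runtime limit of 14 days, in seconds
MAX_JOB_RUNTIME = 1209600

# Hypothetical macro: true once the job on this slot has run past the limit.
# TotalJobRunTime is assumed to be the slot attribute counting the seconds
# the current job has been running here.
JOB_RAN_TOO_LONG = (TotalJobRunTime > $(MAX_JOB_RUNTIME))

# Evict such jobs, in addition to whatever PREEMPT already covers
PREEMPT = ($(PREEMPT)) || ($(JOB_RAN_TOO_LONG))

# Put the evicted job on hold instead of returning it to the queue as idle
WANT_HOLD = ($(JOB_RAN_TOO_LONG))
WANT_HOLD_REASON = "Job exceeded the maximum allowed runtime on this execution point"
```

Because this lives in the execution point configuration, it applies to every job that lands there, with no changes to submit files.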


** As for issues with the job attribute "CommittedTime":

CommittedTime historically worked only with "standard" universe jobs from back in HTCondor v8.x and earlier, which has made this job attribute useless for the past couple years.  However, this has finally been rectified in HTCondor v10.8.0 and above --- from the version history for HTCondor v10.8.0 :

"Self-checkpointing jobs may now include the time spent generating successfully-stored checkpoints as part of their `CommittedTime` job ad attribute."

Note that HTCondor v10.8.0 is scheduled to be released this week, perhaps later today.

Ticket is at https://opensciencegrid.atlassian.net/browse/HTCONDOR-1942

Hope the above helps,
Todd


On 9/12/2023 11:54 AM, Beaumont, Martin wrote:

Oops, forgot to mention. This is for:

 

$CondorVersion: 9.0.17 May 27 2023 BuildID: 649540 PackageID: 9.0.17-3 $

$CondorPlatform: x86_64_Rocky8 $

 

Martin

 

From: Beaumont, Martin
Sent: September 12, 2023 12:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: CommitedTime stays at 0

 

Hi all,

 

Quick question: is it normal that the Job ClassAd CommittedTime keeps being 0, even after job completion?

 

After a few quick tests:

It stays at 0 with parallel and vanilla jobs while using dynamic partitionable slots.

It stays at 0 with parallel jobs without dynamic partitionable slots.

"condor_history -long" finally shows something higher than 0 with serial jobs without dynamic partitionable slots.

 

I'm trying to find the best way to put a time limit on long jobs, put them on Hold temporarily to let other higher priority queued jobs get their chance, and then release the long jobs to get back in queue.

Keep in mind I have parallel and serial jobs running simultaneously on all execute nodes. So normal pre-empting across all slots doesn't work.

Also, for MPI jobs, "save points" are the responsibility of the R&D software/wrapper/user to handle. The working dir and apps are all on NFS.

 

So far, I came up with this configuration (timings to be confirmed):

 

---------------------------------------------

# Prioritization using 2 groups

NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

GROUP_NAMES                  =   low_priority, high_priority

GROUP_QUOTA_low_priority     =   1

GROUP_QUOTA_high_priority    =   1000000

 

# Force submitters to use the prioritization groups

SUBMIT_REQUIREMENT_NAMES = accountinggroup

SUBMIT_REQUIREMENT_accountinggroup = IfThenElse( AccountingGroup Isnt UNDEFINED, IfThenElse( stringListMember( AcctGroup, "low_priority, high_priority"), TRUE, FALSE), FALSE)

SUBMIT_REQUIREMENT_accountinggroup_REASON = "accounting_group must be one of: low_priority, high_priority"

 

# Put jobs on Hold if running longer than 2 weeks

#SYSTEM_PERIODIC_HOLD = ( RemoteWallClockTime - CumulativeSuspensionTime ) > 1209600

SYSTEM_PERIODIC_HOLD = ( RemoteUserCpu / RequestCpus ) > 1209600

#SYSTEM_PERIODIC_HOLD = ( CommittedTime - CommittedSuspensionTime ) > 1209600

 

# Release Held jobs every 10mins for a maximum of 5 times

SYSTEM_PERIODIC_RELEASE = (JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600 )

 

# Finally, remove jobs that have been put in Run state 5 times

SYSTEM_PERIODIC_REMOVE = (JobRunCount == 5)

---------------------------------------------

 

I can't use RemoteWallClockTime since it accumulates and does not reset during the Hold/Release process.

I can't subtract CommittedSuspensionTime since, like CommittedTime, it stays at 0.

The Cumulative* ClassAd attributes don't seem to update during job execution.

I don't understand how to use AllowedJobDuration, as it doesn't show up in my jobs' ClassAds by default. I'd like to manage this from my side (config file), not the user's (submit file).

The best workaround I found was to use RemoteUserCpu and RequestCpus, but doing so will miss a bugged job that is sitting there without using CPU time.

 

Any suggestions?

 

Thanks!

 

Martin

 



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685