Re: [HTCondor-users] CommitedTime stays at 0

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Tue, 12 Sep 2023 14:06:47 -0500

From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] CommitedTime stays at 0

Hi Martin,

Re the below :

** As far as the best way to put a time limit on long jobs :

I would suggest adding "allowed_execute_duration" to your job's submit description file. From the Manual:

allowed_execute_duration = <integer>
The longest time for which a job may be executing. Jobs which exceed this duration will go on hold. This time does not include file-transfer time. Jobs which self-checkpoint have this long to write out each checkpoint. This attribute is intended to help minimize the time wasted by jobs which may erroneously run forever.

However, allowed_execute_duration was added in HTCondor v9.7.0, so I guess it will not help if you need to keep running v9.0.x for some reason, which is pretty far behind the times at this point.
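
For illustration, here is a minimal submit-file sketch. It assumes a two-week limit (1209600 seconds, the same figure used in the configuration quoted further down); the executable name and the CPU request are hypothetical placeholders, not taken from your setup:

```
# Hypothetical submit description; long_job.sh is a placeholder name
executable = long_job.sh
request_cpus = 1

# Put the job on hold if it executes for more than 14 days (14 * 86400 s)
allowed_execute_duration = 1209600

queue
```

File-transfer time does not count toward the limit, per the manual text above.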

If you must keep running v9.0.x, and you control the configuration of your execution points (startds, i.e. the worker nodes), you could add to the HTCondor configuration of your execution points one of the following which will limit the runtime of any job that lands there:

use policy:Preempt_if_Runtime_Exceeds( limit_in_seconds )

Limits running jobs to a maximum of the specified time using preemption. (The default limit is 24 hours). This will kick the job off the machine, and the job will go back to "Idle" state to be rescheduled to run again.

use policy:Hold_if_Runtime_Exceeds( limit_in_seconds )

Limits running jobs to a maximum of the specified time by placing them on hold immediately (ignoring any job retirement time). (The default limit is 24 hours). Jobs that exceed the specified runtime will go on hold with a hold reason explaining that the max runtime was exceeded.
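
As a sketch only, assuming the same two-week limit as an example value, the execution-point configuration might contain one of the two lines below (not both):

```
# HTCondor configuration on each execution point (startd).
# 1209600 seconds = 14 days; the figure is only an example.

# Option A: preempt, so the job returns to Idle and is rescheduled
#use policy:Preempt_if_Runtime_Exceeds( 1209600 )

# Option B: put the job on hold with an explanatory hold reason
use policy:Hold_if_Runtime_Exceeds( 1209600 )
```

The change would normally be picked up after running condor_reconfig on those nodes.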

** As far as issues with the job attribute "CommittedTime" :

CommittedTime historically worked only with "standard" universe jobs from back in HTCondor v8.x and earlier, which has made this job attribute useless for the past couple years. However, this has finally been rectified in HTCondor v10.8.0 and above --- from the version history for HTCondor v10.8.0 :

"Self-checkpointing jobs may now include the time spent generating successfully-stored checkpoints as part of their `CommittedTime` job ad attribute."

Note that HTCondor v10.8.0 is scheduled to be released this week, perhaps later today.

Ticket is at https://opensciencegrid.atlassian.net/browse/HTCONDOR-1942

Hope the above helps,
Todd

On 9/12/2023 11:54 AM, Beaumont, Martin wrote:

Oops, forgot to mention. This is for:

$CondorVersion: 9.0.17 May 27 2023 BuildID: 649540 PackageID: 9.0.17-3 $

$CondorPlatform: x86_64_Rocky8 $

Martin

From: Beaumont, Martin
Sent: September 12, 2023 12:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: CommitedTime stays at 0

Hi all,

Quick question: is it normal that the job ClassAd attribute CommittedTime stays at 0, even after job completion?

After a few quick tests:

It stays at 0 with parallel and vanilla jobs while using dynamic partitionable slots.

It stays at 0 with parallel jobs without dynamic partitionable slots.

"condor_history -long" finally shows something higher than 0 with serial jobs without dynamic partitionable slots.

I'm trying to find the best way to put a time limit on long jobs, put them on Hold temporarily to let other higher priority queued jobs get their chance, and then release the long jobs to get back in queue.

Keep in mind I have parallel and serial jobs running simultaneously on all execute nodes. So normal pre-empting across all slots doesn't work.

Also, for MPI jobs, "save points" are the responsibility of the R&D software/wrapper/user to handle. The working dir and apps are all on NFS.

So far, I came up with this configuration (timings to be confirmed):

---------------------------------------------

# Prioritization using 2 groups

NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

GROUP_NAMES = low_priority, high_priority

GROUP_QUOTA_low_priority = 1

GROUP_QUOTA_high_priority = 1000000

# Force submitters to use the prioritization groups

SUBMIT_REQUIREMENT_NAMES = accountinggroup

SUBMIT_REQUIREMENT_accountinggroup = IfThenElse( AccountingGroup Isnt UNDEFINED, IfThenElse( stringListMember( AcctGroup, "low_priority, high_priority"), TRUE, FALSE), FALSE)

SUBMIT_REQUIREMENT_accountinggroup_REASON = "accounting_group must be one of: low_priority, high_priority"

# Put jobs on Hold if running longer than 2 weeks

#SYSTEM_PERIODIC_HOLD = ( RemoteWallClockTime - CumulativeSuspensionTime ) > 1209600

SYSTEM_PERIODIC_HOLD = ( RemoteUserCpu / RequestCpus ) > 1209600

#SYSTEM_PERIODIC_HOLD = ( CommittedTime - CommittedSuspensionTime ) > 1209600

# Release Held jobs after 10 minutes on hold, for a maximum of 5 runs

SYSTEM_PERIODIC_RELEASE = (JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600 )

# Finally, remove jobs that have been put in Run state 5 times

SYSTEM_PERIODIC_REMOVE = (JobRunCount == 5)

---------------------------------------------

I can't use RemoteWallClockTime since it accumulates and does not reset during the Hold/Release process.

I can't subtract CommittedSuspensionTime since, like CommittedTime, it stays at 0.

The Cumulative* ClassAd attributes don't seem to update during job execution.

I don't understand how to use AllowedJobDuration, as it doesn't show up in my jobs' ClassAds by default. I'd like to manage this from my side (config file), not the user's (submit file).

The best workaround I found was to use RemoteUserCpu and RequestCpus, but doing so will miss a bugged job that is sitting there without using CPU time.

Any suggestions?

Thanks!

Martin


-- Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison Center for High Throughput Computing Department of Computer Sciences Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257 Phone: (608) 263-7132 Madison, WI 53706-1685
