
Re: [HTCondor-users] Howto set a reasonable SYSTEM_PERIODIC_REMOVE_REASON



I would do something like the following (warning: untested):

RemoveMultipleRunJobs = ( NumJobStarts >= 3 )
RemoveMultipleRunJobs_Reason = "3 or more job starts"
RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobRetirementTime ))
RemoveReadyJobs_Reason = "runtime longer than 9 days"
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600) )
RemoveHeldJobs_Reason = "being in hold state for more than 14 days"

SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                         $(RemoveMultipleRunJobs)    || \
                         $(RemoveReadyJobs)

SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
                                $(RemoveHeldJobs)        ? $(RemoveHeldJobs_Reason)        : \
                                $(RemoveMultipleRunJobs) ? $(RemoveMultipleRunJobs_Reason) : \
                                $(RemoveReadyJobs)       ? $(RemoveReadyJobs_Reason)       : \
                                "unknown reason")

-Mat


On 10/1/20 2:59 AM, Beyer, Christoph wrote:
Hi,

this bothered us for a while, and maybe it could end up in the recipes somehow (?)

Our SYSTEM_PERIODIC_REMOVE expression looks like this:

RemoveMultipleRunJobs = ( NumJobStarts >= 3 )
RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobRetirementTime ))
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600) )
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                          $(RemoveMultipleRunJobs)    || \
                          $(RemoveReadyJobs)

The default SYSTEM_PERIODIC_REMOVE_REASON looks like this:
ShadowLog.old:09/30/20 07:25:58 (9862228.0) (1574665): Job 9862228.0 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (NumJobStarts >= 3) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE

This does not really mean anything to the user, and even as an admin you need to recheck the job ClassAd to find the actual removal reason.

The following sets SYSTEM_PERIODIC_REMOVE_REASON according to the actual removal reason (who would have thought):

SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
ifThenElse(JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*9, \
"runtime being longer than 9 days", \
ifThenElse(JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*24*14, \
"being in hold state for more than 14 days", \
"3 or more job starts") \
) )

(The syntax is, of course, similar for SYSTEM_PERIODIC_HOLD_REASON.)
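
As a rough, untested sketch of what the hold-side analogue might look like: the macro names HoldLongRunningJobs and HoldLongIdleJobs are hypothetical, and the 2-day and 30-day thresholds are made up purely for illustration. Only the job attributes already used above (JobStatus, CurrentTime, EnteredCurrentStatus) appear in the expressions.

```
# Hypothetical macro names; thresholds are illustrative, not recommendations.
# JobStatus == 2 is running, JobStatus == 1 is idle.
HoldLongRunningJobs = ( JobStatus == 2 && (CurrentTime - EnteredCurrentStatus) > 2 * 24 * 3600 )
HoldLongIdleJobs    = ( JobStatus == 1 && (CurrentTime - EnteredCurrentStatus) > 30 * 24 * 3600 )

SYSTEM_PERIODIC_HOLD = $(HoldLongRunningJobs) || $(HoldLongIdleJobs)

# Only two conditions can trigger the hold, so a single ifThenElse is enough
# to pick the matching text.
SYSTEM_PERIODIC_HOLD_REASON = strcat("Job held by SYSTEM_PERIODIC_HOLD due to ", \
    ifThenElse($(HoldLongRunningJobs), \
        "running for more than 2 days", \
        "being idle for more than 30 days"))
```

Reusing the same $(...) macros in both the hold expression and the reason keeps the thresholds in one place, so the message cannot drift away from the condition that actually triggered.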

Tested in $CondorVersion: 8.9.3

Best
christoph