
Re: [HTCondor-users] stdout/stderr for evicted jobs



Hello Maarten,

On 27/07/22 15:42, Maarten.Litmaath@xxxxxxx wrote:
Hi all,

1. How can one make a grid job fail on its very first eviction?
I do not want HTCondor to try and "rescue" such jobs...
Here is what we set at CNAF:
- in the batch/side schedd config of the CE:

# a few boolean conditions for jobs to evict (put on hold)
SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6)
TooMuchTime = (JobStatus == 2 && (time() - JobStartDate > 86400 * 7))

# put on hold any job that meets at least one of the conditions above
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)

#cumbersome way to log which hold reason applied, plus job owner.
SYSTEM_PERIODIC_HOLD_REASON = strcat(Owner,{"",", Second start not allowed",", TooMuchDisk: 35GB/core", ", TooMuchRSS: 20GB/core","Job runtime > 20 days"}[max({int($(SecondStart)),int($(TooMuchDisk))*2,int($(TooMuchRSS))*3,int($(TooMuchTime))*4})])
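
How the indexing above works: int() turns each condition into 0 or 1, the multiplier gives it its position in the list of messages, and max() selects the highest position among the conditions that are true. If several conditions hold at once, the one listed last wins. The same selection can probably be written with nested ifThenElse() calls, which may be easier to read. This is only an untested sketch that reuses the macros and message strings from above unchanged:

```
SYSTEM_PERIODIC_HOLD_REASON = strcat(Owner, \
    ifThenElse($(TooMuchTime), "Job runtime > 20 days", \
    ifThenElse($(TooMuchRSS), ", TooMuchRSS: 20GB/core", \
    ifThenElse($(TooMuchDisk), ", TooMuchDisk: 35GB/core", \
    ifThenElse($(SecondStart), ", Second start not allowed", "")))))
```

The conditions are tested from last to first so that the precedence matches the max() version.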

#purge from the queue jobs on hold for more than 2 hours.
SYSTEM_PERIODIC_REMOVE = ( $(SYSTEM_PERIODIC_REMOVE:False) || (JobStatus == 5 && (CurrentTime - EnteredCurrentStatus > 3600 *2) ))
SYSTEM_PERIODIC_REMOVE_REASON = strcat("local job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*2), "being in the hold state for 2 hours.","Unclear reason, see ce.conf"))

####

In the condor-ce jobrouter we also set:

JOB_ROUTER_USE_DEPRECATED_ROUTER_ENTRIES = False

# Jobs can only start once.
JOB_ROUTER_TRANSFORM_PeriodicHold @=jrt
 SET Periodic_Hold = (NumJobStarts >= 1 && JobStatus == 1) || NumJobStarts > 1
@jrt

Stefano
[SNIP]