
[HTCondor-users] jobs surviving periodic_hold condition



Hello,
condor 9.0.13 here.

We observe running jobs that should have been put on hold by the schedd for using too much disk space.
The SYSTEM_PERIODIC_HOLD clause is


SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)

And the conditions are:

SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6 )
TooMuchTime = (JobStatus == 2 && (time() - JobStartDate > 86400 * 7))
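
For reference, the fully macro-expanded expression the schedd evaluates can be dumped with condor_config_val on the schedd host; assuming SYSTEM_PERIODIC_HOLD is not set anywhere else, it should come out as:

[root@ce03-htc ~]# condor_config_val SYSTEM_PERIODIC_HOLD
False || (NumJobStarts == 1 && JobStatus == 1) || (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000) || (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6 ) || (JobStatus == 2 && (time() - JobStartDate > 86400 * 7))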


This usually works, but at times there are jobs that survive after going over the
TooMuchDisk condition:

[root@ce03-htc ~]# condor_q -all -cons 'jobstatus == 2 && DiskUsage_RAW/1e6 > 40 * CpusProvisioned' -af:j owner scheddhostname CpusProvisioned 'split(remotehost ?: lastremotehost,".@")[1]' 'DiskUsage_RAW/1e6' 'ImageSize_RAW/1e6' 'interval(time()-jobstartdate)' | sort -n -k 6
7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304 8.760128 1+16:42:51
7968484.0 pilatlas003 ce03-htc 1 wn-204-11-05-04-a 46.117129 8.7592 1+16:42:52
7968498.0 pilatlas003 ce03-htc 1 wn-205-13-01-07-a 47.892963 7.084336 1+16:42:51
7979553.0 pilatlas003 ce03-htc 1 cn-313-06-02 76.593148 8.441587999999999 2:15:05
7979600.0 pilatlas003 ce03-htc 1 cn-313-06-05 76.431658 6.094996 2:03:12
7979204.0 pilatlas003 ce03-htc 1 wn-200-11-11-01-a 98.61471400000001 4.884384 3:32:57
7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08 150.625147 33.13694 3:32:19

Here job 7979205.0 is using 150.6 GB of disk space.
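
For what it's worth, the TooMuchDisk sub-expression does evaluate to true against that job's ad, so the expression itself does not seem to be the problem; e.g.:

[root@ce03-htc ~]# condor_q 7979205.0 -af 'DiskUsage_RAW > 35 * (CpusProvisioned ?: RequestCpus) * 1024000'
true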

I verified that after a condor restart the shadows for these jobs detect their condition and put them on hold:

[root@ce02-htc ~]# systemctl restart condor && tail -f /var/log/condor/SchedLog | egrep '8039630.0|8063116.0|8064391.0|8064828.0|8066107.0'

08/22/22 11:17:20 (pid:3385936) Starting add_shadow_birthdate(8039630.0)

[...]
08/22/22 11:18:22 (pid:3385936) Shadow pid 3391245 for job 8039630.0 exited with status 112
08/22/22 11:18:22 (pid:3385936) Putting job 8039630.0 on hold

[root@ce02-htc ~]# condor_q 8039630.0 -af holdreason
pilatlas002, TooMuchDisk: 35GB/core
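
In case it is relevant: as far as I understand, both the schedd and the shadows re-evaluate periodic expressions on a timer governed by these knobs (defaults should be 60s, 1200s and 0.01), though I don't know whether the timeslice backoff could make individual jobs be skipped:

[root@ce02-htc ~]# condor_config_val PERIODIC_EXPR_INTERVAL MAX_PERIODIC_EXPR_INTERVAL PERIODIC_EXPR_TIMESLICE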


This makes me think that the shadow process might somehow fail to detect a condition once, and then keep failing on every subsequent attempt.

I suspect TooMuchDisk is not the only condition affected; the same can presumably happen with the other checks (TooMuchRSS, for example).

Is there some way to force a "PERIODIC_HOLD recheck" for a particular job, or any other check you would suggest?
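
As a manual stopgap I can of course hold such a job myself with a matching reason, e.g.

[root@ce03-htc ~]# condor_hold -reason 'TooMuchDisk: 35GB/core' 7979205.0

but that obviously doesn't explain why the periodic expression stopped firing for it.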
Thanks
Stefano