
Re: [HTCondor-users] jobs surviving periodic_hold condition



Hi Stefano,

Hope all is well, and perhaps I will see you in Oct!

Re the below, I had an idea about what may be happening:

Currently, the rounding of DiskUsage (and ResidentSetSize) is performed only by the condor_schedd. When DiskUsage is updated in the schedd's copy of the job classad, DiskUsage (by default) is rounded up and the non-rounded value is copied into DiskUsage_RAW.
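
For example, you can see the rounded and raw values side by side with condor_q. Using job 7979205.0 from your listing below, the output would look something like this (the rounded number here is made up for illustration; only the raw value comes from your listing):

     $ condor_q 7979205.0 -af DiskUsage DiskUsage_RAW
     175000000 150625147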

So using DiskUsage_RAW in your SYSTEM_PERIODIC_HOLD expression is fine unless the job is actually in the Running state.  The reason is that these job policy expressions are evaluated by the condor_shadow while the job is running, and only by the schedd while the job is idle.  The job classad in the shadow will not have DiskUsage_RAW updated while the job is running, which likely explains the behavior you are observing below.

I am thinking it may be good for us to modify the condor_shadow source code so that it also updates the _RAW values, to avoid the problem below.

In the meantime, assuming I have guessed the problem correctly, here are some ideas on how you could work around the issue:

1. If you control your execution points (EPs), then starting with HTCondor v9.11.0 you could have the condor_startd do the DiskUsage checking by adding the following to the config of your EPs (a rough sketch of what this amounts to follows the list):
   
     use policy: HOLD_IF_DISK_EXCEEDED
        
2. Instead of using "DiskUsage_RAW" in your SYSTEM_PERIODIC_HOLD expression, use something like "JobStatus == 2 ? DiskUsage : DiskUsage_RAW" (see below for this applied to your macros).

3. You could just go with "DiskUsage" in your expression, which has the (small) downside that, since DiskUsage is rounded up, the job might go on hold while its actual usage is still just below the limit.
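
For idea #1, in case it helps to see roughly what such a startd policy amounts to, here is a minimal hand-written sketch. This is my approximation, assuming the job's DiskUsage and the slot's Disk attribute are both in KiB; it is not the exact expansion of the HOLD_IF_DISK_EXCEEDED metaknob:

     # Sketch only: approximate manual equivalent, not the metaknob's expansion.
     # DiskUsage (job ad) and Disk (slot ad) are assumed to both be in KiB.
     DISK_EXCEEDED = (DiskUsage =!= undefined && DiskUsage > Disk)
     PREEMPT = ($(PREEMPT:False)) || ($(DISK_EXCEEDED))
     WANT_HOLD = ($(DISK_EXCEEDED))
     WANT_HOLD_REASON = ifThenElse($(DISK_EXCEEDED), "Job exceeded the disk provisioned for its slot", undefined)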
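
For idea #2, applied to the TooMuchDisk and TooMuchRSS macros from your config below, the JobStatus-aware variant would look something like:

     # Use the rounded attribute while the job runs (JobStatus == 2), the raw one otherwise.
     TooMuchDisk = ((JobStatus == 2 ? DiskUsage : DiskUsage_RAW) > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
     TooMuchRSS  = ((JobStatus == 2 ? ResidentSetSize : ResidentSetSize_RAW) > 40 * (CpusProvisioned ?: RequestCpus) * 1e6)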

Hope the above helps,
Todd


On 8/22/2022 4:57 AM, Stefano Dal Pra wrote:
Hello,
condor 9.0.13 here.

We observe running jobs that should have been put on hold by the schedd for using too much disk space.
The SYSTEM_PERIODIC_HOLD clause is


SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)

And the conditions are:

SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk   = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6 )
TooMuchTime   = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))


This usually works, but at times there are jobs that survive after going over the TooMuchDisk condition:

[root@ce03-htc ~]# condor_q -all -cons 'jobstatus == 2 && DiskUsage_RAW/1e6 > 40 * CpusProvisioned' -af:j owner scheddhostname CpusProvisioned 'split(remotehost ?: lastremotehost,".@")[1]' 'DiskUsage_RAW/1e6' 'ImageSize_RAW/1e6' 'interval(time()-jobstartdate)' | sort -n -k 6
7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304 8.760128 1+16:42:51
7968484.0 pilatlas003 ce03-htc 1 wn-204-11-05-04-a 46.117129 8.7592 1+16:42:52
7968498.0 pilatlas003 ce03-htc 1 wn-205-13-01-07-a 47.892963 7.084336 1+16:42:51
7979553.0 pilatlas003 ce03-htc 1 cn-313-06-02 76.593148 8.441587999999999 2:15:05
7979600.0 pilatlas003 ce03-htc 1 cn-313-06-05 76.431658 6.094996 2:03:12
7979204.0 pilatlas003 ce03-htc 1 wn-200-11-11-01-a 98.61471400000001 4.884384 3:32:57
7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08 150.625147 33.13694 3:32:19

Here job 7979205.0 is using 150.6 GB of disk space.

I verified that after a condor_restart the shadows for these jobs do detect the condition and the jobs are put on hold:

[root@ce02-htc ~]# systemctl restart condor && tail -f /var/log/condor/SchedLog | egrep '8039630.0|8063116.0|8064391.0|8064828.0|8066107.0'

08/22/22 11:17:20 (pid:3385936) Starting add_shadow_birthdate(8039630.0)

[...]
08/22/22 11:18:22 (pid:3385936) Shadow pid 3391245 for job 8039630.0 exited with status 112
08/22/22 11:18:22 (pid:3385936) Putting job 8039630.0 on hold

[root@ce02-htc ~]# condor_q 8039630.0 -af holdreason
pilatlas002, TooMuchDisk: 35GB/core


This makes me think that somehow the shadow process might fail to detect a condition once, and then keep failing on every subsequent attempt.

I think TooMuchDisk is not the only check affected; this can also happen with others (TooMuchRSS, for example).

Is there some way to force a "PERIODIC_HOLD recheck" for a particular job, or any other suggested check?
Thanks
Stefano
