
Re: [HTCondor-users] Taking action on job breaching scratch space limit



Thanks, that removed the configuration error.

When I run the job with the following configuration:

MODIFY_REQUEST_EXPR_REQUESTDISK = $(DiskPerCore) / 320

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk DiskUsage

DISK_USAGE_EXCEEDED = (DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
use POLICY : WANT_HOLD_IF (DISK_USAGE_EXCEEDED,105,job exceeded requested disk)

While the job is running, I still see RequestDisk set to 1 on the worker, and the job is not put on hold.

# condor_who -af RequestDisk diskusage disk


1 5120023 5113327
1 5120023 5113327
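
To check the numbers above against the DISK_USAGE_EXCEEDED condition, here is a minimal Python sketch. It is not HTCondor code: the function name is hypothetical, `None` stands in for ClassAd UNDEFINED, and it only mimics the three-valued logic of `=!=`, `&&` and `>` as I understand it.

```python
def disk_usage_exceeded(disk_usage, request_disk):
    """Mimic (DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk).

    None plays the role of ClassAd UNDEFINED (an assumption of this sketch).
    """
    if disk_usage is None:
        # DiskUsage =!= UNDEFINED is false, so the whole && is false
        return False
    if request_disk is None:
        # true && (number > UNDEFINED) evaluates to UNDEFINED
        return None
    return disk_usage > request_disk


print(disk_usage_exceeded(5120023, 1))     # values reported by condor_who
print(disk_usage_exceeded(None, 1))        # DiskUsage not yet published
print(disk_usage_exceeded(5120023, None))  # RequestDisk not visible
```

With the reported values the condition comes out true, so the comparison itself does not seem to be what prevents the hold. The third call shows one case that might: if RequestDisk were not visible where the policy is evaluated, the result would be UNDEFINED rather than true.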

Thanks & Regards,
Vikrant Aggarwal


On Wed, Apr 7, 2021 at 5:45 PM <tomerp@xxxxxxxxxxx> wrote:
Please try

use POLICY : WANT_HOLD_IF (DISK_USAGE_EXCEEDED, 105, job exceeded requested disk)

(This is without quotation marks around the error message)

Tomer.

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of ervikrant06@xxxxxxxxx <ervikrant06@xxxxxxxxx>
Sent: Wednesday, April 7, 2021 3:00 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Taking action on job breaching scratch space limit

Thanks for the input.

Tomer, I tried your configuration, but it doesn't work for me. On the worker node I can see that DiskUsage is 5 GB and the disk allocated to the job is 3 GB, while RequestDisk is 1 GB.

# condor_who -af RequestDisk diskusage disk
1000000 5120014 3133873
1000000 5120014 3133873

I also modified the following parameter on both the submit and worker nodes. Before this change, the RequestDisk value on the worker node always showed as 1.

JOB_DEFAULT_REQUESTdisk = 1000000

I am using HTCondor version 8.5.8, which is old, but it does seem to support this policy according to condor_config_val use policy:want_hold_if.

With the newer HTCondor 8.8.5 and exactly the same configuration, I see the following message in /var/log/condor/StartLog, and it kills the startd.

04/07/21 07:45:39 ERROR "Syntax error in WANT_HOLD_REASON expression: 'ifThenElse((DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk), ""job exceeded requested disk"", UNDEFINED)'" at line 571 in file /slots/23/dir_1952943/userdir/.tmprZa7ap/BUILD/condor-8.8.5/src/condor_startd.V6/util.cpp
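
The doubled quotes in that log line suggest that the policy template already wraps the reason argument in quotes. The Python sketch below only illustrates that string substitution; the template shape is an assumption inferred from the log line, and `expand_hold_reason` is a hypothetical helper, not part of HTCondor.

```python
def expand_hold_reason(condition, reason_arg):
    """Build a WANT_HOLD_REASON-style expression.

    Assumption: the template puts its own quotes around the reason,
    which is what the StartLog line above appears to show.
    """
    return f'ifThenElse({condition}, "{reason_arg}", UNDEFINED)'


cond = "(DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)"

# Reason passed with quotation marks: the quotes end up doubled
quoted = expand_hold_reason(cond, '"job exceeded requested disk"')

# Reason passed bare: a single, well-formed string literal
bare = expand_hold_reason(cond, "job exceeded requested disk")

print(quoted)
print(bare)
```

The first printed expression matches the one in the error message, while the second contains exactly one quoted string. That is consistent with the advice to drop the quotation marks around the message.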



Thanks & Regards,
Vikrant Aggarwal


On Tue, Apr 6, 2021 at 8:22 PM <tomerp@xxxxxxxxxxx> wrote:
Hi Vikrant,

I'm using the following for the same purpose

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
DISK_USAGE_EXCEEDED = (JobUniverse !=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
use POLICY : WANT_HOLD_IF (DISK_USAGE_EXCEEDED, 105, "job exceeded requested disk")

Tomer.



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of ervikrant06@xxxxxxxxx <ervikrant06@xxxxxxxxx>
Sent: Tuesday, April 6, 2021 5:36 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Taking action on job breaching scratch space limit

Hello Experts,

I am trying to take action on jobs that use more scratch space than allocated.

The CondorDiskSpace attribute comes from a dynamic script that is called through condor cron. The script is included before this configuration.

CondorDiskSpace = $(CondorDiskSpace:1024)
DiskPerCore = $(CondorDiskSpace) / $(NUM_CPUS)
chunksdisk = ifThenElse( RequestDisk <= ($(DiskPerCore) * RequestCpus), quantize(RequestCpus, {1}), quantize(RequestDisk, {$(DiskPerCore)}) / $(DiskPerCore))
MODIFY_REQUEST_EXPR_REQUESTDISK = $(chunksdisk) * $(DiskPerCore)
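
As a rough illustration of the arithmetic this configuration appears to aim for, here is a small Python sketch. It assumes that quantize() with a one-element list rounds its first argument up to a multiple of that element, and that DiskPerCore behaves as a single parenthesised value (configuration macros are substituted as text, so that is worth verifying). The function names and the sample value of 128 are hypothetical.

```python
def quantize_up(value, step):
    """Round value up to the next multiple of step (integer arithmetic)."""
    return -(-value // step) * step


def modified_request_disk(request_disk, request_cpus, disk_per_core):
    """Mirror chunksdisk * DiskPerCore from the configuration above."""
    if request_disk <= disk_per_core * request_cpus:
        # The request fits in the per-core share: one chunk per core
        chunks = quantize_up(request_cpus, 1)
    else:
        # Otherwise use as many per-core chunks as the request needs
        chunks = quantize_up(request_disk, disk_per_core) // disk_per_core
    return chunks * disk_per_core


# Hypothetical example: 128 units of disk per core
print(modified_request_disk(100, 2, 128))
print(modified_request_disk(300, 1, 128))
```

Under these assumptions, a request of 100 on 2 cores is raised to two chunks (256), and a request of 300 on 1 core is rounded up to three chunks (384).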

I commented out the last two lines of the configuration above and added the following to keep the test case simple.

MODIFY_REQUEST_EXPR_REQUESTDISK = $(DiskPerCore) / 160
STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) TowerTeam DiskUsage

I confirmed from the output of condor_who -l that Disk and DiskUsage appear. However, even when it uses more scratch space than allocated, the job keeps running.

DiskUsageBreach = (DiskUsage > Disk)
WANT_HOLD = ($(PREEMPT) || $(DiskUsageBreach))

How can we put the job on hold in this scenario, once it uses more disk space in the scratch directory than it was allocated?

Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/