Re: [HTCondor-users] Disk space consumability


The DiskUsage attribute in the job ad seems to account only for HTCondor-mediated data transfers, not runtime usage.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Oliver Freyermuth
Sent: Sunday, November 17, 2019 10:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Disk space consumability

Dear Vincent,

thanks for the idea!
Indeed, we are already doing this, marking a node as "unhealthy" (i.e. the START expression becomes false) if the usage of the pool directory exceeds 80 %.
This is not sufficient for the present case, though: those jobs run for a few hours, slowly filling up local scratch space on the order of 100 GB, and may start at around the same time.
Also, it's happening in an overlay batch system (i.e. HTCondor startds are running as "jobs" inside another HTCondor cluster). 

So what I'd really need to do would be: 
- In the local batch system, check total disk space and job ClassAds and disallow new jobs if the sum of all RequestDisk is close to the total space.
- In the overlay batch system, check the RequestDisk of the "job / pilot" which contains the condor_startd, and have that START expression become false if the sum of the RequestDisk of the overlay jobs
  becomes close to the total RequestDisk of the "job / pilot". 
Furthermore, the Cron would need to run about once per negotiation cycle (at least!) to catch jobs starting at the same time. 
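
A minimal sketch of the check described above, for illustration only. It assumes the RequestDisk values of the running jobs and the total space have already been collected by some other means and are given in the same unit (KiB here). The function name start_allowed and the 0.9 threshold are hypothetical and not part of HTCondor.

```python
from typing import Iterable


def start_allowed(total_disk_kib: int,
                  running_request_disk_kib: Iterable[int],
                  threshold: float = 0.9) -> bool:
    """Return True while the summed RequestDisk stays below threshold * total.

    total_disk_kib: total scratch space (local case) or the RequestDisk
        of the pilot job (overlay case).
    running_request_disk_kib: RequestDisk of every job already running.
    threshold: hypothetical fraction of the total treated as "close to full".
    """
    committed = sum(running_request_disk_kib)
    return committed < threshold * total_disk_kib


if __name__ == "__main__":
    # Hypothetical numbers: 500 GiB of scratch, three jobs requesting 100 GiB each.
    gib = 1024 * 1024  # KiB per GiB
    print(start_allowed(500 * gib, [100 * gib] * 3))
```

The result would then have to be published so that the START expression can reference it, and it only reflects the state at the moment the script ran, which is why the run interval matters.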

All this feels very much like reimplementing a consumption policy which HTCondor could take care of itself, since it tracks actual disk usage and total space already. 
Should I really program this myself with Cron scripts, or is there a better way to have HTCondor do the same as it does for memory and CPU?