
Re: [HTCondor-users] Disk space consumability



Dear Vincent,

thanks for the idea!
Indeed, we are already doing this: a node is marked as "unhealthy" (i.e. its START expression becomes false) if the usage of the pool directory exceeds 80 %.
This is not sufficient for the present case, though: those jobs run for a few hours, slowly filling up local scratch space on the order of 100 GB,
and may start at about the same time.
Also, this is happening in an overlay batch system (i.e. HTCondor startds are running as "jobs" inside another HTCondor cluster).
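
For illustration, a usage-based health check of the kind described above could look roughly like the following Python sketch. It assumes the check runs periodically (e.g. as a startd cron script) and prints a ClassAd-style attribute that the START expression can reference. The attribute name NodeDiskHealthy, the function names and the "." placeholder path are hypothetical and not taken from the setup discussed here.

```python
import shutil


def disk_health_attribute(used_bytes, total_bytes, threshold_percent=80):
    """Return a ClassAd-style line; healthy means usage does not exceed the threshold."""
    # Integer comparison avoids floating-point rounding at the boundary.
    healthy = used_bytes * 100 <= total_bytes * threshold_percent
    return f"NodeDiskHealthy = {healthy}"  # hypothetical attribute name


def check_directory(path):
    """Measure the filesystem holding `path` and report its health."""
    usage = shutil.disk_usage(path)
    return disk_health_attribute(usage.used, usage.total)


if __name__ == "__main__":
    # "." is a placeholder for the execute / scratch directory.
    print(check_directory("."))
```

A check like this only sees space that is already used. Jobs that start at about the same time are all admitted while usage is still low, and the directory can overflow hours later.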

So what I'd really need to do would be: 
- In the local batch system, check total disk space and job classads and disallow new jobs if the sum of all RequestDisk is close to the total space. 
- In the overlay batch system, check the RequestDisk of the "job / pilot" which contains the condor_startd, and have its START expression become false if the sum of the RequestDisk of the overlay jobs
  comes close to the total RequestDisk of the "job / pilot".
Furthermore, the cron job would need to run about once per negotiation cycle (at least!) to catch jobs starting at the same time.
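
The accounting outlined in the two points above can be sketched as follows. This is a minimal Python illustration of the arithmetic only, not an HTCondor API. The function name, the 90 % limit and the example numbers are assumptions, and RequestDisk values are assumed to be given in KiB.

```python
def may_start(total_kib, running_requests_kib, new_request_kib, limit_percent=90):
    """Decide whether one more job fits, based on requested (not used) disk space.

    total_kib:            total space of the node (local batch system) or the
                          RequestDisk of the pilot (overlay batch system)
    running_requests_kib: RequestDisk values of the jobs already running
    new_request_kib:      RequestDisk of the candidate job
    limit_percent:        hypothetical definition of "close to the total"
    """
    committed = sum(running_requests_kib) + new_request_kib
    # Integer comparison avoids floating-point rounding at the boundary.
    return committed * 100 <= total_kib * limit_percent


if __name__ == "__main__":
    gib = 1024 * 1024  # KiB per GiB
    total = 250 * gib  # example value only
    one_running = [100 * gib]
    two_running = [100 * gib, 100 * gib]
    print(may_start(total, one_running, 100 * gib))
    print(may_start(total, two_running, 100 * gib))
```

Because the decision uses the requested rather than the currently used space, a second large job is counted against the total as soon as it is admitted, no matter how little it has written so far.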

All this feels very much like reimplementing a consumption policy which HTCondor could take care of itself, since it already tracks actual disk usage and total space.
Should I really program this myself with cron scripts, or is there a better way to have HTCondor do the same as it does for memory and CPU?

Cheers,
	Oliver

On 17.11.19 at 12:00, Vikrant Aggarwal wrote:
> Hello,
> 
> You may disable the node to not accept new jobs if filesystem utilization reaches certain percentage.
> 
> Use the cron feature of condor to continuously check the available space in filesystem and set the job class AD accordingly and use the expression in START condition to not accept more jobs on the node.
> 
> Thanks & Regards,
> Vikrant Aggarwal
> 
> 
> On Sun, Nov 17, 2019 at 12:23 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     Hi together,
> 
>     we are running into issues with some jobs requiring a lot of disk space, making our execute directories overflow.
>     Those jobs are requesting the necessary disk space via Request_Disk correctly, but the problem arises when multiple of these jobs arrive on a single node (via partitionable slots)
>     since HTCondor does not regard disk space as consumable (even though it is consumed, of course).
> 
>     Does somebody have a good solution at hand for this issue? Is there a hidden knob to make disk space consumable?
> 
>     Cheers,
>         Oliver
> 
>     _______________________________________________
>     HTCondor-users mailing list
>     To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx <mailto:htcondor-users-request@xxxxxxxxxxx> with a
>     subject: Unsubscribe
>     You can also unsubscribe by visiting
>     https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
>     The archives can be found at:
>     https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 
> 


