
Re: [HTCondor-users] Issue: TotalDisk is not the current amount of the free disk space on the machines

On 2/15/2021 7:31 PM, Carlos Luque wrote:
Hello all,

    I'm looking into an issue with the free disk space detected by the condor_startd daemon. The HTCondor version is 8.8.11, running on GNU/Linux.

I found that the TotalDisk reported for the execute machines is less than the actual free disk space, or vice versa. For example, on one machine TotalDisk is 4529828 KiB, but the actual free disk space is 74357772 KiB. In another case, the free disk space is 4 KiB while the detected TotalDisk is 54742440 KiB. None of the machines was running any job during the check.

Hi Carlos,

HTCondor manages the disk space for job scratch directories.  These directories are created in the subdirectory specified by the EXECUTE config knob (usually /var/lib/condor/execute).  HTCondor assumes that it is the only service using disk space on the volume where the EXECUTE directory lives (enter "condor_config_val execute" to see that path).  If you have other services or users running on your nodes that can use up significant disk space on the same volume where the EXECUTE directory lives, it could cause problems. 

Here at the University of Wisconsin, for example, our execute nodes have a separate disk partition for EXECUTE for exclusive use by HTCondor.

When the HTCondor service is started (specifically, when the condor_startd launches), it examines the free disk space on the volume where EXECUTE lives and publishes that as TotalDisk.  In other words, at startup it does the equivalent of setting TotalDisk to:
   df -k --output=avail `condor_config_val execute`
HTCondor then assumes the available disk it discovered at startup is what it should manage. If something other than HTCondor consumes a lot of space, or frees a lot of space, on the disk volume where EXECUTE lives after HTCondor is started, that could explain the behavior you see above.
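
To check whether this is what happened on a given node, one could compare the advertised TotalDisk with what df reports right now. The following is only a rough sketch, not something HTCondor ships: the function names avail_kib and disk_drift_kib are made up for illustration, and the commented usage assumes the node advertises a single TotalDisk value that can be read with condor_status -af.

```shell
#!/bin/sh
# Sketch only: report how far the current free space on a volume has
# drifted from an advertised TotalDisk value (both in KiB).
# avail_kib and disk_drift_kib are hypothetical helper names.

# Print the KiB currently available on the volume holding directory $1.
avail_kib() {
    # -P forces the POSIX one-line-per-filesystem format;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print current-minus-advertised free space in KiB.
# $1 = advertised TotalDisk (KiB), $2 = currently available space (KiB)
disk_drift_kib() {
    echo $(( $2 - $1 ))
}

# Example usage on an execute node (assumption, not run here):
#   execute_dir=$(condor_config_val execute)
#   total_disk=$(condor_status -af TotalDisk "$(hostname -f)" | head -n 1)
#   disk_drift_kib "$total_disk" "$(avail_kib "$execute_dir")"
```

A large positive result suggests something freed space on the EXECUTE volume after the startd started; a large negative result suggests something other than HTCondor filled it up.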

If you are using static slots, you could try putting the following in the config:

  # Tell the condor_startd to periodically (every ~10 min) update TotalDisk
  # based on available space on the EXECUTE volume.  If this setting is
  # switched back to False (which is the default), then the startd only
  # sets TotalDisk once at startup.
  STARTD_RECOMPUTE_DISK_FREE = True

Setting STARTD_RECOMPUTE_DISK_FREE to True is not recommended with partitionable slots.  And to be honest, no matter what you do, if disk space is tight enough that you need it carefully managed, then you need to ensure nothing else besides jobs managed by HTCondor is reading/writing files on the EXECUTE disk partition. 

More below...

Moreover, the description of the 'Disk' attribute in the Machine ClassAd attributes section says 23000 = 23 MiB. Is the Disk attribute in KiB or kB?

It is the number of bytes divided by 1024.  So by the ISO/IEC 80000 standard it is KiB, and by the JEDEC standard it is KB.
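
As a small worked example of that arithmetic (a sketch only; bytes_to_kib and kib_to_bytes are hypothetical helper names, not HTCondor tools), the first TotalDisk value from your message converts like this:

```shell
#!/bin/sh
# Sketch only: convert between bytes and the 1024-byte units used by
# the Disk and TotalDisk attributes. Helper names are hypothetical.

# $1 = size in bytes; prints the size in KiB (integer division by 1024).
bytes_to_kib() {
    echo $(( $1 / 1024 ))
}

# $1 = size in KiB; prints the size in bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

# Example (not run here):
#   kib_to_bytes 4529828      # TotalDisk from the first machine, in bytes
#   bytes_to_kib 4638543872   # and back to KiB
```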

Could someone give me some hints to figure out this issue about the amount of the free space in the TotalDisk?

Thanks in advance.

Hope the above helps,