
Re: [HTCondor-users] Issue: TotalDisk is not the current amount of the free disk space on the machines

On 2/25/21 1:12 PM, Todd Tannenbaum wrote:
On 2/15/2021 7:31 PM, Carlos Luque wrote:
Hello all,

    I'm looking into an issue with the free disk space detected by the condor_startd daemon. The HTCondor version is 8.8.11, running on GNU/Linux.

I checked and found that TotalDisk on the execute machines does not match the current free disk space: sometimes it is lower, sometimes higher. For example, on one machine TotalDisk is 4529828 KiB, but the current free disk space is 74357772 KiB. On another, the current free disk space is 4 KiB while the detected TotalDisk is 54742440 KiB. None of the machines was running any job during the check.

Hi Carlos,

HTCondor manages the disk space for job scratch directories. These directories are created in the subdirectory specified by the EXECUTE config knob (usually /var/lib/condor/execute). HTCondor assumes that it is the only service using disk space on the volume where the EXECUTE directory lives (enter "condor_config_val execute" to see that path). If you have other services or users running on your nodes that can use up significant disk space on the same volume where the EXECUTE directory lives, it could cause problems.

Here at the University of Wisconsin, for example, our execute nodes have a separate disk partition for EXECUTE for exclusive use by HTCondor.

When the HTCondor service is started (specifically, when the condor_startd launches), it examines the free disk space on the volume where EXECUTE lives and publishes that as TotalDisk. In other words, at startup it does the equivalent of setting TotalDisk to:
   df -k --output=avail `condor_config_val execute`
HTCondor then assumes that the available disk space it discovered at startup is what it should manage. If something other than HTCondor consumes a lot of space, or frees a lot of space, on the disk volume where EXECUTE lives after HTCondor is started, that could explain the behavior you see above.
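
As a rough illustration of the check described above, here is a minimal Python sketch that reads the space currently available on a volume in KiB and compares it with an advertised TotalDisk value. It is not HTCondor's own code: the function names are hypothetical, and the EXECUTE path is the usual default mentioned earlier, assumed here rather than read from the configuration.

```python
import os
import shutil


def free_kib(path: str) -> int:
    """Space available on the volume holding `path`, in KiB (bytes // 1024)."""
    return shutil.disk_usage(path).free // 1024


def drift_kib(advertised_total_disk_kib: int, current_free_kib: int) -> int:
    """Current free space minus the advertised value.

    Positive: space was freed since the startd started.
    Negative: something consumed space since then.
    """
    return current_free_kib - advertised_total_disk_kib


if __name__ == "__main__":
    # Assumption: the usual default EXECUTE path; check yours with
    # "condor_config_val execute". Fall back to "." if it does not exist.
    execute_dir = "/var/lib/condor/execute"
    path = execute_dir if os.path.isdir(execute_dir) else "."
    print("currently available (KiB):", free_kib(path))

    # The first pair of numbers reported in this thread.
    print("drift (KiB):", drift_kib(4529828, 74357772))
```

A large drift in either direction suggests that something other than HTCondor changed the free space on that volume after the startd started.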

If you are using static slots, you could try putting the following in the config:

 # Tell the condor_startd to periodically (every ~10 min) update TotalDisk
 # based on available space on the EXECUTE volume. If this setting is
 # switched back to False (which is the default), then the startd only
 # sets TotalDisk once at startup.
 STARTD_RECOMPUTE_DISK_FREE = True

Setting STARTD_RECOMPUTE_DISK_FREE to True is not recommended with partitionable slots. And to be honest, no matter what you do, if disk space is tight enough that you need it carefully managed, then you need to ensure nothing else besides jobs managed by HTCondor is reading/writing files on the EXECUTE disk partition.

More below...

Hello Todd,

    Thanks so much for your reply. The information is very valuable, and now I understand the behavior of TotalDisk.

Our machines have a large amount of disk space, but the user applications also consume a large amount of it.

Is it possible to increase the update period, for example to one hour or one day? Every 10 minutes is far too short a period for our purposes.

Moreover, the description of the 'Disk' attribute in the Machine ClassAd attributes section says 23000 = 23MiB. Is the unit of the Disk attribute KiB or kB?

It is the number of bytes divided by 1024. So by the ISO/IEC 80000 standard it is KiB, and by the JEDEC standard it is KB.
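
To make the unit concrete, here is a small Python sketch of the arithmetic, using the value 23000 quoted from the manual above. The helper names are illustrative only and are not part of HTCondor.

```python
def kib_to_bytes(kib: int) -> int:
    """The attribute is bytes divided by 1024, so multiply back to get bytes."""
    return kib * 1024


def bytes_to_mib(n_bytes: int) -> float:
    """Binary prefix: 1 MiB = 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


def bytes_to_mb(n_bytes: int) -> float:
    """Decimal prefix: 1 MB = 1000 * 1000 bytes."""
    return n_bytes / (1000 * 1000)


if __name__ == "__main__":
    disk = 23000                    # value of the Disk attribute
    n_bytes = kib_to_bytes(disk)    # 23552000 bytes
    print(n_bytes, bytes_to_mib(n_bytes), bytes_to_mb(n_bytes))
```

Under this reading, a Disk value of 23000 is 23,552,000 bytes, which is about 22.46 MiB with binary prefixes or about 23.55 MB with decimal prefixes.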

Could someone give me some hints to figure out this issue about the amount of the free space in the TotalDisk?

Thanks in advance.

Hope the above helps,

Best regards,

Carlos Luque
Postdoc researcher - EuroCC - Specialized Informatics Services
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n - 38200 - La Laguna, Tenerife, Spain
Tel: +34 922 605 200 Ext. 5547
EuroCC Spain: http://eurocc-spain.res.es
EuroCC: https://www.eurocc-access.eu
SIE-IAC: http://research.iac.es

DISCLAIMER: This message may contain confidential and / or privileged information. If you are not the final recipient or have received it in error, please notify the sender immediately. Any unauthorized use of the content of this message is strictly prohibited. More information:  https://www.iac.es/en/disclaimer