
Re: [HTCondor-users] Disk space consumability



Am 18.11.19 um 22:42 schrieb Todd Tannenbaum:
> On 11/16/2019 12:52 PM, Oliver Freyermuth wrote:
>> Hi together,
>>
>> we are running into issues with some jobs requiring a lot of disk space, making our execute directories overflow.
>> Those jobs are requesting the necessary disk space via Request_Disk correctly, but the problem arises when multiple of these jobs arrive on a single node (via partitionable slots)
>> since HTCondor does not regard disk space as consumable (even though it is consumed, of course).
>>
>> Does somebody have a good solution at hand for this issue? Is there a hidden knob to make disk space consumable?
>>
>> Cheers,
>> 	Oliver
>>
> 
> Hi Oliver,
> 
> What version of HTCondor are you using?
> 
> Not sure what you mean by "HTCondor does not regard disk space as consumable...", since at least for me with HTCondor v8.8+ with partitionable slots, when a dslot is created with Disk=X, then the partitionable slot has its Disk attribute reduced by X.

Dear Todd,

you are right, as expected!
I've fallen into a series of traps:
- We started out with an 8.6 release affected by the issues you linked, and at that point I heard
  (or read?) that disk space was not correctly limited / "consumed" with pslots. We also saw this,
  but it did not hit us heavily (yet).
- When I set up our initial monitoring and took my first steps with HTCondor, I mistakenly used
  TotalSlotDisk (thinking this would be the disk space for each slot) and TotalDisk (as the actual total)
  instead of looking at TotalDisk vs. Disk, which I should have done (see the condor_status sketch
  right after this list). So I presumed I was monitoring the actual issue, but I was just looking at
  the wrong attributes.
- And finally, the issue only hit us hard(er) since we now let condor_startds run in "pilots" on our
  cluster, and we did not set "Disk" accordingly inside those pilots, so each startd thought it could
  use the full disk (see the config sketch below).
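
A minimal sketch of the query I should have been using (the attribute names are the real ones,
the exact options are just how I would run it):

  # free disk still available in each partitionable slot vs. the machine's total disk
  condor_status -constraint 'PartitionableSlot' -af:h Name Disk TotalDisk TotalSlotDisk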

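For the pilots, I am thinking of something along these lines in the pilot's condor_config
(the numbers are placeholders, not our real values):

  # reserve disk (in MB) for everything outside this pilot, so the startd
  # advertises correspondingly less in its Disk attribute
  RESERVED_DISK = 102400
  # and/or hand the pilot an explicit share of the machine's resources
  SLOT_TYPE_1 = cpus=100%, memory=100%, disk=25%
  SLOT_TYPE_1_PARTITIONABLE = TRUE
  NUM_SLOTS_TYPE_1 = 1
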
I *think* I also saw some jobs overcommitting storage, but that might be because our monitoring was broken. 

So now I have a series of things to fix :-). 
Many thanks for pointing me in the right direction and showing me HTCondor is doing exactly what I want :-). 

Cheers and thanks,
	Oliver


> 
> In other words, on my laptop with ~250GB of free disk space, when I submit the following job 10 times, only one job will run at a time as you would hope:
>    
>    executable = c:\utils\sleep.exe
>    arguments = 30
>    transfer_executable = false                   
>    request_cpus = 1                              
>    request_memory = 20                           
>    request_disk = 200GB                          
>    queue 10                                      
> 
> And periodically running condor_status I see the Disk space in the pslot decrease as expected when the dslot is created:
> 
> condor_status -server
> Name                   OpSys       Arch   LoadAv Memory   Disk      
> slot1@TODDS480S        WINDOWS     X86_64  0.000   16217  244047488
> 
> [then once a job is running]
> 
> condor_status -server
> Name                   OpSys       Arch   LoadAv Memory   Disk      
> slot1@TODDS480S        WINDOWS     X86_64  0.000    16089  34166649
> slot1_1@TODDS480S      WINDOWS     X86_64  0.000      128 209880840
> 
> It looks like you will want to be running HTCondor v8.6.11 or newer for this to work
> properly with partitionable slots, and make sure you did not redefine the 
> condor_config knob STARTD_RECOMPUTE_DISK_FREE away from its default value of false.
> 
> Some developer wisdom/notes on all this is at
>   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6301
> and the derived tickets #6424 and #6614.
>    
> Hope the above helps
> Todd
>   
> 


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
