
Re: [HTCondor-users] Efficiency & centralization of global information gathering?



From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Edward Labao

We determine what licenses are in use outside of the farm based on the hostnames. Our farm hosts have a standard naming convention of something like "ix10001" and "ix10002", so it's easy to parse them out of the detailed lmstat output that lists the hostnames using licenses.

 

The spice must flow, after all.
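
(I imagine that boils down to something like the following pipeline - a hand-wavy sketch, since the exact lmstat output format varies by FlexLM version, and the assumption here is that each checked-out license shows up as a ", start" usage line with the hostname in the second field:)

lmstat -a -c "$LM_LICENSE_FILE" | awk '/, start / && $2 !~ /^ix[0-9]+$/ { n++ } END { print n+0 }'

That would count the checkouts held by anything that isn't an ix-numbered farm host.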

 

For a disk-space concurrency limit, at first glance it appears that it would be necessary to probe the job or machine ClassAds and parse out the ConcurrencyLimits attribute of running jobs to ensure that the available-space number takes them into account:

                                                                                

condor_q -constraint '! isUndefined(ConcurrencyLimits)' -af ConcurrencyLimits

volume_1:102400

volume_1:204800

volume_1:307200
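
(For context, those per-job values come from the submit files, with the pool-wide cap declared in the negotiator's configuration; the numbers below are simply the ones from the output above, and the 1024000 cap is a made-up placeholder:)

# in the job's submit description file
concurrency_limits = volume_1:102400

# in the central manager's configuration
VOLUME_1_LIMIT = 1024000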

 

(There's probably even a way to use ClassAd expressions with clever nested splits and stringListSum() to calculate the total number of limit units consumed by running jobs, but I haven't had enough Sapho this morning to work out the details on the fly.)
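
(In the meantime, a plain shell pipeline over the same condor_q output does the arithmetic; a rough sketch, with "volume_1" being just the example limit name from above:)

condor_q -constraint '! isUndefined(ConcurrencyLimits)' -af ConcurrencyLimits \
    | tr ',' '\n' | awk -F: '$1 == "volume_1" { total += $2 } END { print total + 0 }'

For the three jobs above that prints 614400, i.e. 600MB if the units are taken as kilobytes.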

 

But I'm not sure that adding these claims to the actual free disk space is the right approach.

 

These three jobs have requested a total of 600MB from volume_1. However, unlike the FlexLM example, there's no easy way to determine precisely how much disk space a job is currently using out of its claim. If you look at the volume and see 1000MB free, that might be 1000MB left over after all three jobs have written their full 600MB to their volume_1 output directories and are just about to exit, or it might mean only 400MB will remain once the jobs finish running and transferring their outputs. If you set the config concurrency limit to the unadjusted 1000MB free on the volume, then at the outset the negotiator will only allow jobs claiming up to the remaining 400MB of this limit to run.

 

But in the second scenario that 1000MB number will shrink going forward. As the three jobs draw the free disk space down to 400MB, the concurrency limit becomes 400MB while running jobs still have 600MB claimed against it, so the limit sits at -200MB and no new jobs will start, even ones which need only 10MB. If instead you added the running jobs' concurrency limit claims to the detected free disk space, the concurrency limit would go from 1600MB down to 400MB, but then you run the risk of a 500MB job being allowed to start while the current jobs are on their way to leaving only 400MB available.
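
(For concreteness, a rough sketch of how that adjusted limit might be maintained from a cron job on the central manager; the VOLUME_1_LIMIT knob name, the /vol/volume_1 mount point, and the config file path are all placeholders, and the approach carries exactly the double-counting caveat just described:)

#!/bin/sh
# Hypothetical periodic job: set the volume_1 limit to the volume's free
# space plus the units already claimed by running jobs, then reread config.
# (A pool with multiple schedds would want condor_q -global here.)
MOUNT=/vol/volume_1
LIMIT_FILE=/etc/condor/config.d/99-volume1-limit

free_kb=$(df -Pk "$MOUNT" | awk 'NR==2 {print $4}')
claimed_kb=$(condor_q -constraint '! isUndefined(ConcurrencyLimits)' -af ConcurrencyLimits \
    | tr ',' '\n' | awk -F: '$1 == "volume_1" { total += $2 } END { print total + 0 }')

echo "VOLUME_1_LIMIT = $(( free_kb + claimed_kb ))" > "$LIMIT_FILE"
condor_reconfig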

 

The double-counting lasts the duration of the job, and I'm not sure if there's a good way to figure out the difference between the projected use and the current use in the job. We have jobs which write directly to their output directory as well as jobs which use scratch space, so I don't think DiskUsage would be universally helpful in this situation.

 

Thanks again, Edward!

 

            -Michael Pelletier.