
Re: [HTCondor-users] Efficiency & centralization of global information gathering?



Wow, thanks for the tip on NEGOTIATOR_READ_CONFIG_BEFORE_CYCLE! I had thought about trying to tweak concurrency limits in the config, but wound up stymied by my lack of awareness of that parameter. It's funny: apparently it's been available since I started using HTCondor 7.8 in 2013, and I'd never noticed it, or at least never noticed it enough to recognize its usefulness.

 

I like the fact that it meshes with the concurrency limits which we'd want people to use for licenses anyway; it actually seems like it's the "correct" way to handle license allocations rather than a startd attribute. I had been puzzling over how often to update the machine ad with the license counts to be sure to catch the negotiation cycle, which is not necessary with this read-config parameter.
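
In other words, the central-manager config ends up needing only a couple of lines; a minimal sketch, with the feature name and count as placeholders only:

    # re-read config files at the start of every negotiation cycle
    NEGOTIATOR_READ_CONFIG_BEFORE_CYCLE = True
    # maintained by the external license-polling script
    acfd_LIMIT = 3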

 

It looks like when you're tweaking the available-license-count concurrency limit, you'd have to omit licenses checked out to HTCondor jobs, otherwise they'd be double-counted. Do you recognize those by hostname or some such, or just add the currently-running concurrency limit count to the available count?
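
I'm picturing something along these lines; the feature name, and the assumption that lmstat reports the same hostnames that condor_status does, are mine:

    # execute hosts currently in the pool
    condor_status -af Machine | sort -u > /tmp/pool_hosts
    # checkouts of the example "acfd" feature already held by pool hosts (running HTCondor jobs)
    pool_held=$(lmstat -f acfd | grep -cFf /tmp/pool_hosts)
    # licenses free everywhere, parsed from the "Users of ...: (Total of X licenses issued; Total of Y licenses in use)" line
    free_now=$(lmstat -f acfd | awk '/Total of/ {print $6 - $11; exit}')
    # what HTCondor may hand out = free right now + already counted against running jobs
    echo "acfd_LIMIT = $(( free_now + pool_held ))"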


For the volume size limits, even though those values have much more "stretchiness," as it were, than application licenses, feeding them into the machine ad would still be a feasible approach, and it would align with the use of requirements and start expressions for disk space. However, it still suffers the limitation of being a point-in-time snapshot at negotiation: if 100 jobs all looked at the current available space and saw it above the threshold, but that threshold could only support 50 sets of output, they might all start at the same time and crush the volume anyway. Using a concurrency limit instead would take the running jobs' output space requests into account:

 

Concurrency_limits = volume_1:5242880

 

You'd define the limit using kilobytes to match the units of request_disk, and this job would claim 5GB out of the volume_1 space. The value would probably be lower than the request_disk number for jobs which use scratch space and output transfers; some jobs we run have a scratch-space high-water mark that's nearly double the final output. If the output is compressed before transfer, that would also reduce the concurrency limit request.
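
As an illustration, a submit description fragment along those lines might look like this; the 5GB-scratch / 3GB-output split and the script name are made up:

    executable         = run_sim.sh
    # 5GB of scratch on the execute node, in KiB to match request_disk's units
    request_disk       = 5242880
    # only ~3GB of final output lands on volume_1, so charge that against the limit
    concurrency_limits = volume_1:3145728
    queue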

 

I've got my pool-wide configuration set up in a shared NFS path in /home/condor/config/config.d, so it'd be easy enough to drop the limit data in there, but I suppose the only place it would be needed is on the CM server, so it could just go into /etc/condor/config.d there. Is that the approach you take? There's no need for any other daemons to look at <NAME>_LIMIT values, right?

 

I think in most of our pools we'd have a good bead on what the target volumes will be; most of them don't have a wide variety of users, so the output typically goes to the same big volume every time, and we're okay there. Maybe in your case you could cook up a submit expression that looks at the Iwd or output directory of the job with a regexp and, if it spots a covered filesystem, applies the limit:

 

Concurrency_limit = ifThenElse(regexp(Iwd, "^/volume_1"), concat("volume_1:", $(request_disk)), "")

 

Probably not really practical (or syntactically correct), but at least it's an interesting example of the power of ClassAd expressions. :-)
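
That said, something close to it might actually work via concurrency_limits_expr, if your HTCondor version has it; just a sketch, with regexp() taking the pattern first, strcat() in place of concat(), and the job's RequestDisk attribute instead of the submit macro:

    concurrency_limits_expr = ifThenElse(regexp("^/volume_1", Iwd), strcat("volume_1:", string(RequestDisk)), "")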

 

                -Michael Pelletier.

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Edward Labao
Sent: Wednesday, January 04, 2017 8:31 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

 

Hi there!

For our farm we do something similar for floating licenses (e.g. FlexLM, or SESI for Houdini) in that we have an external process polling the license servers. In our studio, licenses can be used both on and off the farm, where condor can't track them, so it's a little more involved than just parsing out the total available licenses, but basically we come up with a number of how many licenses are either in use or available on the farm, and write that to a condor config file on the negotiator host.

The entries in the file (named something like 99_license_limits) look something like:

nuke_LIMIT = 1000

maya_fluid_sim_LIMIT = 200

These basically set up concurrency limits for our licenses. Jobs that will need to use a particular license specify them in their submission description files with a line like:

concurrency_limits = nuke

When licenses get used outside of the farm, we adjust the values written to the 99_license_limits file. For example, if we know that 20 of our maya_fluid_sim licenses are being used outside of HTCondor, we update the config file with:

maya_fluid_sim_LIMIT = 180
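
A rough sketch of what that updater might look like on the negotiator host; the counting command is a placeholder for whatever query fits your license server, and the write-then-rename keeps the negotiator from ever reading a half-written file:

    # run from cron every few minutes
    limit=$(count_maya_fluid_sim_for_farm)   # hypothetical helper, not a real tool
    printf 'maya_fluid_sim_LIMIT = %s\n' "$limit" > /etc/condor/config.d/99_license_limits.tmp
    mv /etc/condor/config.d/99_license_limits.tmp /etc/condor/config.d/99_license_limits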

There's a configuration parameter called NEGOTIATOR_READ_CONFIG_BEFORE_CYCLE that makes the negotiator reread the configuration files before each negotiation cycle so it will have the latest (for some definition of "latest") license limit values before doing any match-making.

This may be overkill for your license situation, but it seems like it could probably be used for your file server throttling, too. We needed something similar for throttling our NFS servers.

Create a limiter for each filer like:

volume_1_LIMIT = 99999

volume_2_LIMIT = 99999

Under normal circumstances, the value is set to a number higher than the total number of job slots on your farm. When your external script detects that the filer is at capacity or otherwise overloaded, update the values to 1 (I don't remember if 0 is a valid value or not). This prevents any new jobs requiring the filer limits from starting.
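
The external check can be as simple as something like this, assuming the volumes are mounted on the host running the script and that 90% full counts as "at capacity" (both of those are just example choices):

    # run periodically; rewrite the limits based on how full each volume is
    for vol in volume_1 volume_2; do
        pct=$(df -P "/${vol}" | awk 'NR==2 {sub("%","",$5); print $5}')
        if [ "$pct" -ge 90 ]; then
            echo "${vol}_LIMIT = 1"
        else
            echo "${vol}_LIMIT = 99999"
        fi
    done > /etc/condor/config.d/99_filer_limits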

Full disclosure, however, we didn't use this for very long because most users had no idea what filers their jobs would access at run time, but maybe you'll have better luck.

In any case, it sounds like you've already got an alternate solution, but just wanted to share what we did for a similar problem.

Cheers!


On Wed, Jan 4, 2017 at 4:11 PM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

Max,

Thanks for that suggestion! For my FlexLM problem, I just wrote a quick Perl script which you call like so:

        flexlm2classad lmstat -a

This runs the "lmstat -a" command (or you can feed it flexlm data on stdin), and converts it into a ClassAd that looks like so:

MyType = "Generic"
Name = "FlexLM"
FlexLM_Available_a_spaceclaim_dirmod = 6
FlexLM_Available_acfd = 3
FlexLM_Available_acfd_flo = 1
FlexLM_Available_agppi = 8
...etc...

The identifier is the feature name for each license, and it's derived from the "Users of" lines like so:

Users of a_spaceclaim_dirmod:  (Total of 8 licenses issued;  Total of 2 licenses in use)

Then this can be pulled out for use by a startd_cron job or what have you with:

        condor_status -any -constraint 'Name == "FlexLM"' -af:lrng FlexLM_Available_a_spaceclaim_dirmod
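
Presumably the way to land that ad in the collector in the first place, per your UPDATE_AD_GENERIC suggestion, would be something like this (untested):

    flexlm2classad lmstat -a | condor_advertise UPDATE_AD_GENERIC -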

I noticed that in the help output for condor_advertise, there's a "MERGE_STARTD_AD" option, but it's not mentioned in the man page and it doesn't seem to let me add an attribute to an existing startd ad even if I structure it as a Query type like the invalidate commands. Maybe someone from the CHTC pantheon can enlighten us on this point.

        -Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Fischer, Max (SCC)
Sent: Wednesday, January 04, 2017 1:59 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

Hi Michael,

I've found this to be best solved outside of Condor.

1. Have a regular cron job *somewhere* fetch the data once.
2. Provide that data via files on shared filesystems.
3. Have startd_cron read from the file.
4. ???
5. Profit

The trick is just to keep 1. and 3. separate. There's no problem with having 1. produce a proper ClassAd already and 3. just use cat.
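
For step 3, a minimal startd_cron sketch could look like the following; the hook name, period, and shared-filesystem path are arbitrary:

    STARTD_CRON_JOBLIST = LICENSES
    # just cat the ClassAd attribute file that the cron job from step 1 maintains
    STARTD_CRON_LICENSES_EXECUTABLE = /usr/bin/cat
    STARTD_CRON_LICENSES_ARGS = /home/condor/config/license_counts.ad
    STARTD_CRON_LICENSES_PERIOD = 5m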

Note that using files for 2. is historical laziness on my part:
You can just as well publish this information via custom ClassAds. I think condor_advertise with UPDATE_AD_GENERIC should do the trick.

Cheers,
Max
