
Re: [HTCondor-users] Efficiency & centralization of global information gathering?



Hi there!

For our farm we do something similar for floating licenses (e.g. FlexLM, or sesi for Houdini) in that we have an external process polling the license servers. In our studio, licenses can be used both on the farm and off the farm, where HTCondor can't track them, so it's a little more involved than just parsing for the total available licenses, but basically we come up with a number for how many licenses are either in use or available on the farm and write that to an HTCondor config file on the negotiator host.

The entries in the file (named something like 99_license_limits) look something like:

nuke_LIMIT = 1000
maya_fluid_sim_LIMIT = 200

These basically set up concurrency limits for our licenses. Jobs that need a particular license specify it in their submit description files with a line like:

concurrency_limits = nuke
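
In a complete submit description, that might look something like this (the executable and file names here are just placeholders):

executable = render_frame.sh
arguments = shot_042.nk
concurrency_limits = nuke
log = render.log
queue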

When licenses get used outside of the farm, we adjust the values written to the 99_license_limits file. For example, if we know that 20 of our maya_fluid_sim licenses are being used outside of HTCondor, we update the config file with:

maya_fluid_sim_LIMIT = 180

There's a configuration parameter called NEGOTIATOR_READ_CONFIG_BEFORE_CYCLE that makes the negotiator reread the configuration files before each negotiation cycle so it will have the latest (for some definition of "latest") license limit values before doing any match-making.
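
Enabling that is a one-line config entry on the negotiator host, something like:

NEGOTIATOR_READ_CONFIG_BEFORE_CYCLE = True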

This may be overkill for your license situation, but it seems like it could also be used for your file server throttling; we needed something similar for throttling our NFS servers.

Create a limiter for each filer like:

volume_1_LIMIT = 99999
volume_2_LIMIT = 99999

Under normal circumstances, the value is set to a number higher than the total number of job slots on your farm. When your external script detects that the filer is at capacity or otherwise overloaded, update its value to 1 (I don't remember if 0 is a valid value or not). This prevents any new jobs that require that filer's limit from starting.
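
As a rough sketch, the external monitor can be something as simple as the following Python (the overload check and the file path here are placeholders, not our actual setup):

    #!/usr/bin/env python3
    # Rough sketch of the external monitor: check each filer and rewrite the
    # limits config file on the negotiator host. check_filer_overloaded() and
    # the paths are placeholders.
    import os
    import tempfile

    LIMITS_FILE = "/etc/condor/config.d/99_filer_limits"  # placeholder path
    OPEN_LIMIT = 99999  # higher than the total number of slots on the farm
    FILERS = ["volume_1", "volume_2"]

    def check_filer_overloaded(filer):
        """Placeholder: return True when this filer is at capacity."""
        return False

    def main():
        lines = []
        for filer in FILERS:
            limit = 1 if check_filer_overloaded(filer) else OPEN_LIMIT
            lines.append("%s_LIMIT = %d\n" % (filer, limit))
        # Write atomically so the negotiator never reads a half-written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(LIMITS_FILE))
        with os.fdopen(fd, "w") as f:
            f.writelines(lines)
        os.rename(tmp, LIMITS_FILE)

    if __name__ == "__main__":
        main()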

Full disclosure, however: we didn't use this for very long, because most users had no idea which filers their jobs would access at run time, but maybe you'll have better luck.

In any case, it sounds like you've already got an alternate solution, but just wanted to share what we did for a similar problem.

Cheers!

On Wed, Jan 4, 2017 at 4:11 PM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
Max,

Thanks for that suggestion! For my FlexLM problem, I just wrote a quick Perl script which you call like so:

    flexlm2classad lmstat -a

This runs the "lmstat -a" command (or you can feed it flexlm data on stdin), and converts it into a ClassAd that looks like so:

MyType = "Generic"
Name = "FlexLM"
FlexLM_Available_a_spaceclaim_dirmod = 6
FlexLM_Available_acfd = 3
FlexLM_Available_acfd_flo = 1
FlexLM_Available_agppi = 8
...etc...

The identifier is the feature name for each license, and it's derived from the "Users of" lines like so:

Users of a_spaceclaim_dirmod:  (Total of 8 licenses issued;  Total of 2 licenses in use)
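
The guts of it are a simple pattern match on those lines; a stripped-down Python rendition of the idea (not the actual Perl script, and the regex here is just illustrative) would be something like:

    #!/usr/bin/env python3
    # Python rendition of the idea (the real flexlm2classad is a Perl script):
    # turn the "Users of ..." lines from "lmstat -a" on stdin into a ClassAd.
    import re
    import sys

    USERS_RE = re.compile(
        r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
        r"\s+Total of (?P<inuse>\d+) licenses? in use\)"
    )

    print('MyType = "Generic"')
    print('Name = "FlexLM"')
    for line in sys.stdin:
        m = USERS_RE.search(line)
        if m:
            avail = int(m.group("issued")) - int(m.group("inuse"))
            print("FlexLM_Available_%s = %d" % (m.group("feature"), avail))

(This version only reads lmstat output on stdin, unlike the real script, which can also run the command itself.)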

Then this can be pulled out for use by a startd_cron job or what have you with:

    condor_status -any -constraint 'Name == "FlexLM"' -af:lrng FlexLM_Available_a_spaceclaim_dirmod

I noticed that in the help output for condor_advertise there's a "MERGE_STARTD_AD" option, but it's not mentioned in the man page, and it doesn't seem to let me add an attribute to an existing startd ad even if I structure it as a Query type like the invalidate commands. Maybe someone from the CHTC pantheon can enlighten us on this point.

    -Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Fischer, Max (SCC)
Sent: Wednesday, January 04, 2017 1:59 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

Hi Michael,

I've found this to be best solved outside of Condor.

1. Have a regular cron job *somewhere* fetch the data once.
2. Provide that data via files on shared filesystems.
3. Have startd_cron read from the file.
4. ???
5. Profit

The trick is just to keep 1. and 3. separate. There's no problem with having 1. produce a proper ClassAd already and 3. just use cat.
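
For 3., the hook really can be just cat; the startd_cron config for it looks roughly like this (hook name, file path, and period made up):

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) LICENSEINFO
STARTD_CRON_LICENSEINFO_EXECUTABLE = /bin/cat
STARTD_CRON_LICENSEINFO_ARGS = /shared/condor/licenses.ad
STARTD_CRON_LICENSEINFO_PERIOD = 5m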

Note that using files for 2. is historical laziness on my part: you can just as well publish this information via custom ClassAds. I think condor_advertise with UPDATE_AD_GENERIC should do the trick.
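
Something like this, with the generic ad in a file (file name made up):

    condor_advertise UPDATE_AD_GENERIC licenses.ad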

Cheers,
Max