Re: [HTCondor-users] Efficiency & centralization of global information gathering?


I guess the suitability of a given approach, as always, depends on one's priorities. :)

I'm glad you'll find that scrap of Perl useful! We've got the same sort of opportunistic resources in one of our pools here too, so I was motivated to whip something up for it.

With respect to disk space, I suppose it should be possible to have an update_job_info hook do a quick assessment of the job's current disk space utilization in its scratch directory and its final output location (if that's on a shared or otherwise accessible filesystem, as opposed to an output_destination URL), and update the job ClassAd with that information once every five minutes (per STARTER_UPDATE_INTERVAL). You could then incorporate that job attribute into the concurrency limit for the disk volume, with any necessary adjustments based on DiskUsage for output-transfer jobs and on RequestDisk. This could also allow better handling of jobs which have exceeded their RequestDisk amount.
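
As a rough sketch of the measurement step only: the Python below totals the apparent size of the files under a directory and formats the result as a "Name = value" line. The attribute name ScratchDiskUsageKb and both function names are made up for illustration, and how the value actually gets into the job ad (chirp, condor_qedit, or something else) is left open here. Note that it sums file sizes rather than allocated blocks, so it can differ from what "du" reports.

```python
import os

def directory_usage_kb(path):
    """Return the total apparent size of files under path, in KiB (rounded up)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # lstat so a symlink is counted as a link, not followed
                total_bytes += os.lstat(full).st_size
            except OSError:
                # The file vanished between listing and stat; skip it
                continue
    return (total_bytes + 1023) // 1024

def usage_attribute(scratch_dir, attr_name="ScratchDiskUsageKb"):
    """Format the measurement as a 'Name = value' line (hypothetical attribute name)."""
    return "%s = %d" % (attr_name, directory_usage_kb(scratch_dir))
```

A hook or wrapper could call usage_attribute() on the job's scratch directory every few minutes and hand the resulting line to whatever update mechanism is in use.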

I reckon that a "du" on a given directory in a separate hook process once every few minutes will be quicker than a SystemTap and 10-15% drag on the primary process. Of course, if you've got some other update_job_info or other hooks on the job, you wind up with a bit of extra work.

Speaking of runtime updates, I wonder if it's possible to claim a concurrency limit by updating the running job's ClassAd to specify one?

For example, if your job doesn't need a license checkout until four hours into a six-hour run, can a hook (or a step in the job's script) check or wait for an available limit and then chirp a ConcurrencyLimits string into its own ad such that the negotiator will recognize it as claimed? As far as I can tell without looking at the C++, the negotiator does roughly the same thing as the Perl script when assessing the current status of the limits. This may be easier to deploy in some situations than converting a job into a DAG.
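
To make the "chirp a ConcurrencyLimits string" step concrete, here is a minimal Python sketch that builds the condor_chirp set_job_attr command line. It assumes condor_chirp is on the job's PATH and that the job was submitted with the I/O proxy enabled (+WantIOProxy = true); the function name is made up. Whether the negotiator then treats the limit as claimed is exactly the open question above, so this shows only the mechanics of the update.

```python
import subprocess

def chirp_concurrency_limits(limits, run=False):
    """Build (and optionally run) a condor_chirp call that sets ConcurrencyLimits.

    limits: list of (name, count) pairs, e.g. [("app_license", 1)].
    """
    # A count of 1 is the default, so it can be left off the limit name
    parts = [name if count == 1 else "%s:%d" % (name, count)
             for name, count in limits]
    # The attribute value must be a ClassAd string literal, hence the embedded quotes
    value = '"%s"' % ",".join(parts)
    argv = ["condor_chirp", "set_job_attr", "ConcurrencyLimits", value]
    if run:
        # Only works from inside a running job with the chirp proxy available
        subprocess.check_call(argv)
    return argv
```

Called with run=False it just returns the argument list, which is handy for checking the quoting before trying it inside a real job.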

And thanks for that NVL tip. It's a bit of a trick to keep the distinctions and overlap between ClassAd, config, and submit description syntax straight sometimes, especially when coffee-deprived.

-Michael Pelletier.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Edward Labao
Sent: Monday, January 09, 2017 10:25 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

Hi Michael,

I'm afraid you're right. Concurrency limits seem to fall flat when 1) what you really want is a resource reservation mechanism and 2) jobs are making permanent changes to the resource. Without some really good instrumentation on the jobs or filers, there's no way to accurately know how much of that space was written to by jobs in the farm as opposed to processes outside the farm. And as you've mentioned, without that information, you either end up double-counting resource consumption for the life of the job and under-utilizing your resource, or, if you don't try to adjust, you end up with over-allocation.

I've looked into instrumenting farm jobs to collect I/O information at the kernel level using SystemTap, but that introduced too much overhead and resulted in 10-15% longer runtimes in my tests. I'm not sure if there's a good way to handle disk space reservations for shared filesystems in HTCondor without a significant amount of extra engineering.

Your Perl script would be really useful for us, since what I didn't mention before was that at night, our user desktop systems automatically get added to the farm pool (and removed again in the morning). Having a more deterministic way of knowing what licenses are being used by jobs would help a lot. Unfortunately, our license servers are somewhat isolated VMs and don't have HTCondor installed. :( Fortunately, we don't get too many "out of license" errors and we have systems in place to automatically retry jobs that encounter them.

We use a lot of ifThenElse expressions in our configs, and I believe you can create NVL-like syntax in HTCondor configs with them like:

SOME_PARAM = ifThenElse((SOME_OTHER_PARAM=!=UNDEFINED), SOME_OTHER_PARAM, 1)

If SOME_OTHER_PARAM is defined, SOME_PARAM will be assigned its value. If SOME_OTHER_PARAM is not defined, SOME_PARAM=1.

On Mon, Jan 9, 2017 at 3:58 PM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

Here's an alternative to matching hostnames in lmstat output:

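#!/usr/bin/perl
use strict;
use warnings;

# Collect the ConcurrencyLimits of all running jobs (JobStatus == 2) as quoted list elements.
my $running_claims = qx( condor_q -constraint 'JobStatus == 2 && ! isUndefined(ConcurrencyLimits)' -format '%v' 'split(ConcurrencyLimits, ", ")' );

# Pull each quoted "name" or "name:count" element out of the output.
my @running_claims = $running_claims =~ ( m{"([^"]+)"}g );

my %limit;
for (@running_claims) {
    my ($name, $count) = split(':');
    # A limit given without a count claims one unit.
    $count = (defined($count) && length($count)) ? $count : 1;
    $limit{$name} += $count;
}

# Emit one "<name>_condor_used = N" config line per limit in use.
for my $key (keys(%limit)) {
    print "${key}_condor_used = $limit{$key}\n";
}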

Then you can set the limit like so:

App_license_lmstat_available = <pulled from lmstat>
App_license_condor_used = <pulled from condor_q above>
App_license_limit = $(App_license_lmstat_available) + $(App_license_condor_used)

I'm trying to remember if there's a mechanism in the configuration where you can say "$(value:0)" and get 0 if value is undefined, rather than using an if/endif block, but I'm not finding it offhand. So whatever's generating the config file would need to take that into account, since the code above will only produce output for limits which are in current use.
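
One way the config generator could cover that gap is to fill in the zero itself. This is a hedged Python sketch rather than anything from an actual deployment: both dictionaries are assumed inputs (one parsed from lmstat, one from the script's "_condor_used" output), and the function name is made up.

```python
def limit_config_lines(lmstat_available, condor_used):
    """Return '<name>_limit = N' lines, treating a missing condor count as 0.

    lmstat_available: {limit name: licenses lmstat reports as free}
    condor_used:      {limit name: units claimed by running jobs}
    """
    lines = []
    for name in sorted(lmstat_available):
        # Limits with no running jobs never show up in condor_used
        used = condor_used.get(name, 0)
        lines.append("%s_limit = %d" % (name, lmstat_available[name] + used))
    return lines
```

For instance, limit_config_lines({"app_license": 3}, {}) yields a single "app_license_limit = 3" line even though no running job currently claims that limit.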

                -Michael Pelletier.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/