
Re: [HTCondor-users] Managing dynamic services

> On Nov 21, 2018, at 12:52 PM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
> On 11/16/18 6:57 PM, Stuart Anderson wrote:
>> Is there a standard way for Condor to manage local privileged dynamic services on an execute machine depending on whether any currently matched jobs have requested it?
>> Put another way, instead of matching a job to a machine that already has a service running, what is the best way to handle individual jobs optionally requesting a service that is started by condor when a match occurs and before the condor job starts, along with Condor stopping the service after the last job requesting it has finished running?
>> The context of this question is Linux systems running systemd if there is a natural hook there.
> Stuart:
> This is an interesting question, and I think there's at least one way to do this with HTCondor, but I have a couple of questions to put a finer point on potential solutions.

The background/use case for this is Kafka. In particular, if Condor jobs running on a machine want to access a popular set of messages from Kafka, then a per-machine service should start, receive those data, and perform some common processing for all jobs on the machine. Problems to avoid: 1) loading down the network and the Kafka brokers with thousands of extra copies of messages that will not be used; 2) the inefficiency of multiple jobs on the same execute machine requesting the same messages independently of each other.

> Usually, pools would be configured so this service would be started at boot time on all machines, but HTCondor would limit the number of running jobs that use the service using some mechanism. I assume that you don't want to have the service always running because it uses a non-trivial amount of resources, even when idle.

Correct. I am considering the case where a service has a non-trivial "idle" load and should ideally run if and only if there are local consumers. If my Kafka explanation isn't clear, a similar analogy is IGMP snooping in Ethernet switches, i.e., send multicast traffic to a port if and only if clients on that port have requested that data.

>  If so, we may want to reflect this additional machine load in the startd.  Also, it seems that many jobs on the same machine from different users can use this service without too much additional load on the service.

That is my current use case, i.e., the marginal cost of additional jobs (independent of user) requesting the same service is effectively 0. If that were not true, I think I understand the existing Condor knobs that would let me rank/throttle how many such jobs I would let run on an individual execute machine.

>  Do you want the service to get killed when the last job finishes, or is it ok to have some "grace time", when no jobs are running, in order to encourage a subsequent job to start?

In the Condor spirit of "the more knobs the better", adding an optional dynamic service grace time is probably a good idea in general. However, it is not needed for my current use case.
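
To make the start-on-first-job, stop-after-last-job behavior with an optional grace time concrete, here is a minimal Python sketch of the bookkeeping. It is not an existing HTCondor feature or API: the class name ServiceRefCount, its methods, and the start/stop callbacks are all hypothetical, and how they would be wired into Condor (or systemd) is left open.

```python
class ServiceRefCount:
    """Hypothetical per-machine bookkeeping for one dynamic service."""

    def __init__(self, start, stop, grace=0.0):
        self.start = start        # callback that starts the service (assumed)
        self.stop = stop          # callback that stops the service (assumed)
        self.grace = grace        # seconds to keep the service after the last job
        self.jobs = set()         # ids of running jobs that requested the service
        self.running = False
        self.idle_since = None    # time at which the last consumer finished

    def job_started(self, job_id):
        # The first consumer starts the service; later ones just register.
        if not self.running:
            self.start()
            self.running = True
        self.jobs.add(job_id)
        self.idle_since = None

    def job_finished(self, job_id, now):
        self.jobs.discard(job_id)
        if self.running and not self.jobs and self.idle_since is None:
            self.idle_since = now
        self.tick(now)

    def tick(self, now):
        # Call periodically; stops the service once the grace time has elapsed.
        if (self.running and not self.jobs and self.idle_since is not None
                and now - self.idle_since >= self.grace):
            self.stop()
            self.running = False
            self.idle_since = None
```

With grace=0.0 the service stops as soon as the last requesting job finishes, which matches my current use case. A positive grace keeps it alive in case another job requesting it is matched shortly afterwards.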

> Also, is it sufficient for Condor to start the service before starting the first job, or do we need to block the job from starting until the service is "ready"?

Ideally, job start would block until systemd (or some other start mechanism) acknowledges that the service is running.
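
As a rough illustration of the blocking behavior I have in mind, something like the following Python sketch could run before the job is allowed to start. The function names, the timeout and poll values, and the idea of calling this from some pre-job hook are assumptions on my part, not an existing Condor mechanism. The only external commands used are "systemctl start" and "systemctl is-active --quiet".

```python
import subprocess
import time


def systemctl(*args):
    """Run systemctl with the given arguments and return its exit code."""
    return subprocess.run(["systemctl", *args],
                          stdout=subprocess.DEVNULL).returncode


def start_and_wait(unit, timeout=30.0, poll=0.5,
                   run=systemctl, sleep=time.sleep, clock=time.monotonic):
    """Start a unit and block until it is active or the timeout expires.

    Returns True if the unit became active, False otherwise. The run,
    sleep, and clock parameters exist only so the logic can be tested
    without systemd.
    """
    if run("start", unit) != 0:
        return False
    deadline = clock() + timeout
    while True:
        # "is-active --quiet" exits with 0 only when the unit is active.
        if run("is-active", "--quiet", unit) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

A caller would then refuse to start (or would retry) the job when start_and_wait() returns False. Whether "active" is a strong enough readiness signal depends on how the unit is written, e.g., a Type=notify service reports readiness explicitly.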


Stuart Anderson  anderson@xxxxxxxxxxxxxxxx