
Re: [HTCondor-users] condor and docker - advise for a newbie



On Mon, Dec 11, 2017 at 4:17 PM, Ian McEwen <mian@xxxxxxxxxxx> wrote:
> On Mon, Dec 11, 2017 at 12:40:56PM -0500, Larry Martell wrote:
>> Just getting started with condor and I am looking for some guidance.
>>
>> Currently I have 2 docker containers that are linked to each other.
>> One has a crontab that runs many jobs throughout the day. Many of
>> these jobs are multithreaded and/or fork off other processes. The jobs
>> require services from both its own and the other container. My goal is
>> to use HTCondor to distribute these jobs, threads, and forked
>> processes across multiple machines. From reading the docs I think I
>> need the docker universe for this. Is that correct? But how can I have
>> condor start up both containers? Is it possible to already have the
>> containers running on the remote hosts and have condor invoke the jobs
>> inside them?
>>
>
> Hello!
>
> I believe that the docker universe is probably unsuitable for this use
> case, but it should be possible to do what you want by way of vanilla
> universe jobs -- with the caveat that HTCondor's resource tracking will
> probably not work as you expect. It may also be possible to run HTCondor
> startds within your existing containers as a way of scheduling jobs to
> them.
>
> First, re: the docker universe. By design, it does not expose every
> potential feature of Docker; it's designed to be a way of specifying an
> environment to run a job in, and a way to isolate that job from the
> surrounding host, and not really more. Notably for your use case, it
> does not (as far as I'm aware) support docker's links or networking
> features, nor would it allow running jobs inside an already-running
> container. Basically, it's a good way to specify that you want the job
> to run on Debian with X, Y, and Z packages installed, but not to specify
> connected network resources, other processes, etc.
>
> On to the parts which might help solve your case:
>
> * Use the vanilla universe, but sacrifice HTCondor's resource tracking:
>   You can run a vanilla universe job and write a script that calls out
>   to 'docker run', 'docker exec', etc., so long as the user the job will
>   run as is allowed to run docker. If you wanted to have the job start
>   up the prerequisite containers, it could do so in the script, or you
>   could set up your nodes to have the containers already running and
>   then use 'docker exec' to run things within the containers. However,
>   only the actual 'docker run' or 'docker exec' process (and thus not
>   the containers themselves or the processes being run within them) will
>   fall within HTCondor's jurisdiction, due to how Docker works. There are
>   some funny potential ways to change this which probably aren't that
>   advisable unless you're really attached to having HTCondor's resource
>   tracking work as expected. (Specifically, if anyone needs to go down
>   this road: with 'docker run' you can pass a cgroup parent, so with
>   HTCondor cgroup-based tracking you can determine the parent script's
>   cgroup (the htcondor-created one) and pass it as the parent to the
>   docker container. However, you need to also pass down the resource
>   constraints, probably slightly smaller than the slot -- if not, the
>   wrapper script will get killed off but the container will persist,
>   from the testing of this approach I've done)
> * Run a startd inside the container:
>   Instead of using a script from outside the container to run things
>   within the container, you could instead run HTCondor itself inside a
>   container where the environment you want is available, and have your
>   jobs be routed there. To do so, you'd need to construct an appropriate
>   configuration file -- most likely, you would turn on the shared port
>   daemon, expose its port to the outside world when running the docker
>   container, and use TCP_FORWARDING_HOST to specify the surrounding
>   host's IP as the appropriate place to connect to. If you're running
>   more than these jobs in your HTCondor cluster, you'll probably want to
>   add a STARTD_ATTR
>   (http://research.cs.wisc.edu/htcondor/manual/current/3_5Configuration_Macros.html#22879)
>   which identifies these special slots as inside the docker container,
>   and add that as a requirement on your job, and set up the START
>   expression of these slots to refuse jobs which don't explicitly
>   request them.
>
> Hopefully what I'm saying makes sense. The first option is most likely
> easier to implement, and the second is arguably cleaner but more finicky
> to set up.
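
A minimal, untested sketch of the first option (a vanilla universe job whose
executable is a wrapper around 'docker exec'). The container name
"jobs_container" is hypothetical, and the helper names are made up for
illustration. The cgroup helpers assume cgroup v1 and Docker's cgroupfs
driver; they only show the shape of the cgroup-parent idea described above,
not a tested recipe.

```python
#!/usr/bin/env python3
"""Wrapper sketch: run a job's command inside a Docker container."""
import subprocess
import sys

# Hypothetical name of a container already running on every execute node.
CONTAINER = "jobs_container"


def docker_exec_argv(container, command):
    """Build the argv for `docker exec <container> <command...>`."""
    if not command:
        raise ValueError("no command given to run inside the container")
    return ["docker", "exec", container] + list(command)


def own_cgroup(proc_cgroup_text, controller="memory"):
    """Return the cgroup path for one controller, or None if not found.

    Expects the cgroup v1 format of /proc/self/cgroup, where each line
    looks like `id:controllers:path`.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and controller in parts[1].split(","):
            return parts[2]
    return None


def docker_run_argv(image, command, cgroup_parent, memory_mb, cpus):
    """Build a `docker run` argv that nests the container under a cgroup.

    The limits should be slightly smaller than the slot's, so that the
    container hits its own limit before the wrapper's cgroup does.
    """
    return [
        "docker", "run", "--rm",
        "--cgroup-parent", cgroup_parent,
        "--memory", "%dm" % memory_mb,
        "--cpus", str(cpus),
        image,
    ] + list(command)


def main(argv):
    # Return the in-container command's exit status so HTCondor records it.
    return subprocess.call(docker_exec_argv(CONTAINER, argv[1:]))


# Only run when invoked as a script with a command to execute.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The job's submit file would name this script as the executable and pass the
in-container command as its arguments. Only the 'docker exec' client process
is visible to HTCondor in that case, as noted above.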

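For the second option, the surrounding host's IP has to be written into the
container's HTCondor configuration at startup, so a small generator is one
way to do it. This is an untested sketch: the function name, the attribute
names InDockerContainer and WantsDockerContainer, and the choice of port
9618 are assumptions for illustration. The configuration macros themselves
(USE_SHARED_PORT, SHARED_PORT_PORT, TCP_FORWARDING_HOST, STARTD_ATTRS,
START) are standard HTCondor knobs, but check them against the manual for
your version.

```python
def render_startd_config(central_manager, host_ip, shared_port=9618):
    """Return HTCondor config text for a startd running inside a container.

    central_manager: hostname of the pool's central manager.
    host_ip: IP of the host surrounding the container, which peers connect to.
    shared_port: port the shared port daemon listens on (publish it with
                 `docker run -p <port>:<port>`).
    """
    lines = [
        "CONDOR_HOST = %s" % central_manager,
        "DAEMON_LIST = MASTER, STARTD",
        # Funnel all inbound connections through one published port.
        "USE_SHARED_PORT = True",
        "SHARED_PORT_PORT = %d" % shared_port,
        # Advertise the surrounding host's address instead of the container's.
        "TCP_FORWARDING_HOST = %s" % host_ip,
        # Hypothetical attribute marking these slots as in-container.
        "InDockerContainer = True",
        "STARTD_ATTRS = $(STARTD_ATTRS) InDockerContainer",
        # Refuse jobs that do not explicitly ask for these slots.
        "START = (TARGET.WantsDockerContainer =?= True)",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; replace with the real manager and host address.
    print(render_startd_config("cm.example.org", "192.0.2.10"), end="")
```

A matching job would then carry "+WantsDockerContainer = True" and
"requirements = (InDockerContainer =?= True)" in its submit file.
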
Thanks so much Ian for the very detailed reply. My central manager
machine has 24 processors, but the 2 machines I want to distribute
jobs across have 176 each. I want to take advantage of all this CPU
power and run as many threads and forked processes as possible. Given
that, what configuration would you recommend?