
Re: [HTCondor-users] Submitting to a remote condor queue



On 08/02/2016 16:37, Todd Tannenbaum wrote:
> Consider an alternative similar to the following:
>
> Do a volume mount so that your container and your host share some subdirectory on the host file system. In this subdirectory, create a "runme" directory where your container will atomically write out DAG files along with their corresponding submit files. Meanwhile, on the host schedd, have a local universe job (submitted by whatever user you choose) that periodically scans the "runme" directory for .dag files, submits them, and then renames the submitted .dag file to .dag.submitted.jobX.Y.
>
> This way you do not need to do any reconfiguration of your host schedd, you don't need to have any trust relationships between your container and your host schedd, and you don't need to pay any extra overhead of having HTCondor move files in and out of the container via file transfer.
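For concreteness, a minimal sketch of such a scanner (in Python) might look like the following. The shared path, poll interval and rename suffix are my assumptions; it simply shells out to condor_submit_dag and pulls the cluster id out of its condor_submit-style "submitted to cluster N" output.

#!/usr/bin/env python
# Minimal sketch of the "runme" scanner, intended to run as the local universe
# job on the host schedd.  Adjust RUNME_DIR to the host side of the volume mount.
import os
import re
import subprocess
import time

RUNME_DIR = "/shared/runme"     # assumed host-side path of the shared volume
POLL_SECONDS = 30

def submit_dag(dag_path):
    """Run condor_submit_dag on one DAG and return the cluster id (or None)."""
    out = subprocess.check_output(
        ["condor_submit_dag", os.path.basename(dag_path)],
        cwd=os.path.dirname(dag_path),
        universal_newlines=True)
    m = re.search(r"submitted to cluster (\d+)", out)
    return m.group(1) if m else None

while True:
    for name in sorted(os.listdir(RUNME_DIR)):
        if not name.endswith(".dag"):
            continue
        dag = os.path.join(RUNME_DIR, name)
        try:
            cluster = submit_dag(dag)
        except subprocess.CalledProcessError:
            continue            # leave the file in place; retry on the next pass
        # Rename so the DAG is not picked up again on the next scan.
        suffix = ".submitted.job%s.0" % cluster if cluster else ".submitted"
        os.rename(dag, dag + suffix)
    time.sleep(POLL_SECONDS)

In practice this script itself would be the local universe job Todd mentions (or a cron job), and because the container writes the DAG and its submit files atomically, the scanner can safely act as soon as the .dag file appears.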

Thank you - I was sort-of coming to that conclusion myself. Indeed, if the container only wants to fire-and-forget a single DAG, I can just run the container and wait for it to exit; if it exits with success (rc=0) then I pick up and submit the .dag file that it wrote. This does mean that the container is only responsible for the preparation of the job, not for managing its lifecycle. So for example, it can't do any post-processing actions when the DAG completes.
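That fire-and-forget variant could be as simple as the sketch below, assuming the container writes exactly one .dag file (plus its submit files) into a bind-mounted work directory before it exits; the image name and paths are placeholders.

import glob
import os
import subprocess

workdir = "/scratch/job-0001"        # host directory bind-mounted into the container
image = "example/dag-preparer"       # hypothetical image that prepares the DAG

# Run the container to completion; check_call raises if it exits non-zero,
# which gives the "only submit on rc=0" behaviour described above.
subprocess.check_call(
    ["docker", "run", "--rm", "-v", "%s:/work" % workdir, image])

# Pick up the DAG the container wrote and submit it on the host schedd.
dags = glob.glob(os.path.join(workdir, "*.dag"))
if len(dags) != 1:
    raise RuntimeError("expected one .dag file in %s, found %d" % (workdir, len(dags)))
subprocess.check_call(["condor_submit_dag", os.path.basename(dags[0])], cwd=workdir)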

Alternatively, if I submit jobs in the way you suggest (by polling for drop files), then the container can carry on running and check for DAG completion itself, e.g. by polling the node status file. It seems pretty crude to use the filesystem in this way instead of proper condor APIs, but it should be functional.
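A container-side completion check could poll that file along these lines. This assumes the DAG declares NODE_STATUS_FILE so DAGMan keeps the status file in the shared directory, and that the ClassAd-format status file carries a DAG-level DagStatus line annotated with STATUS_DONE / STATUS_ERROR; the exact markers depend on the HTCondor version, so check what yours actually writes.

import time

STATUS_FILE = "/work/mydag.status"   # path as seen from inside the container (assumed)
POLL_SECONDS = 60

def dag_state(path):
    """Return 'done', 'error' or None by scanning the node status file."""
    try:
        lines = open(path).read().splitlines()
    except IOError:
        return None                  # DAGMan has not written the file yet
    for line in lines:
        # DAG-level status line, e.g.  DagStatus = 5; /* "STATUS_DONE (success)" */
        if "DagStatus" in line:
            if "STATUS_DONE" in line:
                return "done"
            if "STATUS_ERROR" in line:
                return "error"
            return None              # DAG still running
    return None

while True:
    state = dag_state(STATUS_FILE)
    if state is not None:
        print("DAG finished: %s" % state)   # post-processing would go here
        break
    time.sleep(POLL_SECONDS)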

Either way, I need to allocate working directories outside the container, and have some external system which tracks which DAGs are running and deletes the working directories afterwards.
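A cleanup pass for that external system could be as small as the sketch below: it asks condor_q for the working directories (Iwd) of DAGMan jobs still in the queue (DAGMan runs in the scheduler universe, JobUniverse == 7) and removes any work directory that is no longer referenced. The WORK_ROOT layout, and the assumption that each DAG was submitted from its own subdirectory there, are mine; in practice you would also want to restrict the query to the submitting user.

import os
import shutil
import subprocess

WORK_ROOT = "/scratch/dag-workdirs"   # assumed: one subdirectory per DAG

# Iwd (initial working directory) of every DAGMan job still in the queue.
out = subprocess.check_output(
    ["condor_q", "-constraint", "JobUniverse == 7", "-autoformat", "Iwd"],
    universal_newlines=True)
active = set(line.strip() for line in out.splitlines() if line.strip())

for name in os.listdir(WORK_ROOT):
    path = os.path.join(WORK_ROOT, name)
    if os.path.isdir(path) and path not in active:
        shutil.rmtree(path)           # DAG no longer queued; reclaim its directory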

If the container itself were responsible for running the DAG then each container would *be* the working directory, and "docker ps" would be my list of running tasks. But having read a bit more about remote submission, I see there are a number of difficulties with this. The foremost seems to be that the jobs condor_dagman submits would need to be able to write their outputs inside the container, which in turn I think means condor_dagman itself would have to run inside the container; that would be a very non-standard way of deploying HTCondor, unless you also had a schedd running inside the container.

Thanks again,

Brian.