
Re: [HTCondor-users] condor_rm & the docker universe



On 7/30/2015 2:31 PM, Brian Bockelman wrote:

On Jul 30, 2015, at 11:40 AM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:

On 07/30/2015 10:01 AM, andrew.lahiff@xxxxxxxxxx wrote:
Hi Greg,

Ok, I didn't realize it worked like this - I had assumed HTCondor
would do something like "docker stop", rather than send a signal to the
actual executable running inside the container. Isn't this rather
unsafe? It makes it very easy for people to run jobs which escape
HTCondor's control - according to HTCondor the job has been killed but
the Docker container continues running for as long as it wants.


Greg can correct me if I am wrong, but I believe the signal is sent only to give the job a chance to shut down (vacate) gracefully. After HTCondor sends the signal, it sets a timer and follows up with a docker stop when that timer expires, so nothing is allowed to keep running forever. See the manual for MachineMaxVacateTime and JobMaxVacateTime; I think the default for these is 10 minutes. So to achieve today what you described above, I think you could submit your docker universe job with something like
  job_max_vacate_time = 2
and then HTCondor should do a docker stop two seconds after sending the signal if the container is still lingering. I think Greg is thinking about making the default JobMaxVacateTime for the docker universe much smaller than the current default of 10 minutes...
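
To make the "signal first, hard stop after a timer" idea concrete, here is a minimal Python sketch of the general pattern. It is an illustration only and not HTCondor's actual code: the function name vacate and its max_vacate_time argument are made up for this example, it acts on an ordinary local child process rather than a Docker container, and it assumes a POSIX system where terminate() sends SIGTERM.

```python
import subprocess


def vacate(proc, max_vacate_time):
    """Ask a child process to exit, then force it if it lingers.

    Hypothetical helper illustrating the pattern only; this is not how
    HTCondor is implemented. max_vacate_time is the grace period in seconds.
    """
    # Soft step: send SIGTERM so the process can clean up and exit on its own.
    proc.terminate()
    try:
        # Timer step: wait up to the grace period for a voluntary exit.
        proc.wait(timeout=max_vacate_time)
        return "graceful"
    except subprocess.TimeoutExpired:
        # Hard step: the process ignored the signal, so kill it outright
        # (SIGKILL cannot be caught or ignored).
        proc.kill()
        proc.wait()
        return "forced"
```

A process that ignores SIGTERM would come back as "forced" once the grace period runs out, which is the role the follow-up docker stop plays in the description above.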

regards
Todd