
Re: [HTCondor-users] condor_rm & the docker universe



Hi Todd,

It doesn't seem to me that HTCondor actually runs "docker stop" after 10 minutes. In the example below, HTCondor reports the job as removed after 10 minutes (*):

[root@vm168 ~]# condor_history 136.0
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
 136.0   alahiff         7/30 21:07   0+00:10:46 X         ???  ./wrapper.sh   

but the container is still running:

[root@vm168 ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
6a37981789b2        centos:6            "./wrapper.sh"      12 minutes ago      Up 12 minutes                           HTCJob136_0_slot1_2_PID32265 

With "job_max_vacate_time = 2" the same thing happens, just much sooner.

So at least for me the container is allowed to run forever if it wants, without HTCondor's knowledge.
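
For anyone who wants to check their own execute nodes for containers like this, here is a minimal sketch in Python. It assumes container names always follow the pattern of the single example above (HTCJob<cluster>_<proc>_<slot>_PID<pid>), which is only inferred from that one docker ps line and may differ between HTCondor versions. The names parse_container_name and leftover_containers are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple

# Assumed naming pattern, inferred only from the one container name in
# the "docker ps" output above: HTCJob136_0_slot1_2_PID32265
_NAME_RE = re.compile(r"^HTCJob(\d+)_(\d+)_(.+)_PID(\d+)$")


class CondorContainer(NamedTuple):
    job_id: str  # "cluster.proc", e.g. "136.0"
    slot: str    # e.g. "slot1_2"
    pid: int     # the number after "PID" in the container name


def parse_container_name(name: str) -> Optional[CondorContainer]:
    """Parse a container name, or return None if it does not match."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        return None
    cluster, proc, slot, pid = match.groups()
    return CondorContainer(f"{cluster}.{proc}", slot, int(pid))


def leftover_containers(
    container_names: Iterable[str], active_job_ids: Set[str]
) -> List[Tuple[str, CondorContainer]]:
    """Return containers whose job ID is not among the active jobs."""
    leftovers = []
    for name in container_names:
        parsed = parse_container_name(name)
        if parsed is not None and parsed.job_id not in active_job_ids:
            leftovers.append((name.strip(), parsed))
    return leftovers
```

The container names could come from something like docker ps --format '{{.Names}}', and the set of active job IDs from whatever HTCondor still reports as running on that machine. Anything the function returns is a container that HTCondor no longer tracks.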

Thanks,
Andrew.

(*)
000 (136.000.000) 07/30 21:07:37 Job submitted from host: <x.y.z.t:47771?addrs=x.y.z.t-47771>
...
001 (136.000.000) 07/30 21:07:38 Job executing on host: <x.y.z.t:60021?addrs=x.y.z.t-60021>
...
004 (136.000.000) 07/30 21:18:23 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        29  -  Run Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :                 1         1
           Disk (KB)            :        9        2   1890179
           Memory (MB)          :                 1         1
...
009 (136.000.000) 07/30 21:18:23 Job was aborted by the user.
        via condor_rm (by user alahiff)
...


________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Thursday, July 30, 2015 8:45 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor_rm & the docker universe

On 7/30/2015 2:31 PM, Brian Bockelman wrote:
>
>> On Jul 30, 2015, at 11:40 AM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
>>
>> On 07/30/2015 10:01 AM, andrew.lahiff@xxxxxxxxxx wrote:
>>> Hi Greg,
>>>
>>> Ok, I didn't realize it worked like this - I had assumed HTCondor
>>> would do something like "docker stop", rather than send a signal to the
>>> actual executable running inside the container. Isn't this rather
>>> unsafe? It makes it very easy for people to run jobs which escape
>>> HTCondor's control - according to HTCondor the job has been killed but
>>> the Docker container continues running for as long as it wants.
>>

Greg can correct me if I am wrong, but I believe the signal sending is
only to give the job a chance to "gracefully" shut down (vacate).  After
HTCondor sends the signals, it sets a timer to follow up with a docker
stop.  Thus nothing is allowed to continue running forever.  See the
manual for MachineMaxVacateTime and JobMaxVacateTime - I think the
default on these is 10 minutes.  So to achieve today what you stated
above, I think you could submit your docker universe job with something like
   job_max_vacate_time = 2
and then HTCondor should do a "docker stop" two seconds after sending the
signal if the instance is still lingering.  I think Greg is thinking
about changing the default JobMaxVacateTime to be much smaller for
docker universe than the default of 10 minutes...
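
The "signal first, hard stop after a timer" behaviour described above can be illustrated with a small Python sketch. This is not HTCondor's implementation, only the general pattern applied to an ordinary child process on a POSIX system. The function name stop_with_grace and its return values are invented for the example, and grace_seconds plays the role of the vacate time.

```python
import subprocess


def stop_with_grace(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask a process to exit, then force it if it does not comply.

    Returns "exited" if the process stopped within the grace period,
    or "killed" if it had to be force-killed afterwards.
    """
    proc.terminate()  # soft signal (SIGTERM on POSIX)
    try:
        # The grace period: the process may clean up and exit on its own.
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()   # hard stop (SIGKILL on POSIX), cannot be ignored
        proc.wait()   # reap the process so it does not linger as a zombie
        return "killed"
```

A process that ignores the soft signal still ends once the grace period expires, which is the guarantee being discussed in this thread: the follow-up stop is what keeps a job from running forever.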

regards
Todd