
[HTCondor-users] Condor and Docker Live-Restore



Dear Condor community,

 

I hope everyone is keeping well. At our site we depend on Docker as our containerisation technology, and this layer regularly needs patching with new version updates. Installing a new Docker version usually involves draining the execution point of jobs, patching, then returning the node to a production state. To try to reduce downtime, we've recently been experimenting with Docker's Live Restore functionality (https://docs.docker.com/config/containers/live-restore/). The outcome of this testing has been mostly positive: containers remain running with no service impact while Docker is updated or restarted.
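
For anyone unfamiliar with the feature, live restore is switched on with the `live-restore` key in the Docker daemon configuration file. The following is a minimal Python sketch of setting that key while keeping any existing settings; the helper name `enable_live_restore` and the example `log-driver` entry are illustrative assumptions, not part of our actual setup.

```python
import json


def enable_live_restore(config_text: str) -> str:
    """Return daemon.json text with "live-restore" set to true.

    Hypothetical helper for illustration: existing keys are preserved,
    and an empty file is treated as an empty configuration.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    config["live-restore"] = True
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example input only; a real daemon.json will contain site-specific keys.
    print(enable_live_restore('{"log-driver": "json-file"}'))
```

The resulting text would typically be written to `/etc/docker/daemon.json` (the usual default location on Linux, assumed here), after which the daemon needs to reload its configuration for the setting to take effect.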

 

However, I found that the startd loses track of all running jobs on the execution point if Docker is restarted or updated in this live-restore fashion. This leaves the environment in a state where all containers are still running and functioning, while commands such as `condor_who` return no results. Is there a feature within Condor that would make the startd "live-restore" aware, so that it keeps its list of running jobs/containers across a Docker restart instead of losing them?

 

Any help in this area will be gratefully received and many thanks in advance.

 

Best wishes,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
