[HTCondor-users] Extend Docker container removal timeout



Hi all,

 

I hope everyone is keeping well. A quick question for the community: we are seeing intermittent timeouts when removing Docker containers on our worker nodes, with the logs showing the following:

 

condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)

 

Is there a knob or config option for extending the removal timeout for containers and jobs on the startds? Docker does eventually remove the container, but as the workers have very high I/O at times, a node may need more time to respond to the startd. We're running HTCondor 9.0.15 across the pool.

 

Many thanks in advance,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
