
Re: [HTCondor-users] Extend Docker container removal timeout



Hi Thomas,

Looking at the log message you posted, it seems that the docker command is hanging and that the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If condor were declaring the timeout, you would see a message like the one you provided, followed immediately by "Declaring a hung docker".
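
To illustrate how a wrapper could be the component that gives up first, here is a minimal sketch of a wrapper that enforces its own timeout around a command. It is purely hypothetical: the contents of the real /usr/local/bin/docker.py are not shown in this thread, and the names run_with_timeout, WRAPPER_TIMEOUT_SECONDS and TIMEOUT_EXIT_CODE, as well as their values, are assumptions made for this example.

```python
import subprocess
import sys

# Assumed values for illustration only; the real wrapper's settings are unknown.
WRAPPER_TIMEOUT_SECONDS = 20
TIMEOUT_EXIT_CODE = 1


def run_with_timeout(argv, timeout=WRAPPER_TIMEOUT_SECONDS):
    """Run argv and return its exit code, or TIMEOUT_EXIT_CODE if it hangs.

    subprocess.run kills the child and raises TimeoutExpired once the
    timeout elapses, so a slow "docker rm" would be reported as a failure
    even if the underlying operation would eventually have succeeded.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        print("wrapper: command exceeded %s seconds" % timeout, file=sys.stderr)
        return TIMEOUT_EXIT_CODE
    return completed.returncode


def main(args):
    """Forward the given arguments to the docker CLI under the wrapper timeout."""
    return run_with_timeout(["docker"] + list(args))
```

A script built this way would end with sys.exit(main(sys.argv[1:])). If the real wrapper behaves similarly, its own limit would need raising in addition to any timeout on the condor side.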

What is the reason for using a wrapper python script around the docker commands?

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 23, 2023 5:55 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Extend Docker container removal timeout
 

Hi all,

 

I hope everyone is keeping well. Quick question for the community: we are seeing intermittent timeouts when removing containers on nodes, with the logs showing the following:

 

condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)

 

Is there a knob or config option for extending the removal timeout for containers and jobs on startds? Docker does eventually remove the container, but as the workers have very high I/O at times, the node may need more time to respond to the startd. We’re running Condor 9.0.15 across the pool.

 

Many thanks in advance,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
