
Re: [HTCondor-users] Extend Docker container removal timeout



Hi Thomas. 
In my experience, this usually comes down to shared storage that is not responding. Sometimes memory corruption kills a process related to the storage subsystem. Have a look at dmesg.
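
A minimal sketch of that kind of dmesg check follows. The helper name `suspicious_lines` and the pattern list are illustrative assumptions, not an exhaustive set of symptoms: they cover kernel hung-task warnings, out-of-memory kills, and I/O errors.

```python
import re

# Illustrative substrings only (an assumption, not an exhaustive list):
# hung-task warnings, OOM kills, and I/O errors reported by the kernel.
PATTERNS = re.compile(
    r"blocked for more than \d+ seconds|Out of memory|I/O error",
    re.IGNORECASE,
)


def suspicious_lines(dmesg_text):
    """Return the lines of dmesg output that match any pattern above."""
    return [line for line in dmesg_text.splitlines() if PATTERNS.search(line)]
```

It takes the text of the kernel log as a string, for example the captured output of running `dmesg` on the affected worker node, and returns only the matching lines.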

Thanks 
David. 




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, May 26, 2023, 11:04
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Extend Docker container removal timeout

Hi all,

 

Thank you for the responses. We don’t explicitly set any timeouts within Condor or our Docker wrapper. I managed to see the error occur in real time and captured the logs more completely this time around. I do see the statement "Declaring a hung docker"; the logs are below:

 

May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[1318016]: condor_read(): timeout reading 1 bytes from Docker Socket.

May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[1004513]: condor_read(): timeout reading 1 bytes from Docker Socket.

May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[252549]: condor_read(): timeout reading 1 bytes from Docker Socket.

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob3189095_0_slot1_38_PID2292872': 'Timed out waiting for program to exit' (110)

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Declaring a hung docker

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: DockerAPI::rm returned docker_hung. Taking Docker universe offline

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: OfflineUniverses = {"Docker"}

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: slot1_38: State change: starter exited

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: slot1_38: Changing activity: Busy -> Idle

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_UNREGISTER_FAMILY

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: unregistering family with root pid 2292872

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_GET_USAGE for pid 2325658

May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_GET_USAGE for pid 2380875
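
To find how often this happens across nodes, a log excerpt like the one above can be scanned for the two startd messages that mark the event. This is a minimal sketch: the function name `hung_docker_events` is hypothetical, and the regular expression assumes the syslog-style prefix shown in these lines.

```python
import re

# Syslog-style prefix as seen in the excerpt above:
# "May 25 16:34:41 host daemon[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

# Message fragments taken from the log lines quoted in this thread.
MARKERS = ("Declaring a hung docker", "Taking Docker universe offline")


def hung_docker_events(lines):
    """Return (timestamp, host, message) for each startd hung-Docker line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("daemon") != "condor_startd":
            continue
        if any(marker in m.group("msg") for marker in MARKERS):
            events.append((m.group("ts"), m.group("host"), m.group("msg")))
    return events
```

Passing it the lines of a log file returns one tuple per matching line, which can then be grouped by host or by hour.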

 

The Docker daemon does (eventually) recover. Is there some recovery process within Condor that can clear the “DockerOffline*” ClassAds when the system is stable again? Needless to say, I’m still looking for the reason our Docker instance becomes unavailable.

 

Many thanks,

 

Tom

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jose Caballero <jcaballero.hep@xxxxxxxxx>
Date: Friday, 26 May 2023 at 08:29
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Extend Docker container removal timeout

Hi Cole,

 

Our wrapper script is mostly legacy. It was created many years ago, and I believe the native support for containers in HTCondor was not as sophisticated as it has become recently. 

It mostly sets some parameters for the docker commands, in particular env vars for specific users. Do I understand correctly that all of that can now be done using ClassAds?

 

That said, and Tom can correct me if I'm wrong, I don't believe we have any timeout in that wrapper script. Not explicitly at least.

 

Cheers,

Jose

 

 

 

 

 

On Thu, 25 May 2023 at 23:26, Cole Bollig via HTCondor-users (<htcondor-users@xxxxxxxxxxx>) wrote:

Hi Thomas,

 

Looking at the log message posted, it seems like the docker command is getting hung and the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If it were Condor declaring the timeout, you would see a message like the one provided, followed immediately by "Declaring a hung docker".

 

What is the reason for using a wrapper python script around the docker commands?

 

-Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 23, 2023 5:55 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Extend Docker container removal timeout

 

Hi all,

 

I hope everyone is keeping well. Quick question for the community: we have intermittent timeouts for containers on nodes, with the logs showing the following:

 

condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)

 

Is there a knob / config option for extending the removal timeout for containers and jobs on startds? Docker does eventually remove the container, but as the workers have very high I/O at times, the node may need more time to respond to the startd. We’re running Condor 9.0.15 across the pool.
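
To put a number on how much extra time a node needs under load, the removal command can be timed outside of Condor. The following is a minimal sketch using only the Python standard library. The helper name `timed_run` is hypothetical, and any timeout value passed to it is an arbitrary choice, not a Condor default.

```python
import subprocess
import time


def timed_run(cmd, timeout_s):
    """Run cmd and return (exit_code_or_None, elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, timeout=timeout_s, check=False
        )
        return proc.returncode, time.monotonic() - start, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return None, time.monotonic() - start, True
```

Calling it with a command list such as `["docker", "rm", "-f", "-v", "<container>"]` against a disposable test container during a period of high I/O would show whether removal finishes just after the point where the startd gives up or takes far longer.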

 

Many thanks in advance,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 


 

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/