
Re: [HTCondor-users] Docker hang re-evaluation



Hi Todd, David

 

Thank you for the information. I agree that serializing the Docker requests would help alleviate the pressure on the daemon. I also think David's point may well be valid: there could be a race condition between an exiting container and a process still holding a lock on it. On one or two occasions I have seen a container left in an exited state but not removed, potentially indicating that Docker was stuck waiting on 'docker rm <id>'. After a time this recovers and the daemon becomes responsive again.
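
For reference, when this happens the stuck containers are usually visible from the shell; a quick check along these lines (the container ID below is a placeholder):

docker ps -a --filter status=exited   # containers that have exited but not been removed
docker rm <container-id>              # clean one up by hand once the daemon is responsive again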

 

In our case, having Condor set ATTR_HAS_DOCKER = false on a node does lead to a long period of that node draining and not running work before the Condor daemon is restarted or the node is rebooted. Ideally the Docker daemon would be retested on a configurable schedule: if the execution point takes Docker offline, a Condor knob would specify the retest interval, something along the lines of DOCKER_RETEST_PERIOD = 10 (retest every 10 minutes while the universe is offline). Leaving the knob unset would keep the current behaviour of Condor and leave the container universe offline.
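
To sketch what I mean in config terms (this knob does not exist today; the name and units are only a suggestion):

# Hypothetical knob: if the startd has taken the Docker universe offline,
# re-probe the Docker daemon every 10 minutes and bring the universe back
# online if the probe succeeds. Unset = current behaviour (stay offline).
DOCKER_RETEST_PERIOD = 10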

 

We have seen execution points become sufficiently loaded that the Docker daemon takes longer than 2 minutes to respond; it's rare, but it has happened. We run local xrootd proxies on our execution points, so it's not unreasonable to expect an I/O constraint when a worker is fully loaded. This has no effect on currently running payloads but does affect interaction with Docker during that time. I do see how a “flapping” situation could occur with such a config option, and perhaps that leads to the wider question of optimising our job flow to space out job executions across these larger nodes and reduce bursts of load.

 

Many thanks,

 

Tom

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Dudu Handelman <duduhandelman@xxxxxxxxxxx>
Date: Thursday, 16 November 2023 at 19:51
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Docker hang re-evaluation

Thanks Todd. 

In my case it's a storage issue more than 99 percent of the time.

I think it would be good to serialize the Docker requests.

 

David. 

 

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Todd Tannenbaum via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, November 16, 2023 9:38:59 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Docker hang re-evaluation

 


Hi folks,

Agree with everyone that when the execution point takes Docker offline because it is unhealthy, it would be awfully nice if it periodically retested to see if it recovered.  However, in this particular instance, my guess is that on a heavily loaded EP (i.e. 250 cores, short jobs, ...) this is likely to result in continuously oscillating between healthy/sick/healthy/sick.  What do folks think of the idea that the EP serializes requests to the Docker daemon?  I.e., instead of bombarding it with 250 requests in a short period of time, it only has X number of requests in-flight at any point?  Or do the issues you've seen have more to do with an overloaded file system (i.e. the volume holding all the Docker images) rather than an overloaded Docker daemon, in which case serializing Docker requests would have less impact?
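
As a rough illustration of the serialization idea: you can already force a single in-flight request per node today, outside of HTCondor, by pointing the DOCKER knob at a small wrapper that takes a node-wide lock before calling the real CLI. The wrapper path and lock file below are just placeholders:

#!/bin/sh
# hypothetical /usr/local/bin/docker-serialized: allow at most one docker
# command at a time by holding a node-wide flock around the real binary
exec flock /var/lock/htcondor-docker.lock /usr/bin/docker "$@"

Of course, any long-running docker invocation would then block everything else behind the lock, so this is only to illustrate the concept (X = 1), not a recommended workaround; a bounded pool of X > 1 in-flight requests would need more machinery.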


Oh, and answering Jose's question: Unfortunately, the timeout HTCSS uses for interacting with Docker is currently hard-coded at 120 seconds (2 minutes).  If we were to make this a configurable parameter, do you feel it is ever reasonable to set this above 2 minutes?  I.e. have you observed Docker being so overloaded that a timeout of more than 2 minutes would still result in a successful operation?
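
If you want to gauge that on a busy EP, something as simple as the following, run while the node is at peak load, gives a feel for whether 120 seconds is ever exceeded ('timeout' here is the coreutils command; it exits with status 124 if the limit is hit):

time timeout 120 docker ps > /dev/null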

Thanks for your feedback/input!

regards,
Todd



On 11/16/2023 12:11 PM, Dudu Handelman wrote:

Hi All. 

Yes I have seen it. 

Most of the time it relates to a storage issue. For example, a job is running and the user decides to remove it, so Condor runs docker stop/rm and Docker tries to kill the process while that process is stuck in a close/write/open call; only when the system call returns will the process stop. So the timeout is reasonable.

 

I think we need a periodic Docker check that will bring the docker universe back online.
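
The probe itself is cheap; a sketch of the kind of check I have in mind, with the timeout value just an example:

#!/bin/sh
# exit 0 if the Docker daemon answers within 60 seconds, non-zero otherwise
timeout 60 docker info > /dev/null 2>&1

How the startd would use that result to bring the universe back online is the part that has to live inside HTCondor.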

 

Thanks 

David 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jose Caballero <jcaballero.hep@xxxxxxxxx>
Sent: Thursday, November 16, 2023 9:40:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Docker hang re-evaluation

 

Hi,

 

Has anybody else seen this behaviour? If so, how did you fix it?

Or is there some ClassAd attribute or configuration knob for the timeout that can be adjusted?

Any comment is more than welcome.

 

Cheers,

Jose

 

On Tue, 14 Nov 2023 at 15:28, Thomas Birkett - STFC UKRI via HTCondor-users (<htcondor-users@xxxxxxxxxxx>) wrote:

Hi all,

 

Hope everyone is keeping well. I have an interesting issue/irregular situation that occurs on our worker nodes. We currently run Docker containers on our workers with Condor 10.0.9. Some of our newer worker nodes can run ~250 jobs per physical node, which can lead to a highly loaded system. Because of this, there are times when Docker can be slow to respond or give the impression of a hang, leading to the following ClassAds on the Startd:

 

DockerOfflineReason = Docker hung trying to rm an orphaned container

And sets ATTR_HAS_DOCKER = false

 

Looking at the source I see this behaviour defined: https://github.com/htcondor/htcondor/blob/main/src/condor_startd.V6/util.cpp#L244C34-L244C34

 

As the Docker hang is often recoverable, is there functionality in Condor to re-evaluate Docker's status without having to restart the Condor daemon or manually amend these ClassAds?
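
For context, at the moment we spot affected nodes by looking for the relevant attributes in the slot ads, for example:

condor_status -long <hostname> | grep -Ei 'HasDocker|DockerOfflineReason'

and the only recovery we have found so far is to restart the Condor daemons on the node or reboot it.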

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 


 

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/







-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257