Re: [HTCondor-users] Docker hang re-evaluation



Hi,

Has anybody else seen this behaviour? If so, how did you fix it?
Or is there some ClassAd attribute for the timeout that can be adjusted?
Any comment is more than welcome.

Cheers,
Jose

On Tue, 14 Nov 2023 at 15:28, Thomas Birkett - STFC UKRI via HTCondor-users (<htcondor-users@xxxxxxxxxxx>) wrote:

Hi all,

Hope everyone is keeping well. I have an intermittent issue that occurs on our worker nodes. We currently run Docker containers on our workers with Condor 10.0.9. Some of our newer worker nodes can run ~250 jobs per physical node, which can lead to a highly loaded system. As a result, Docker is sometimes slow to respond or appears to hang, and the Startd then advertises the following ClassAd attributes:

DockerOfflineReason = Docker hung trying to rm an orphaned container

The Startd also sets ATTR_HAS_DOCKER = false
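
For anyone wanting to spot affected nodes, here is a minimal Python sketch that checks a machine ad, given as plain "Attr = value" text, for this state. It assumes that ATTR_HAS_DOCKER is advertised under the name HasDocker and that the ad text was saved from something like `condor_status -long <host>`. The hostname and the helper names are hypothetical, and the parser is deliberately naive rather than a full ClassAd parser.

```python
def parse_classad_text(text):
    """Parse simple 'Attr = value' lines into a dict of raw strings.

    Naive by design: it does not evaluate ClassAd expressions.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip().strip('"')
    return ad


def docker_offline_reason(ad):
    """Return the offline reason if the ad reports Docker as unavailable, else None."""
    # Assumption: ATTR_HAS_DOCKER appears in the ad as "HasDocker".
    has_docker = ad.get("HasDocker", "false").lower() == "true"
    if has_docker:
        return None
    return ad.get("DockerOfflineReason", "unknown")


# Hypothetical machine ad fragment for illustration only.
sample = '''Machine = "wn001.example.org"
HasDocker = false
DockerOfflineReason = "Docker hung trying to rm an orphaned container"'''

ad = parse_classad_text(sample)
print(ad["Machine"], "->", docker_offline_reason(ad))
```

This only reports the state the Startd has advertised. It does not make the Startd re-evaluate Docker.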

Looking at the source I see this behaviour defined: https://github.com/htcondor/htcondor/blob/main/src/condor_startd.V6/util.cpp#L244C34-L244C34

As the Docker hang is ofttimes recoverable, is there functionality in Condor to re-evaluate Docker's status without having to restart the Condor daemon or manually amending these ClassAds?

Many thanks,

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/