Re: [HTCondor-users] Extend Docker container removal timeout



Hi Cole,

Our wrapper script is mostly legacy. It was created many years ago, when I believe the native container support in HTCondor was not as sophisticated as it has become recently.
It mostly sets some parameters for the docker commands, in particular env vars for specific users. Do I understand correctly that all of that can now be done using ClassAds?

That said, and Tom can correct me if I'm wrong, I don't believe we have any timeout in that wrapper script. Not explicitly, at least.
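
For anyone checking a wrapper like this, an explicit timeout in Python usually shows up as a `timeout=` argument to a `subprocess` call. The following is only a minimal sketch of that pattern, not the contents of docker.py: the function name `run_with_timeout` and the return value 110 (chosen to mirror the number in the startd log line) are assumptions made for illustration.

```python
import subprocess
import sys

# Hypothetical exit code, chosen only to mirror the "(110)" in the startd log.
TIMEOUT_EXIT_CODE = 110


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it runs too long."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return TIMEOUT_EXIT_CODE
    return completed.returncode


# Hypothetical use inside a wrapper: forward the arguments to the real docker binary.
# sys.exit(run_with_timeout(["docker"] + sys.argv[1:], timeout_s=120))
```

If nothing like this appears in the wrapper, and no library it calls applies its own timeout, the limit is being imposed somewhere else.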

Cheers,
Jose





On Thu, 25 May 2023 at 23:26, Cole Bollig via HTCondor-users (<htcondor-users@xxxxxxxxxxx>) wrote:
Hi Thomas,

Looking at the log message posted, it seems that the docker command is hanging and that the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If it were condor declaring the timeout, you would see a message like the one provided, followed immediately by "Declaring a hung docker".

What is the reason for using a wrapper python script around the docker commands?

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 23, 2023 5:55 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Extend Docker container removal timeout

Hi all,

I hope everyone is keeping well. Quick question for the community: we have intermittent timeouts when removing containers on nodes, with the logs showing the following:

condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)

Is there a knob / config option that exists for extending the removal timeout value for containers and jobs on startd's? Docker does eventually remove the container but as the workers have very high I/O at times, the node may need more time to supply a response to the startd. We're running Condor 9.0.15 across the pool.

Many thanks in advance,

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/