
Re: [HTCondor-users] Hanging docker jobs after finished processes



Hi all,

I have updated docker and HTCondor on some machines.

Docker version 18.06.1-ce, build e68fc7a
CondorVersion: 8.7.9 Jul 31 2018 BuildID: 446081
Kernel: 3.10.0-693.11.6.el7.x86_64 #1 SMP

I see the same problem with these versions as well. The processes complete successfully, but the Docker container is still there. The machine with the mainline kernel currently does not show this problem, so I will update the kernel on the other machines to the current mainline kernel.

Cheers,

Matthias

On 9/17/18 8:58 PM, Matthias Schnepf wrote:

Hi Todd,

Currently, we have different kinds of jobs where the processes finish successfully but the surrounding Docker container is still there:

 0.00 B/s 18710 condor 20 0 68168 6784 5420 S 0.0 0.0 0:05.06 │ │ ├─ condor_starter -f -a slot1_7 schedd1
 0.00 B/s 18714 condor 20 0 857M 9588 6052 S 0.0 0.0 0:01.28 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_PID1
 0.00 B/s 19037 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 19036 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18995 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18994 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18982 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18981 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18953 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18952 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.02 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18941 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18940 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18759 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18758 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18757 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18727 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18726 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18725 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18724 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18723 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18722 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18721 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18720 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18719 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.04 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18718 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.05 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18717 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18716 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.00 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18715 condor 20 0 857M 9588 6052 S 0.0 0.0 0:00.03 │ │ │ ├─ /usr/bin/docker run --cpu-shares=10 --memory=3072m --cap-drop=all --hostname mschnepf-15764.0-worker1 --name HTCJob15764_0_slot1_7_P
 0.00 B/s 18614 condor 20 0 68168 6724 5420 S 0.0 0.0 0:05.22 │ │ ├─ condor_starter -f -a slot1_6 schedd1

This happened to all jobs on a machine.
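
To match leftover containers back to jobs, the container names in the listing above can be parsed. The following is only a rough sketch: it assumes the names follow the pattern HTCJob<cluster>_<proc>_<slot>_PID<pid>, which is inferred from the single (truncated) name visible above and is not confirmed by the HTCondor documentation here. The function name parse_job_container and the sample name with PID 12345 are hypothetical.

```python
import re

# Assumed pattern, inferred from "HTCJob15764_0_slot1_7_PID..." in the listing above.
_NAME_RE = re.compile(r"^HTCJob(\d+)_(\d+)_(.+)_PID(\d+)$")


def parse_job_container(name):
    """Split a container name into (job_id, slot, pid).

    Returns None if the name does not match the assumed pattern.
    """
    match = _NAME_RE.match(name.strip())
    if match is None:
        return None
    cluster, proc, slot, pid = match.groups()
    return f"{cluster}.{proc}", slot, int(pid)


def leftover_jobs(container_names):
    """Keep only the names that look like HTCondor job containers."""
    parsed = (parse_job_container(n) for n in container_names)
    return [p for p in parsed if p is not None]
```

The names could come from something like docker ps -a --format '{{.Names}}', provided the daemon still responds. With the hypothetical name HTCJob15764_0_slot1_7_PID12345, the sketch yields job 15764.0 on slot1_7.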

The HTCondor version in use is 8.6.5 (Aug 05 2017, BuildID: 412177) and the Docker version is 17.05.0-ce, build 89658be. All machines run CentOS 7 with kernel 3.10.0-693.11.6.el7.x86_64. On one machine I installed the mainline kernel 4.18.7-1.el7.elrepo.x86_64; however, the problem also occurs on that machine.

When this happens, commands such as docker ps hang. After a restart of the Docker daemon, everything works for a while.
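
Because docker ps itself hangs in this state, a check for the condition needs a timeout. A minimal sketch, assuming Python 3 is available on the worker node; the function name command_hangs and the 10-second default are illustrative choices, not anything provided by HTCondor or Docker:

```python
import subprocess


def command_hangs(cmd, timeout_s=10.0):
    """Return True if cmd does not finish within timeout_s seconds.

    subprocess.run kills the child process when the timeout expires,
    so a stuck client does not pile up between checks.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # a non-zero exit code is not a hang
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

On an affected node, command_hangs(["docker", "ps"]) would presumably return True until the Docker daemon is restarted. The sketch only detects the hang; it does not explain or fix it.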

Has anyone seen the same or a similar problem and found a solution?

Cheers and thanks,

Matthias

