
[HTCondor-users] docker jobs don't start because of a corrupted .startd_docker_images file



Hi all,

On some of our worker nodes, docker jobs did not start. In full-debug mode, the starter log showed:

...
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Completed DC_CHILDALIVE to daemon at <129.13.101.177:9618>
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) Sending GoAhead for 129.13.101.141 to send /var/lib/condor/execute/dir_14497/condor_exec.exe and all further files.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) DaemonCore: Leaving SendAliveToParent() - success
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) Received GoAhead from peer to receive /var/lib/condor/execute/dir_14497/condor_exec.exe.
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file(): going to write to filename /var/lib/condor/execute/dir_14497/condor_exec.exe
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file: Receiving 2127 bytes
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file: wrote 2127 bytes to file
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) ReliSock::get_file_with_permissions(): going to set permissions 777
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) DaemonCore: No more children processes to reap.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) File transfer completed successfully.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Calling client FileTransfer handler function.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) HOOK_PREPARE_JOB not configured.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Job 568739.0 set to execute immediately
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Starting a VANILLA universe job with ID: 568739.0
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) In OsProc::OsProc()
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job KillSignal: 15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job RmKillSignal: 15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job HoldKillSignal: 15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Cmd: 'condor_exec.exe'
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Input file: /dev/null
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_14497/_condor_stdout
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_14497/_condor_stderr
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Adding /cvmfs:/cvmfs as a docker volume to mount
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) About to exec docker:./condor_exec.exe
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) FileLock object is updating timestamp on: /var/log/condor/.startd_docker_images
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) FileLock::obtain(1) - @1522828151.058259 lock on /var/log/condor/.startd_docker_images now WRITE
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Found 32 entries in docker image cache.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run: /usr/bin/docker rmi \n
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run: '/usr/bin/docker images -q \n'.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run: /usr/bin/docker rmi 0
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run: '/usr/bin/docker images -q <80>'.
Stack dump for process 14497 at timestamp 1522828152 (16 frames)
/lib64/libcondor_utils_8_6_5.so(dprintf_dump_stack+0x72)[0x7f1c7ce6ee32]
/lib64/libcondor_utils_8_6_5.so(_Z18linux_sig_coredumpi+0x24)[0x7f1c7cff9434]
/lib64/libpthread.so.0(+0xf5e0)[0x7f1c7b5485e0]
/lib64/libc.so.6(gsignal+0x37)[0x7f1c7b1ab1f7]
/lib64/libc.so.6(abort+0x148)[0x7f1c7b1ac8e8]
/lib64/libc.so.6(+0x74f47)[0x7f1c7b1eaf47]
/lib64/libc.so.6(+0x7c619)[0x7f1c7b1f2619]
/lib64/libcondor_utils_8_6_5.so(_ZN9DockerAPI3runERN14compat_classad7ClassAdES2_RKSsS4_S4_RK7ArgListRK3EnvS4_St4listISsSaISsEERiPiR11CondorError+0xeac)[0x7f1c7ce3334c]
condor_starter(_ZN10DockerProc8StartJobEv+0xb66)[0x4547b6]
condor_starter(_ZN8CStarter8SpawnJobEv+0xc3)[0x45b8c3]
condor_starter(_ZN8CStarter14SpawnPreScriptEv+0x197)[0x4598c7]
/lib64/libcondor_utils_8_6_5.so(_ZN12TimerManager7TimeoutEPiPd+0x182)[0x7f1c7cff8712]
/lib64/libcondor_utils_8_6_5.so(_ZN10DaemonCore6DriverEv+0x9cb)[0x7f1c7cfdc7fb]
/lib64/libcondor_utils_8_6_5.so(_Z7dc_mainiPPc+0x13a4)[0x7f1c7cffcaa4]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1c7b197c05]
condor_starter[0x422840]


I removed the file /var/log/condor/.startd_docker_images on the affected worker nodes, and docker jobs now run correctly again. I have attached one of these files. The corrupted files begin with several line breaks.
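
In case it helps with checking other nodes, here is a rough Python sketch that flags suspicious lines in such a file. It assumes the file holds one image name per line, which is only my guess from the attachment and from the "docker rmi" calls in the log, not a documented format. The function name suspicious_entries is made up for this example.

```python
import string
import sys

# Printable ASCII bytes, excluding control whitespace (tab, CR, etc.).
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")


def suspicious_entries(data: bytes):
    """Return (line_number, raw_line) pairs that look corrupted.

    Assumption: one image name per line. A line counts as suspicious
    if it is empty/whitespace-only or contains a non-printable byte.
    """
    lines = data.split(b"\n")
    # A trailing newline yields one empty final element; it is not an entry.
    if lines and lines[-1] == b"":
        lines.pop()
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or any(byte not in PRINTABLE for byte in line):
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Usage: python check_cache.py /var/log/condor/.startd_docker_images
    with open(sys.argv[1], "rb") as handle:
        for number, line in suspicious_entries(handle.read()):
            print(f"line {number}: {line!r}")
```

The script only reads the file and reports; it does not modify or delete anything. On the broken nodes I would expect it to list the leading empty lines.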

I had removed some old docker images on those machines. Could this have corrupted the .startd_docker_images file?

Cheers,

Matthias
































mschnepf/slc6-condocker