
[HTCondor-users] Docker Universe jobs failing because of a file transfer problem



Hi,

 

This issue appears both when using the htcondor/mini image and when using the trio of htcondor/cm + htcondor/submit + htcondor/execute images.

What I do is the following:

  1. Launch the docker containers using the images. If using htcondor/mini, I use the host network and mount /var/run/docker.sock in it.
    If using the other images, I connect them to a docker network (condor-network) that I created beforehand. I also mount /var/run/docker.sock into the execute containers.
  2. In the relevant container, I run chmod 666 /var/run/docker.sock, then condor_restart
  3. I use condor_status slot1@xxxxxxxxxxxxxxxxx -json | grep Has to check the presence of the "HasDocker" property
  4. I submit the following job:
    universe              = docker
    docker_image          = python:3.8.10
    should_transfer_files = yes
    executable            = /usr/bin/python
    arguments             = test.py
    transfer_input_files  = test.py
    output                = test_docker.out
    error                 = test_docker.err
    log                   = test_docker.log
    initial_dir           = /tmp
    queue 1

    Where test.py just prints "LOL" to stdout
  5. I wait for the job to finish
  6. I check the /tmp/test_docker.out file: it is empty
  7. I check the /tmp/test_docker.err file: I get "/usr/bin/python: can't open file 'test.py': [Errno 2] No such file or directory"
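For reference, the test.py transferred in step 4 is equivalent to this one-liner (the literal "LOL" is the only output the job should produce, and it should come back as test_docker.out):

```python
# test.py - the whole job payload: one line to stdout, which HTCondor's
# file transfer should bring back to the submit side as test_docker.out.
message = "LOL"
print(message)
```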

 

A few more details:

  • When I try to run the same job on a bare metal pool instead of one running in Docker, it works and the job succeeds
  • The ShadowLog on the submit node looks like this:
    04/12/23 17:27:36 Initializing a VANILLA shadow for job 5.0
    04/12/23 17:27:36 (5.0) (9682): LIMIT_DIRECTORY_ACCESS = <unset>
    04/12/23 17:27:36 (5.0) (9682): Request to run on slot1_1@xxxxxxxxxxxxxxxxx <172.18.0.4:9618?CCBID=172.18.0.2:9618%3faddrs%3d172.18.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#102&PrivNet=condor.test.node1&addrs=172.18.0.4-9618&alias=condor.test.node1&noUDP&sock=startd_157_4475> was ACCEPTED
    04/12/23 17:27:36 (5.0) (9682): File transfer completed successfully.
    04/12/23 17:27:40 (5.0) (9682): File transfer completed successfully.
    04/12/23 17:27:42 (5.0) (9682): Job 5.0 terminated: exited with status 2
    04/12/23 17:27:42 (5.0) (9682): Reporting job exit reason 100 and attempting to fetch new job.
    04/12/23 17:27:42 (5.0) (9682): **** condor_shadow (condor_SHADOW) pid 9682 EXITING WITH STATUS 100
  • I tried variants of this with other docker images than the python one, with the same result
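The message in test_docker.err and the "exited with status 2" line in the ShadowLog match exactly what CPython emits when the script it is asked to run is absent from its working directory. A quick reproduction outside HTCondor (plain Python, no condor involved; the empty temporary directory stands in for the job's scratch directory inside the container):

```python
# Reproduce the failure mode from test_docker.err: ask the interpreter to run
# a script that does not exist in its working directory.
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Same invocation shape as the submit file (executable = python,
    # arguments = test.py), but with no test.py present - as if the
    # transferred input file never made it into the container.
    proc = subprocess.run(
        [sys.executable, "test.py"],
        cwd=scratch,
        capture_output=True,
        text=True,
    )

print(proc.returncode)       # 2, matching the ShadowLog's exit status
print(proc.stderr.strip())   # "can't open file ... [Errno 2] ..."
```

So even though the shadow logs "File transfer completed successfully", test.py does not appear to be present in the job's working directory inside the container when the interpreter starts.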

 

I have unit tests that depend on this docker pool to run in a GitLab CI, and they obviously can't pass if every job fails to transfer its files.

Thanks,

 

Gaëtan

 


Gaetan Geffroy
Junior Software Engineer, Space

Terma GmbH
Europaarkaden II, Bratustraße 7, 64293 Darmstadt, Germany
T +49 6151 86005 43 (direct) | T +49 6151 86005-0
Terma GmbH - Sitz Darmstadt | Handelsregister Nr.: HRB 7411, Darmstadt
Geschäftsführer: Poul Vigh / Steen Vejby Sørensen
www.terma.com
LinkedIn | Twitter | Instagram | Youtube


Attention:
This e-mail (and attachment(s), if any) - intended for the addressee(s) only - may contain confidential, copyright, or legally privileged information or material, and no one else is authorized to read, print, store, copy, forward, or otherwise use or disclose any part of its contents or attachment(s) in any form. If you have received this e-mail in error, please notify me by telephone or return e-mail, and delete this e-mail and attachment(s). Thank you.