
[HTCondor-users] Docker Universe jobs failing because of a file transfer problem



Hi,

 

This issue appears both when using the htcondor/mini image and when using the trio of htcondor/cm + htcondor/submit + htcondor/execute images.

What I do is the following:

  1. Launch the docker containers using the images. If using htcondor/mini, I use the host network and mount /var/run/docker.sock in it.
    If using the other images, I connect them to a docker network (condor-network) that I created beforehand. I also mount /var/run/docker.sock into the execute containers.
  2. In the relevant container, I run chmod 666 /var/run/docker.sock, then condor_restart
  3. I use condor_status slot1@xxxxxxxxxxxxxxxxx -json | grep Has to check the presence of the "HasDocker" property
  4. I submit the following job:
    universe              = docker
    docker_image          = python:3.8.10
    should_transfer_files = yes
    executable            = /usr/bin/python
    arguments             = test.py
    transfer_input_files  = test.py
    output                = test_docker.out
    error                 = test_docker.err
    log                   = test_docker.log
    initial_dir           = /tmp
    queue 1

    Where test.py just prints "LOL" to stdout
  5. I wait for the job to finish
  6. I check the /tmp/test_docker.out file: it is empty
  7. I check the /tmp/test_docker.err file: I get "/usr/bin/python: can't open file 'test.py': [Errno 2] No such file or directory"
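For reference, the test.py transferred in step 4 is equivalent to this one-liner (the literal "LOL" is the only output the job should produce, and it should come back as test_docker.out):

```python
# test.py - the whole job payload: one line to stdout, which HTCondor's
# file transfer should bring back to the submit side as test_docker.out.
message = "LOL"
print(message)
```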

 

A few more details:

  • When I try to run the same job on a bare metal pool instead of one running in Docker, it works and the job succeeds
  • The ShadowLog on the submit node looks like this:
    04/12/23 17:27:36 Initializing a VANILLA shadow for job 5.0
    04/12/23 17:27:36 (5.0) (9682): LIMIT_DIRECTORY_ACCESS = <unset>
    04/12/23 17:27:36 (5.0) (9682): Request to run on slot1_1@xxxxxxxxxxxxxxxxx <172.18.0.4:9618?CCBID=172.18.0.2:9618%3faddrs%3d172.18.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#102&PrivNet=condor.test.node1&addrs=172.18.0.4-9618&alias=condor.test.node1&noUDP&sock=startd_157_4475> was ACCEPTED
    04/12/23 17:27:36 (5.0) (9682): File transfer completed successfully.
    04/12/23 17:27:40 (5.0) (9682): File transfer completed successfully.
    04/12/23 17:27:42 (5.0) (9682): Job 5.0 terminated: exited with status 2
    04/12/23 17:27:42 (5.0) (9682): Reporting job exit reason 100 and attempting to fetch new job.
    04/12/23 17:27:42 (5.0) (9682): **** condor_shadow (condor_SHADOW) pid 9682 EXITING WITH STATUS 100
  • I tried variants of this with other docker images than the python one, with the same result
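The message in test_docker.err and the "exited with status 2" line in the ShadowLog match exactly what CPython emits when the script it is asked to run is absent from its working directory. A quick reproduction outside HTCondor (plain Python, no condor involved; the empty temporary directory stands in for the job's scratch directory inside the container):

```python
# Reproduce the failure mode from test_docker.err: ask the interpreter to run
# a script that does not exist in its working directory.
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Same invocation shape as the submit file (executable = python,
    # arguments = test.py), but with no test.py present - as if the
    # transferred input file never made it into the container.
    proc = subprocess.run(
        [sys.executable, "test.py"],
        cwd=scratch,
        capture_output=True,
        text=True,
    )

print(proc.returncode)       # 2, matching the ShadowLog's exit status
print(proc.stderr.strip())   # "can't open file ... [Errno 2] ..."
```

So even though the shadow logs "File transfer completed successfully", test.py does not appear to be present in the job's working directory inside the container when the interpreter starts.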

 

I have unit tests that depend on this docker pool to run in a GitLab CI, and they obviously can't pass if every job fails to transfer its files.

Thanks,

 

Gaëtan

 


Gaetan Geffroy
Junior Software Engineer, Space

Terma GmbH
Europaarkaden II, Bratustraße 7, 64293 Darmstadt, Germany
T +49 6151 86005 43 (direct) | T +49 6151 86005-0
Terma GmbH - Sitz Darmstadt | Handelsregister Nr.: HRB 7411, Darmstadt
Geschäftsführer: Poul Vigh / Steen Vejby Sørensen
www.terma.com
LinkedIn | Twitter | Instagram | Youtube


Attention:
This e-mail (and attachment(s), if any) - intended for the addressee(s) only - may contain confidential, copyright, or legally privileged information or material, and no one else is authorized to read, print, store, copy, forward, or otherwise use or disclose any part of its contents or attachment(s) in any form. If you have received this e-mail in error, please notify me by telephone or return e-mail, and delete this e-mail and attachment(s). Thank you.