
[HTCondor-users] Docker centos lock error



Hi guys,

I'm trying to run a job in the Docker universe, but it fails to create a lock file, so the job remains idle. Here is the job log from the submit host (10.10.10.3):

------------------------------------------------------------------------------------------------------
000 (008.000.000) 08/30 03:15:19 Job submitted from host: <10.10.10.3:8080?addrs=10.10.10.3-8080>
...
001 (008.000.000) 08/30 03:15:21 Job executing on host: <10.10.10.5:4755?addrs=10.10.10.5-4755>
...
022 (008.000.000) 08/30 03:15:21 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to server2 <10.10.10.5:4755?addrs=10.10.10.5-4755>
...
024 (008.000.000) 08/30 03:15:21 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to server2, rescheduling job
------------------------------------------------------------------------------------------------------

And here is /var/log/condor/StarterLog on the execute host (10.10.10.5):

------------------------------------------------------------------------------------------------------
08/30/16 03:19:19 (pid:4653) Communicating with shadow <10.10.10.3:27654?addrs=10.10.10.3-27654&noUDP>
08/30/16 03:19:19 (pid:4653) Submitting machine is "10.10.10.3"
08/30/16 03:19:19 (pid:4653) setting the orig job name in starter
08/30/16 03:19:19 (pid:4653) setting the orig job iwd in starter
08/30/16 03:19:19 (pid:4653) Chirp config summary: IO false, Updates false, Delayed updates true.
08/30/16 03:19:19 (pid:4653) Initialized IO Proxy.
08/30/16 03:19:19 (pid:4653) Done setting resource limits
08/30/16 03:19:19 (pid:4653) File transfer completed successfully.
08/30/16 03:19:20 (pid:4653) Job 8.0 set to execute immediately
08/30/16 03:19:20 (pid:4653) Starting a VANILLA universe job with ID: 8.0
08/30/16 03:19:20 (pid:4653) Output file: /var/lib/condor/execute/dir_4653/_condor_stdout
08/30/16 03:19:20 (pid:4653) Error file: /var/lib/condor/execute/dir_4653/_condor_stderr
08/30/16 03:19:20 (pid:4653) lock_file returning ERROR, errno=9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) Found 1 entries in docker image cache.
08/30/16 03:19:20 (pid:4653) lock_file returning ERROR, errno=9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) Create_Process(/usr/bin/docker): child failed because PRIV_CONDOR_FINAL process was still root before exec()
08/30/16 03:19:20 (pid:4653) Create_Process() failed.
08/30/16 03:19:20 (pid:4653) DockerAPI::run( haskell, alex, ... ) failed with return value -1
08/30/16 03:19:20 (pid:4653) Failed to start job, exiting
08/30/16 03:19:20 (pid:4653) ShutdownFast all jobs.
08/30/16 03:19:20 (pid:4653) **** condor_starter (condor_STARTER) pid 4653 EXITING WITH STATUS 0
------------------------------------------------------------------------------------------------------
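
For anyone triaging a log like the one above, the failing lines can be pulled out with a short script. This is only a minimal Python sketch, not part of HTCondor: the function name `failure_lines` and the list of marker substrings are assumptions chosen from this excerpt, and the line format is assumed to match the StarterLog lines shown here.

```python
import re

# Matches StarterLog lines of the form seen in the excerpt:
# "08/30/16 03:19:20 (pid:4653) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

# Substrings treated as failure indicators (an assumption based on this excerpt).
FAILURE_MARKERS = ("ERROR", "failed", "Failed")


def failure_lines(lines):
    """Return (timestamp, pid, message) tuples for lines that look like failures."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip separators and lines in another format
        message = match.group(3)
        if any(marker in message for marker in FAILURE_MARKERS):
            hits.append((match.group(1), int(match.group(2)), message))
    return hits
```

Passing it an open file, for example `failure_lines(open("/var/log/condor/StarterLog", errors="replace"))`, would return the `lock_file`, `FileLock::obtain`, `Create_Process`, `DockerAPI::run` and "Failed to start job" lines from the excerpt, in order.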

Condor version: $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $ $CondorPlatform: x86_64_RedHat7 $
Docker Version: Docker version 1.12.1, build 23cf638

Do you know how to solve this problem?

Thanks in advance,
Carlos R