Hi guys,
I'm trying to run a job in the Docker universe, but the starter fails to obtain a file lock and the job stays Idle. Here is the job log from the submit host (10.10.10.3):
------------------------------------------------------------------------------------------------------
000 (008.000.000) 08/30 03:15:19 Job submitted from host: <10.10.10.3:8080?addrs=10.10.10.3-8080>
...
001 (008.000.000) 08/30 03:15:21 Job executing on host: <10.10.10.5:4755?addrs=10.10.10.5-4755>
...
022 (008.000.000) 08/30 03:15:21 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to server2 <10.10.10.5:4755?addrs=10.10.10.5-4755>
...
024 (008.000.000) 08/30 03:15:21 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to server2, rescheduling job
------------------------------------------------------------------------------------------------------
And here is /var/log/condor/StarterLog from the execute host (10.10.10.5):
------------------------------------------------------------------------------------------------------
08/30/16 03:19:19 (pid:4653) Communicating with shadow <10.10.10.3:27654?addrs=10.10.10.3-27654&noUDP>
08/30/16 03:19:19 (pid:4653) Submitting machine is "10.10.10.3"
08/30/16 03:19:19 (pid:4653) setting the orig job name in starter
08/30/16 03:19:19 (pid:4653) setting the orig job iwd in starter
08/30/16 03:19:19 (pid:4653) Chirp config summary: IO false, Updates false, Delayed updates true.
08/30/16 03:19:19 (pid:4653) Initialized IO Proxy.
08/30/16 03:19:19 (pid:4653) Done setting resource limits
08/30/16 03:19:19 (pid:4653) File transfer completed successfully.
08/30/16 03:19:20 (pid:4653) Job 8.0 set to execute immediately
08/30/16 03:19:20 (pid:4653) Starting a VANILLA universe job with ID: 8.0
08/30/16 03:19:20 (pid:4653) Output file: /var/lib/condor/execute/dir_4653/_condor_stdout
08/30/16 03:19:20 (pid:4653) Error file: /var/lib/condor/execute/dir_4653/_condor_stderr
08/30/16 03:19:20 (pid:4653) lock_file returning ERROR, errno=9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) Found 1 entries in docker image cache.
08/30/16 03:19:20 (pid:4653) lock_file returning ERROR, errno=9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
08/30/16 03:19:20 (pid:4653) Create_Process(/usr/bin/docker): child failed because PRIV_CONDOR_FINAL process was still root before exec()
08/30/16 03:19:20 (pid:4653) Create_Process() failed.
08/30/16 03:19:20 (pid:4653) DockerAPI::run( haskell, alex, ... ) failed with return value -1
08/30/16 03:19:20 (pid:4653) Failed to start job, exiting
08/30/16 03:19:20 (pid:4653) ShutdownFast all jobs.
08/30/16 03:19:20 (pid:4653) **** condor_starter (condor_STARTER) pid 4653 EXITING WITH STATUS 0
------------------------------------------------------------------------------------------------------
Condor version: $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $ $CondorPlatform: x86_64_RedHat7 $