
[HTCondor-users] File permission problem that I don't understand.



Hi,

I'm trying to use HTCondor to replace our unsupported Xgrid setup. I started one controller on an OS X 10.11 box and one executor on an OS X 10.8 box. I'm submitting the io example, and both jobs end up held with this:

ID      OWNER          HELD_SINCE  HOLD_REASON
  1.0   condor          5/10 15:00 Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.1.170 failed to send file(s) to <192.168.1.197:9618>; SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
  2.0   condor          5/10 15:00 Error from slot2@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.1.170 failed to send file(s) to <192.168.1.197:9618>; SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied

The condor user exists and has access to /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/. From some Googling, it seems this problem can happen if UID_DOMAIN is not the same on both machines, but it is. I even ran chmod 777 /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples in case it's nobody who tries to create the file, but nothing changed.
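
One thing the chmod on the directory does not cover is the target itself: the error names examples/tmp, and if something called tmp already exists there (a directory, or a file owned by another user), opening it for writing could fail whatever the directory mode is. That is only a guess to rule out. Below is a minimal Python sketch for checking it on the controller, run as the submitting user. `describe_target` is a made-up helper name, not an HTCondor tool, and it only looks at ordinary POSIX ownership and mode bits.

```python
import os
import stat
import sys


def describe_target(path):
    """Collect the facts that decide whether `path` can be opened for writing.

    Hypothetical helper for illustration; nothing here is HTCondor-specific.
    """
    parent = os.path.dirname(os.path.abspath(path))
    info = {
        "path": path,
        "exists": os.path.exists(path),
        # Creating a new file needs write + search permission on the parent.
        # os.access() checks against the real uid/gid of this process.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
    }
    if info["exists"]:
        st = os.stat(path)
        info["is_dir"] = stat.S_ISDIR(st.st_mode)
        info["owner_uid"] = st.st_uid
        info["mode"] = stat.filemode(st.st_mode)  # e.g. "-rw-r--r--"
        info["writable"] = os.access(path, os.W_OK)
    return info


if __name__ == "__main__":
    # Pass the path from the hold reason as the first argument.
    for key, value in describe_target(sys.argv[1]).items():
        print(f"{key}: {value}")
```

If `exists` is true and `is_dir` is true, or `owner_uid` is not the uid of the condor user, that would point at the existing tmp entry rather than at the examples directory.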

StartLog on the execution box:

05/10/16 15:08:48 slot1: Request accepted.
05/10/16 15:08:49 slot1: Remote owner is condor@druide
05/10/16 15:08:49 slot1: State change: claiming protocol successful
05/10/16 15:08:49 slot1: Changing state: Unclaimed -> Claimed
05/10/16 15:08:49 slot2: Request accepted.
05/10/16 15:08:49 slot2: Remote owner is condor@druide
05/10/16 15:08:49 slot2: State change: claiming protocol successful
05/10/16 15:08:49 slot2: Changing state: Unclaimed -> Claimed
05/10/16 15:08:49 slot1: Got activate_claim request from shadow (192.168.1.197)
05/10/16 15:08:49 slot1: Remote job ID is 1.0
05/10/16 15:08:49 slot1: Got universe "VANILLA" (5) from request classad
05/10/16 15:08:49 slot1: State change: claim-activation protocol successful
05/10/16 15:08:49 slot1: Changing activity: Idle -> Busy
05/10/16 15:08:49 slot2: Got activate_claim request from shadow (192.168.1.197)
05/10/16 15:08:49 slot2: Remote job ID is 2.0
05/10/16 15:08:49 slot2: Got universe "VANILLA" (5) from request classad
05/10/16 15:08:49 slot2: State change: claim-activation protocol successful
05/10/16 15:08:49 slot2: Changing activity: Idle -> Busy
05/10/16 15:08:50 slot2: Called deactivate_claim_forcibly()
05/10/16 15:08:50 slot1: Called deactivate_claim_forcibly()
05/10/16 15:08:50 slot2: State change: received RELEASE_CLAIM command
05/10/16 15:08:50 slot2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
05/10/16 15:08:50 slot1: State change: received RELEASE_CLAIM command
05/10/16 15:08:50 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
05/10/16 15:08:50 Starter pid 307 exited with status 0
05/10/16 15:08:50 slot1: State change: starter exited
05/10/16 15:08:50 slot1: State change: No preempting claim, returning to owner
05/10/16 15:08:50 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
05/10/16 15:08:50 slot1: State change: IS_OWNER is false
05/10/16 15:08:50 slot1: Changing state: Owner -> Unclaimed
05/10/16 15:08:50 Starter pid 309 exited with status 0
05/10/16 15:08:50 slot2: State change: starter exited
05/10/16 15:08:50 slot2: State change: No preempting claim, returning to owner
05/10/16 15:08:50 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
05/10/16 15:08:50 slot2: State change: IS_OWNER is false
05/10/16 15:08:50 slot2: Changing state: Owner -> Unclaimed

StarterLog.slot1 on the execution box:

05/10/16 15:08:49 (pid:307) Can't open directory "/usr/local/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/10/16 15:08:49 (pid:307) Cannot open /usr/local/condor/config: No such file or directory
05/10/16 15:08:49 (pid:307) ******************************************************
05/10/16 15:08:49 (pid:307) ** condor_starter (CONDOR_STARTER) STARTING UP
05/10/16 15:08:49 (pid:307) ** /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/sbin/condor_starter
05/10/16 15:08:49 (pid:307) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
05/10/16 15:08:49 (pid:307) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
05/10/16 15:08:49 (pid:307) ** $CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
05/10/16 15:08:49 (pid:307) ** $CondorPlatform: x86_64_MacOSX7 $
05/10/16 15:08:49 (pid:307) ** PID = 307
05/10/16 15:08:49 (pid:307) ** Log last touched 5/10 15:00:18
05/10/16 15:08:49 (pid:307) ******************************************************
05/10/16 15:08:49 (pid:307) Using config source: /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/etc/condor_config
05/10/16 15:08:49 (pid:307) Using local config sources: 
05/10/16 15:08:49 (pid:307)    /usr/local/condor/condor_config.local
05/10/16 15:08:49 (pid:307) config Macros = 61, Sorted = 60, StringBytes = 1641, TablesBytes = 2244
05/10/16 15:08:49 (pid:307) CLASSAD_CACHING is OFF
05/10/16 15:08:49 (pid:307) Daemon Log is logging: D_ALWAYS D_ERROR
05/10/16 15:08:49 (pid:307) SharedPortEndpoint: waiting for connections to named socket 217_a562_5
05/10/16 15:08:49 (pid:307) DaemonCore: command socket at <192.168.1.170:9618?addrs=192.168.1.170-9618+[--1]-9618&noUDP&sock=217_a562_5>
05/10/16 15:08:49 (pid:307) DaemonCore: private command socket at <192.168.1.170:9618?addrs=192.168.1.170-9618+[--1]-9618&noUDP&sock=217_a562_5>
05/10/16 15:08:49 (pid:307) GLEXEC_JOB not supported on this platform; ignoring
05/10/16 15:08:49 (pid:307) Communicating with shadow <192.168.1.197:9618?addrs=192.168.1.197-9618+[--1]-9618&noUDP&sock=1899_c466_9>
05/10/16 15:08:49 (pid:307) Submitting machine is "condor-maitre.druide"
05/10/16 15:08:49 (pid:307) setting the orig job name in starter
05/10/16 15:08:49 (pid:307) setting the orig job iwd in starter
05/10/16 15:08:49 (pid:307) Chirp config summary: IO false, Updates false, Delayed updates true.
05/10/16 15:08:49 (pid:307) Initialized IO Proxy.
05/10/16 15:08:49 (pid:307) Setting resource limits not implemented!
05/10/16 15:08:49 (pid:307) File transfer completed successfully.
05/10/16 15:08:50 (pid:307) Job 1.0 set to execute immediately
05/10/16 15:08:50 (pid:307) Starting a VANILLA universe job with ID: 1.0
05/10/16 15:08:50 (pid:307) IWD: /usr/local/condor/execute/dir_307
05/10/16 15:08:50 (pid:307) Output file: /usr/local/condor/execute/dir_307/_condor_stdout
05/10/16 15:08:50 (pid:307) Error file: /usr/local/condor/execute/dir_307/_condor_stderr
05/10/16 15:08:50 (pid:307) Renice expr "0" evaluated to 0
05/10/16 15:08:50 (pid:307) About to exec /usr/local/condor/execute/dir_307/condor_exec.exe 5
05/10/16 15:08:50 (pid:307) Running job as user condor
05/10/16 15:08:50 (pid:307) Create_Process succeeded, pid=318
05/10/16 15:08:50 (pid:307) Process exited, pid=318, status=0
05/10/16 15:08:50 (pid:307) DoUpload: (Condor error code 12, subcode 13) STARTER at 192.168.1.170 failed to send file(s) to <192.168.1.197:9618>; SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
05/10/16 15:08:50 (pid:307) JICShadow::notifyJobTermination(): Sending mock terminate event.
05/10/16 15:08:50 (pid:307) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/10/16 15:08:50 (pid:307) Returning from CStarter::JobReaper()
05/10/16 15:08:50 (pid:307) Got SIGQUIT.  Performing fast shutdown.
05/10/16 15:08:50 (pid:307) ShutdownFast all jobs.
05/10/16 15:08:50 (pid:307) Lost connection to shadow, waiting 2400 secs for reconnect
05/10/16 15:08:50 (pid:307) Failed to send job exit status to shadow
05/10/16 15:08:50 (pid:307) **** condor_starter (condor_STARTER) pid 307 EXITING WITH STATUS 0


ShadowLog on the controller:

05/10/16 15:08:48 Can't open directory "/usr/local/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/10/16 15:08:48 Cannot open /usr/local/condor/config: No such file or directory
05/10/16 15:08:48 ******************************************************
05/10/16 15:08:48 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/10/16 15:08:48 ** /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/sbin/condor_shadow
05/10/16 15:08:48 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/10/16 15:08:48 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/10/16 15:08:48 ** $CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
05/10/16 15:08:48 ** $CondorPlatform: x86_64_MacOSX7 $
05/10/16 15:08:48 ** PID = 2262
05/10/16 15:08:48 ** Log last touched 5/10 15:00:18
05/10/16 15:08:48 ******************************************************
05/10/16 15:08:48 Using config source: /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/etc/condor_config
05/10/16 15:08:48 Using local config sources: 
05/10/16 15:08:48    /usr/local/condor/condor_config.local
05/10/16 15:08:48 config Macros = 59, Sorted = 59, StringBytes = 1665, TablesBytes = 992
05/10/16 15:08:48 CLASSAD_CACHING is OFF
05/10/16 15:08:48 Daemon Log is logging: D_ALWAYS D_ERROR
05/10/16 15:08:48 SharedPortEndpoint: waiting for connections to named socket 1899_c466_9
05/10/16 15:08:48 DaemonCore: command socket at <192.168.1.197:9618?addrs=192.168.1.197-9618+[--1]-9618&noUDP&sock=1899_c466_9>
05/10/16 15:08:48 DaemonCore: private command socket at <192.168.1.197:9618?addrs=192.168.1.197-9618+[--1]-9618&noUDP&sock=1899_c466_9>
05/10/16 15:08:48 Initializing a VANILLA shadow for job 1.0
05/10/16 15:08:49 (1.0) (2262): Request to run on slot1@xxxxxxxxxxxxxxxxxxxx <192.168.1.170:9618?addrs=192.168.1.170-9618+[--1]-9618&noUDP&sock=211_3756_4> was ACCEPTED
05/10/16 15:08:49 Can't open directory "/usr/local/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
05/10/16 15:08:49 Cannot open /usr/local/condor/config: No such file or directory
05/10/16 15:08:49 ******************************************************
05/10/16 15:08:49 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/10/16 15:08:49 ** /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/sbin/condor_shadow
05/10/16 15:08:49 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
05/10/16 15:08:49 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/10/16 15:08:49 ** $CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
05/10/16 15:08:49 ** $CondorPlatform: x86_64_MacOSX7 $
05/10/16 15:08:49 ** PID = 2264
05/10/16 15:08:49 ** Log last touched 5/10 15:08:49
05/10/16 15:08:49 ******************************************************
05/10/16 15:08:49 Using config source: /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/etc/condor_config
05/10/16 15:08:49 Using local config sources: 
05/10/16 15:08:49    /usr/local/condor/condor_config.local
05/10/16 15:08:49 config Macros = 59, Sorted = 59, StringBytes = 1665, TablesBytes = 992
05/10/16 15:08:49 CLASSAD_CACHING is OFF
05/10/16 15:08:49 Daemon Log is logging: D_ALWAYS D_ERROR
05/10/16 15:08:49 SharedPortEndpoint: waiting for connections to named socket 1899_c466_10
05/10/16 15:08:49 DaemonCore: command socket at <192.168.1.197:9618?addrs=192.168.1.197-9618+[--1]-9618&noUDP&sock=1899_c466_10>
05/10/16 15:08:49 DaemonCore: private command socket at <192.168.1.197:9618?addrs=192.168.1.197-9618+[--1]-9618&noUDP&sock=1899_c466_10>
05/10/16 15:08:49 Initializing a VANILLA shadow for job 2.0
05/10/16 15:08:49 (2.0) (2264): Request to run on slot2@xxxxxxxxxxxxxxxxxxxx <192.168.1.170:9618?addrs=192.168.1.170-9618+[--1]-9618&noUDP&sock=211_3756_4> was ACCEPTED
05/10/16 15:08:49 (1.0) (2262): File transfer completed successfully.
05/10/16 15:08:49 (2.0) (2264): File transfer completed successfully.
05/10/16 15:08:50 (2.0) (2264): get_file(): Failed to open file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp, errno = 13: Permission denied.
05/10/16 15:08:50 (2.0) (2264): get_file(): consumed 16384 bytes of file transmission
05/10/16 15:08:50 (1.0) (2262): get_file(): Failed to open file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp, errno = 13: Permission denied.
05/10/16 15:08:50 (2.0) (2264): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
05/10/16 15:08:50 (1.0) (2262): get_file(): consumed 16384 bytes of file transmission
05/10/16 15:08:50 (1.0) (2262): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
05/10/16 15:08:50 (2.0) (2264): Mock terminating job 2.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
05/10/16 15:08:50 (1.0) (2262): Mock terminating job 1.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
05/10/16 15:08:50 (2.0) (2264): File transfer failed (status=0).
05/10/16 15:08:50 (1.0) (2262): File transfer failed (status=0).
05/10/16 15:08:50 (2.0) (2264): Job 2.0 going into Hold state (code 12,13): Error from slot2@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.1.170 failed to send file(s) to <192.168.1.197:9618>; SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
05/10/16 15:08:50 (1.0) (2262): Job 1.0 going into Hold state (code 12,13): Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.1.170 failed to send file(s) to <192.168.1.197:9618>; SHADOW at 192.168.1.197 failed to write to file /Users/condor/condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: (errno 13) Permission denied
05/10/16 15:08:50 (2.0) (2264): **** condor_shadow (condor_SHADOW) pid 2264 EXITING WITH STATUS 112
05/10/16 15:08:50 (1.0) (2262): **** condor_shadow (condor_SHADOW) pid 2262 EXITING WITH STATUS 112
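
Reading the hold reason piece by piece: the daemon that fails to write is the SHADOW at 192.168.1.197, so the permission error is raised on the controller while it receives the job's output, not on the execution box. A small Python sketch that pulls those fields out of the message follows. The regular expression is written against the exact wording seen above and is an assumption, not a documented HTCondor format; `parse_write_failure` is a hypothetical name.

```python
import errno
import re

# Pattern modelled on the hold reason quoted in this message (assumption:
# other HTCondor versions may word it differently).
WRITE_FAILURE = re.compile(
    r"(?P<role>SHADOW|STARTER) at (?P<host>\S+) "
    r"failed to write to file (?P<path>\S+): "
    r"\(errno (?P<errno>\d+)\) (?P<message>.+)$"
)


def parse_write_failure(hold_reason):
    """Return which daemon, on which host, failed to write which file."""
    match = WRITE_FAILURE.search(hold_reason)
    if match is None:
        return None
    fields = match.groupdict()
    fields["errno"] = int(fields["errno"])
    # Symbolic name of the error number on this platform, if it has one.
    fields["errno_name"] = errno.errorcode.get(fields["errno"])
    return fields


if __name__ == "__main__":
    reason = (
        "Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.1.170 "
        "failed to send file(s) to <192.168.1.197:9618>; SHADOW at "
        "192.168.1.197 failed to write to file /Users/condor/"
        "condor-8.5.4-x86_64_MacOSX7-stripped/examples/tmp: "
        "(errno 13) Permission denied"
    )
    print(parse_write_failure(reason))
```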