
[HTCondor-users] Write errors on secondary disk



So, an overview:


I have 3 machines in a Condor cluster: herc0, herc1, and starscream. All of them mount home directories from a fourth machine, optimus; on each of the Condor machines the home directories are mounted at /nfs/optimus/home/ . herc0 has a secondary drive mounted locally as /local_data0, which is also mounted on all machines as /nfs/data_disks/herc0b .
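
For reference, the client-side mounts look roughly like this (the mount options here are illustrative, not copied from the actual fstab):

optimus:/home        /nfs/optimus/home        nfs  defaults  0 0
herc0:/local_data0   /nfs/data_disks/herc0b   nfs  defaults  0 0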


Using the tutorial program, simple.c, I can successfully run the jobs from my home directory: all cores get used, and every job writes its output to disk. If I cd into a directory under /nfs/data_disks/herc0b, I get the following warnings when submitting:


[zdhughes@herc0 zdhughes]$ condor_submit submit 
Submitting job(s)..............................
30 job(s) submitted to cluster 81.

WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.error is not writable by condor.
WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.out is not writable by condor.
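
For reference, the submit file is essentially the one from the tutorial; reconstructed from the warnings above and the ShadowLog below (the executable name and arguments are just the tutorial's, the file names, universe, and job count are confirmed by the messages), it is roughly:

Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue 30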

And the ShadowLog has:



09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30540
09/09/16 18:50:42 ** Log last touched 9/9 18:47:43
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources: 
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30541
09/09/16 18:50:42 ** Log last touched 9/9 18:50:42
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources: 
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:19411> on TCP (ReliSock).
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:38254> on TCP (ReliSock).
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411&noUDP>
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254&noUDP>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254>
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.1
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.0
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Job 81.1 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Job 81.0 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 ******************************************************

and so on for all 30 instances of the job. On herc0 I have an entry for the disk in my exports file:


/local_data0 herc*.lexas(rw) starscream.lexas(rw)


and I can read and write to the disk as a regular user. I have chmod 777'd the directory zdhughes, which is where the program lives and where the files are written. Extending this up the tree so that the parent directories also have full rwx access does nothing. Additionally, herc0 has a local account, labuser. When the (vanilla) job is submitted from that account's home directory, the jobs on herc0 run normally (the jobs on the other machines hold, as expected); but if I run the job locally from /local_data0/users/labuser/ I get the same thing:


09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: safe_open_wrapper("/local_data0/users/labuser/simple.log") failed - errno 13 (Permission denied)
09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: failed to open file /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Job 82.0 going into Hold state (code 22,0): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): RemoteResource::killStarter(): DCStartd object NULL!
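
For completeness, these are roughly the checks I ran as a normal user on herc0 to convince myself the permissions themselves are fine (commands approximate):

ls -ld /nfs/data_disks/herc0b/users/zdhughes            # drwxrwxrwx after the chmod 777
touch /nfs/data_disks/herc0b/users/zdhughes/testfile    # succeeds as zdhughes
chmod 777 /nfs/data_disks/herc0b/users /nfs/data_disks/herc0b   # opening up the parents too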

Any ideas?


Thanks,


Zach