
[HTCondor-users] Write errors on secondary disk



So, an overview:


I have 3 machines in a Condor cluster: herc0, herc1, and starscream. All of them mount home directories from a fourth machine, optimus; on each of the Condor machines the home directories are mounted at /nfs/optimus/home/ . herc0 has a secondary drive mounted locally as /local_data0, which is also mounted on all machines as /nfs/data_disks/herc0b .
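
For reference, the client-side mounts look roughly like this (the mount options here are illustrative, not copied from the actual fstab):

optimus:/home        /nfs/optimus/home        nfs  defaults  0 0
herc0:/local_data0   /nfs/data_disks/herc0b   nfs  defaults  0 0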


Using the tutorial program, simple.c, I can successfully run the jobs from my home directory: all cores get used, and every job writes its output to disk. If I cd into a directory under /nfs/data_disks/herc0b, I get the following warnings when submitting:


[zdhughes@herc0 zdhughes]$ condor_submit submit 
Submitting job(s)..............................
30 job(s) submitted to cluster 81.

WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.error is not writable by condor.
WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.out is not writable by condor.
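
For reference, the submit file is essentially the one from the tutorial; reconstructed from the warnings above and the ShadowLog below (the executable name and arguments are just the tutorial's, the file names, universe, and job count are confirmed by the messages), it is roughly:

Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue 30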

And the ShadowLog has:



09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30540
09/09/16 18:50:42 ** Log last touched 9/9 18:47:43
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources: 
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30541
09/09/16 18:50:42 ** Log last touched 9/9 18:50:42
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources: 
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:19411> on TCP (ReliSock).
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:38254> on TCP (ReliSock).
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411&noUDP>
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254&noUDP>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254>
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.1
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.0
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Job 81.1 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Job 81.0 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 ******************************************************

and so on for all 30 instances of the job. On herc0 I have an entry for the disk in my exports file:


/local_data0 herc*.lexas(rw) starscream.lexas(rw)


and I can read and write to the disk as a regular user. I have chmod 777'd the directory zdhughes, which is where the program lives and where the files are written. Extending this up the tree so that the parent directories also have full rwx access does nothing. Additionally, herc0 has a local account, labuser. When the (vanilla) job is submitted from that account's home directory, the jobs on herc0 run normally (the jobs on the other machines hold, as expected); but if I run the job locally from /local_data0/users/labuser/ I get the same thing:


09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: safe_open_wrapper("/local_data0/users/labuser/simple.log") failed - errno 13 (Permission denied)
09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: failed to open file /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Job 82.0 going into Hold state (code 22,0): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): RemoteResource::killStarter(): DCStartd object NULL!
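
For completeness, these are roughly the checks I ran as a normal user on herc0 to convince myself the permissions themselves are fine (commands approximate):

ls -ld /nfs/data_disks/herc0b/users/zdhughes            # drwxrwxrwx after the chmod 777
touch /nfs/data_disks/herc0b/users/zdhughes/testfile    # succeeds as zdhughes
chmod 777 /nfs/data_disks/herc0b/users /nfs/data_disks/herc0b   # opening up the parents too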

Any ideas?


Thanks,


Zach