
[Condor-users] Cannot execute Job on remote host, permission denied to write condor_exec.exe



Hello All, does anyone have a suggestion for how to get past this issue?

I submit from my negotiator host, and jobs run fine on that host. But
if I force a job to run on some other machine (not the negotiator)
with a requirement in the job submission, e.g.

( machine != "uskyarpds0310.air.ups.com" )

then the job runs for 2 seconds and goes into the Hold state.

condor_q -better says:

-- Submitter: uskyarpds0310.air.ups.com : <10.224.217.231:8452> :
uskyarpds0310.air.ups.com
---
2891.000:  Request is held.

Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxxxxxxx: STARTER at
10.224.176.128 failed to write to file
/opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe:
(errno 13) Permission denied

-------------

In the logs I see:

== Shadow Log on submit machine ==

08/18 17:47:26 Initializing a VANILLA shadow for job 2891.0
08/18 17:47:27 (2891.0) (18621): Request to run on
slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.224.176.128:50433> was ACCEPTED
08/18 17:47:28 (2891.0) (18621): DoUpload: (Condor error code 12,
subcode 13) SHADOW at 10.224.217.231 failed to send file(s) to
<10.224.176.128:53124>; STARTER at 10.224.176.128 failed to write to
file /opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe:
(errno 13) Permission denied
08/18 17:47:28 (2891.0) (18621): Job 2891.0 going into Hold state
(code 12,13): Error from slot1@xxxxxxxxxxxxxxxxxxxxxxxxx: STARTER at
10.224.176.128 failed to write to file
/opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe:
(errno 13) Permission denied
08/18 17:47:28 (2891.0) (18621): **** condor_shadow (condor_SHADOW)
pid 18621 EXITING WITH STATUS 112

== StarterLog.slot1 on the remote execute node ==

08/18 17:47:27 get_file(): Failed to open file
/opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe,
errno = 13: Permission denied.
08/18 17:47:28 get_file(): consumed 18446296 bytes of file transmission
08/18 17:47:28 DoDownload: consuming rest of transfer and failing
after encountering the following error: STARTER at 10.224.176.128
failed to write to file
/opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe:
(errno 13) Permission denied
08/18 17:47:28 WARNING: File
/opt/condor/app/installation/local.compute-node/execute/dir_10143/condor_exec.exe
can not be accessed by Quill file transfer tracking.
08/18 17:47:28 File transfer failed (status=0).
08/18 17:47:28 ERROR "Failed to transfer files" at line 1882 in file
jic_shadow.cpp
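
The same path and errno repeat in the hold reason, the ShadowLog and the StarterLog. A minimal sketch for pulling them out of a message, assuming the wording seen above (other Condor versions may phrase it differently); parse_write_error is a hypothetical helper, not part of Condor:

```python
import errno
import re

# Matches the message format seen in the logs above. This wording is an
# assumption taken from this thread, not a documented Condor format.
WRITE_ERROR = re.compile(
    r"failed to write to file\s+(?P<path>\S+?):\s*\(errno (?P<code>\d+)\)"
)


def parse_write_error(message):
    """Return (path, errno number, symbolic errno name), or None if no match."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = WRITE_ERROR.search(flat)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("path"), code, errno.errorcode.get(code, "UNKNOWN")
```

On Linux, errno 13 maps to EACCES, so the kernel refused the open on permission grounds; the file was not missing and the disk was not full.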

---------
Actually, the failure to write into the execute subdirectory happens
for every transferred file, not just the executable. I see the same
block of messages in StarterLog.slot1 for every file listed in my
submit file's transfer_input_files value.

On the remote execute machine, the permissions for the directory

/opt/condor/app/installation/local.compute-node/execute/dir_10143/

were (from ls -l):

drwxr-xr-x 2 nobody nobody 4096 Aug 18 17:47 dir_10143
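
With mode drwxr-xr-x and owner nobody:nobody, only the owner (or root) can create files in that directory. A minimal sketch of the classic POSIX mode-bit check, ignoring ACLs and SELinux; can_create_files is a hypothetical helper, and the numeric IDs in the usage note are placeholders, not values read from the machine:

```python
import stat


def can_create_files(mode, owner_uid, owner_gid, uid, gids):
    """Check write+search permission on a directory from its mode bits.

    Only one permission class applies: owner, else group, else other.
    ACLs and SELinux are deliberately ignored in this sketch.
    """
    if uid == 0:
        return True  # root normally bypasses mode bits on a local filesystem
    if uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed
```

For example, with a placeholder uid/gid of 99 for nobody, can_create_files(0o755, 99, 99, 99, [99]) is true, while the same call for any other non-root user is false. On a live system the inputs would come from os.stat() on the dir_NNNN directory, so the question becomes which user the starter was running as when it tried to open the file.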

To start condor, I call condor_master as root, and condor has a umask of 0077.
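
For reference, a umask only clears bits from the mode requested at creation time. A small sketch of that arithmetic, assuming a default directory creation request of 0777 (how Condor actually creates dir_NNNN is not shown in these logs); mode_after_umask is a hypothetical helper:

```python
import stat


def mode_after_umask(requested, umask):
    """Permission bits that survive a umask at file or directory creation."""
    return requested & ~umask


def as_ls_string(permission_bits):
    """Render directory permission bits the way ls -l would show them."""
    return stat.filemode(stat.S_IFDIR | permission_bits)
```

Here as_ls_string(mode_after_umask(0o777, 0o077)) gives drwx------, whereas the listing above shows drwxr-xr-x. That suggests the directory mode was not produced by the 0077 umask alone, though this is an inference from the arithmetic and not something the logs confirm.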

The mount command reports the following for the filesystem:
/dev/mapper/vg00-lv_condor_app on /opt/condor/app type ext3 (rw)
It is a local filesystem, not NFS.

All machines are identical: x86_64, running Condor 7.4.2 on RHEL 5.5.

Any requests for further information or suggestions on how to track
down the problem would be greatly appreciated.

Thank You,

Lee