
[HTCondor-users] 'permission denied' brought some jobs to H status



Dear Condor experts:

On svr019 (CentOS 6.6, kernel 2.6.32-504.16.2.el6.x86_64) we are running ARC-CE 5.0.0 with condor-8.2.2-265643.x86_64 as the backend. Occasionally some jobs are put on hold because of a 'permission denied' error:

svr019:/home/atlas/atlas003# condor_q -analyze 3155
svr019.gla.scotgrid.ac.uk : <130.209.239.19:56581> : svr019.gla.scotgrid.ac.uk
---
3155.000:  Request is held.

Hold reason: Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied

  This only happens to a few jobs out of hundreds, so it does not seem to be a general security issue.

In the log I can see:

012 (3155.000.000) 06/04 02:55:16 Job was held.
        Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied
        Code 12 Subcode 13
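
As far as I understand the hold codes, code 12 marks an error while transferring job output files, and the subcode is the Unix errno (13, EACCES), which matches the message text. A minimal sketch for following this up on the submit host: it extracts the path the shadow could not write from the hold reason, shows the ownership and mode of its directory if that still exists, and lists other held jobs with the same codes. The variable names are only illustrative, and the constraint assumes the standard job ClassAd attributes JobStatus, HoldReasonCode and HoldReasonSubCode.

```shell
#!/bin/sh
# Hold reason as reported for job 3155 above.
reason='Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied'

# Extract the file the shadow could not write, and its parent directory.
target=$(printf '%s\n' "$reason" | sed -n 's/.*failed to write to file \(.*\): (errno [0-9][0-9]*).*/\1/p')
session_dir=$(dirname "$target")
echo "file: $target"
echo "dir:  $session_dir"

# Show owner, group and mode of the session directory, if it still exists.
if [ -d "$session_dir" ]; then
    ls -ld "$session_dir"
fi

# List other held jobs (JobStatus 5) with the same hold code and subcode.
# Only attempted where condor_q is installed.
if command -v condor_q >/dev/null 2>&1; then
    condor_q -constraint 'JobStatus == 5 && HoldReasonCode == 12 && HoldReasonSubCode == 13' \
        -autoformat ClusterId ProcId HoldReason
fi
```

Comparing the owner and mode of an affected session directory with those of a job that finished normally might show whether the two differ.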

The job description (JDL) that ARC prepared for this job is:

# HTCondor job description built by grid-manager
Executable = condorjob.sh
Input = /dev/null
Log = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/log
Output = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment
Error = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment
+NordugridQueue = condor_q2d
Description = arc_pilot
GetEnv = True
Universe = vanilla
Notification = Never
Requirements = (OpSys == "LINUX")
Priority = 0
x509userproxy = /var/spool/arc/jobstatus/job.dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.proxy
request_cpus = 1
+JobTimeLimit = 172800
request_memory=4000
+JobMemoryLimit = 8192000
should_transfer_files = YES
When_to_transfer_output = ON_EXIT_OR_EVICT
Transfer_input_files =  /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/.gahp_complete, /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/runpilot3-wrapper.sh
Periodic_remove = FALSE || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit
Queue

  Any idea where the problem might be?

  Cheers,
  Gang