
[Condor-users] Condor loses permissions at random



We have a Linux Condor pool that runs vanilla universe jobs on a file system shared across several servers. We are using Condor version 6.7.20.

 

Occasionally a user submits a job that starts, gets evicted, and makes several attempts on one machine before starting successfully on another machine.

 

In almost all of these cases the cause is a "Permission denied" error (errno 13) when opening the output files:

 

2/15 17:28:08 (fd:13) (pid:23771) Starting a VANILLA universe job with ID: 16935.0

2/15 17:28:08 (fd:13) (pid:23771) In OsProc::OsProc()

2/15 17:28:08 (fd:13) (pid:23771) Main job KillSignal: 15 (SIGTERM)

2/15 17:28:08 (fd:13) (pid:23771) Main job RmKillSignal: 15 (SIGTERM)

2/15 17:28:08 (fd:13) (pid:23771) Main job HoldKillSignal: 15 (SIGTERM)

2/15 17:28:08 (fd:13) (pid:23771) in VanillaProc::StartJob()

2/15 17:28:08 (fd:13) (pid:23771) in OsProc::StartJob()

2/15 17:28:08 (fd:13) (pid:23771) IWD: /work/mb_apps/fi/temp.mb.301.35/mbpa-3.0.1/JBoss-2.4.3_Tomcat-3.2.3/jboss/bin

2/15 17:28:08 (fd:13) (pid:23771) PRIV_CONDOR --> PRIV_USER at os_proc.C:232

2/15 17:28:08 (fd:14) (pid:23771) Input file: /dev/null

2/15 17:28:20 (fd:14) (pid:23771) Failed to open '/work/pre3/fes12/RetroDevelopment/repository/projects/Master/jobs/work/fes13modelEval-3/logs/.condor/condor.out' as standard output: Permission denied (errno 13)

2/15 17:28:20 (fd:14) (pid:23771) Doing CONDOR_ulog

2/15 17:28:20 (fd:14) (pid:23771) Failed to open '/work/pre3/fes12/RetroDevelopment/repository/projects/Master/jobs/work/fes13modelEval-3/logs/.condor/condor.err' as standard error: Permission denied (errno 13)

2/15 17:28:20 (fd:14) (pid:23771) Doing CONDOR_ulog

2/15 17:28:20 (fd:13) (pid:23771) Failed to open some/all of the std files...

2/15 17:28:20 (fd:13) (pid:23771) Aborting OsProc::StartJob.

2/15 17:28:20 (fd:13) (pid:23771) PRIV_USER --> PRIV_CONDOR at os_proc.C:257

2/15 17:28:20 (fd:13) (pid:23771) Failed to start job, exiting

2/15 17:28:20 (fd:13) (pid:23771) ShutdownFast all jobs.

2/15 17:28:20 (fd:13) (pid:23771) Got ShutdownFast when no jobs running.

 

 

 

These files are created on the shared file system by Condor, and when the user logs on they can modify the files themselves.

Furthermore, Condor can generally write the log as soon as it tries a different machine. (Both the machine that produces the Permission denied error and the machine on which the job finally runs change from run to run.)

 

This error occurs seemingly at random: the user can run several similar jobs, and only a small subset will have this problem.
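
To check whether the failures really are random, one option would be to count the "Failed to open" lines in the starter log on each machine. What follows is only a rough sketch, assuming the log lines have the same layout as the excerpt above and are not wrapped in the file itself. The function names (find_failed_opens, tally_by_errno) are made up for illustration and are not part of Condor.

```python
import re
import sys
from collections import Counter

# Matches lines shaped like the excerpt above, for example:
# 2/15 17:28:20 (fd:14) (pid:23771) Failed to open '<path>' as standard output: Permission denied (errno 13)
FAILED_OPEN = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) .*?"
    r"Failed to open '(?P<path>[^']+)' as standard (?P<stream>\w+): "
    r"(?P<reason>.+?) \(errno (?P<errno>\d+)\)"
)


def find_failed_opens(lines):
    """Return one dict per 'Failed to open' line found in the given log lines."""
    hits = []
    for line in lines:
        match = FAILED_OPEN.search(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


def tally_by_errno(hits):
    """Count the failures per errno value (kept as a string, as parsed)."""
    return Counter(hit["errno"] for hit in hits)


if __name__ == "__main__":
    # Usage (hypothetical): python tally_failed_opens.py <log file> [<log file> ...]
    for log_path in sys.argv[1:]:
        with open(log_path, errors="replace") as handle:
            hits = find_failed_opens(handle)
        print(log_path, len(hits), dict(tally_by_errno(hits)))
```

Running this against a copy of each execute machine's log would give a per-machine count, which might show whether some machines fail more often than others or whether the failures cluster at particular times.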

 

Can anyone suggest what I should look at or do to better understand why these permission denied errors are occurring?

 

Is there any information I didn’t include in this email that could help you out?

 

 

Thanks,

Durban
