[Condor-users] Problems with version 7.4.2 and 7.4.3


A while ago we installed Condor version 7.4.2 in our pool but couldn't get it to work. We have now upgraded to version 7.4.3, but the problems remain:

When one of our users submits a job, the condor_submit command hangs. The job is then marked as running, but it doesn't actually run (when we check the machine where it is supposed to be running, there's nothing there). When we try to kill the job, it goes into the X state, and using -forcex causes the schedd to crash (condor_q stops working).

This doesn't happen every time: sometimes it affects the first job submitted, other times the second...

Checking the log files, we found the following error message in the ShadowLog of our dedicated scheduler/central manager:

09/27 18:28:08 (1.0) (24662): FileTransfer::Init(): mkdir(/usr/local/condor/spool/cluster1.proc0.subproc0) failed, Permission denied (errno: 13)

This seems strange to me, because upon installation the directory /usr/local/condor/spool is created with the permissions drwxr-xr-x. Why does Condor then attempt to write to this directory when it is not able to? Moreover, the jobs sometimes run fine in spite of this error, which makes me suspect that this may not be the only problem.
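
To make the permission question concrete, here is a minimal Python sketch of how the standard Unix mode bits decide whether a non-root process may create a subdirectory. It is only an illustration under assumptions: the function names are made up for this example, I do not know which user the shadow actually runs as, and it ignores ACLs and root's ability to bypass the check. With drwxr-xr-x, only the directory owner has write permission, so a mkdir attempted under any other non-root UID would fail with errno 13.

```python
import os
import stat


def can_create_in(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a non-root process (uid, gids) may create entries
    in a directory with the given mode and ownership.

    Creating an entry needs both write and search (execute) permission
    on the directory. Only one permission class applies: owner, else
    group, else other. (Hypothetical helper; ignores ACLs and root.)
    """
    if uid == owner_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (mode & need) == need


def describe(path):
    """Return (mode string, owner uid, owner gid) for an existing path."""
    st = os.stat(path)
    return stat.filemode(st.st_mode), st.st_uid, st.st_gid


if __name__ == "__main__":
    # Assumed location, taken from the error message above.
    spool = "/usr/local/condor/spool"
    if os.path.isdir(spool):
        print(describe(spool))
```

Comparing the output of describe() for the spool directory with the UID that the failing process runs under would show whether this explains the error.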

Can anyone help?

Thanks in advance

Diana Lousa
PhD student
Protein Modeling Laboratory
Oeiras, Portugal