
[Condor-users] Condor 6.7.20, nfs, and chown problems



Hi,

Below is the last part of Andreas Vetter's "[Condor-users] 6.8.0 and NFS Problem" post from 8/3/2006.

Was this issue ever resolved? Does it even have anything to do with NFS? I keep my jobs' log files on NFS, and I am hitting this problem fairly regularly; it effectively paralyzes my grid: processors get claimed, but the processes never actually run, and the machines sit "Claimed" and "Idle" with 0.000 load averages. On the submitter, I also get a number of hung "condor_scheduniv_exec.30747.0"-like processes in the process list. I try to force the machines back into the "Unclaimed" state by deleting the jobs at the submitter, but that doesn't happen until at least 10 or 15 minutes later.

At that point, I am forced to shut down all Condor daemons, delete all log/spool/execute directories, and start again. Suggestions? :-)

Incidentally, to what extent do users set up Condor jobs in which multiple processes in a cluster write to the same log file? I'm using Condor+DAGMan, and I'm taking care that no two processes are ever launched from the same job file or write to the same log file. Are there any subtleties with DAGMan that I need to worry about when keeping log files on NFS?
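To illustrate the layout I mean (the file names like a.sub and a.log below are just made-up examples, not my real setup): each DAG node gets its own submit file, and each submit file points at its own log, so no two jobs ever share one:

```
# diamond.dag -- one submit file per node
JOB A a.sub
JOB B b.sub
PARENT A CHILD B

# a.sub -- each submit file writes to its own log file
universe   = vanilla
executable = a.sh
log        = a.log
output     = a.out
error      = a.err
queue
```

(b.sub would be analogous, with log = b.log, and so on.)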

Any advice is greatly appreciated.  Thanks!

 - Armen

Andreas Vetter wrote:
Questions:
Is there anything wrong with my setup? I have 6.7.6 on different machines, with a different central manager, that use this same NFS server with no problems.

Is there a change in Condor since 6.7.6 regarding chown-ing files to and from the condor user? I found this in the SchedLog of the client (I changed numbers to names):

8/3 10:04:25 (pid:29713) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0' from condor to vetter.magic
8/3 10:04:25 (pid:29713) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0 from condor to vetter.magic. Job may run into permissions problems when it starts.
8/3 10:04:25 (pid:29713) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0.tmp' from condor to vetter.magic
8/3 10:04:25 (pid:29713) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0.tmp from condor to vetter.magic. Job may run into permissions problems when it starts.
8/3 10:04:25 (pid:29566) Starting add_shadow_birthdate(44.0)
8/3 10:04:25 (pid:29566) Started shadow for job 44.0 on "<132.187.47.29:18672>", (shadow pid = 29714)
8/3 10:04:25 (pid:29566) Shadow pid 29714 for job 44.0 exited with status 100
8/3 10:04:25 (pid:29566) match (<1.2.3.4:18672>#1154544604#60) out of jobs (cluster id 44); relinquishing
8/3 10:04:25 (pid:29566) Sent RELEASE_CLAIM to startd on <132.187.47.29:18672>
8/3 10:04:25 (pid:29566) Match record (<1.2.3.4:18672>, 44, -1) deleted
8/3 10:04:25 (pid:29722) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0' from vetter to condor.condor
8/3 10:04:25 (pid:29722) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0 from vetter to condor.condor. User may run into permissions problems when fetching sandbox.




--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796