
[Condor-users] Condor 6.7.20, nfs, and chown problems



Hi,

Below is the last part of Andreas Vetter's "[Condor-users] 6.8.0 and NFS Problem" post from 8/3/2006.

Was this issue ever resolved? Does it even have anything to do with NFS? I keep my jobs' log files on NFS, and I am hitting this problem fairly regularly; it effectively paralyzes my grid: processors get claimed, but the processes never actually run, and the machines sit "Claimed" and "Idle" with 0.000 load averages. On the submitter, I also get a number of hung "condor_scheduniv_exec.30747.0"-like processes in the process list. I try to force the machines back into the "Unclaimed" state by deleting the jobs at the submitter, but that doesn't happen until at least 10 or 15 minutes later.

At that point, I am forced to shut down all Condor daemons, delete all log/spool/execute directories, and start again. Suggestions? :-)

Incidentally, to what extent do users set up Condor jobs in which multiple processes in a cluster write to the same log file? I'm using Condor+DAGMan, and I'm taking care that no two processes are ever launched from the same job file or write to the same log file. Are there any subtleties with DAGMan that I need to worry about when keeping log files on NFS?
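To illustrate the layout I mean (the file names like a.sub and a.log below are just made-up examples, not my real setup): each DAG node gets its own submit file, and each submit file points at its own log, so no two jobs ever share one:

```
# diamond.dag -- one submit file per node
JOB A a.sub
JOB B b.sub
PARENT A CHILD B

# a.sub -- each submit file writes to its own log file
universe   = vanilla
executable = a.sh
log        = a.log
output     = a.out
error      = a.err
queue
```

(b.sub would be analogous, with log = b.log, and so on.)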

Any advice is greatly appreciated.  Thanks!

 - Armen

Andreas Vetter wrote:
Questions:
Is there anything wrong with my setup? I have 6.7.6 on different machines, with a different central manager, that use this same NFS server with no problems.

Is there a change in Condor since 6.7.6 regarding chown-ing files to and from the condor user? I found this in the SchedLog of the client (I changed numbers to names):

8/3 10:04:25 (pid:29713) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0' from condor to vetter.magic
8/3 10:04:25 (pid:29713) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0 from condor to vetter.magic. Job may run into permissions problems when it starts.
8/3 10:04:25 (pid:29713) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0.tmp' from condor to vetter.magic
8/3 10:04:25 (pid:29713) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0.tmp from condor to vetter.magic. Job may run into permissions problems when it starts.
8/3 10:04:25 (pid:29566) Starting add_shadow_birthdate(44.0)
8/3 10:04:25 (pid:29566) Started shadow for job 44.0 on "<132.187.47.29:18672>", (shadow pid = 29714)
8/3 10:04:25 (pid:29566) Shadow pid 29714 for job 44.0 exited with status 100
8/3 10:04:25 (pid:29566) match (<1.2.3.4:18672>#1154544604#60) out of jobs (cluster id 44); relinquishing
8/3 10:04:25 (pid:29566) Sent RELEASE_CLAIM to startd on <132.187.47.29:18672>
8/3 10:04:25 (pid:29566) Match record (<1.2.3.4:18672>, 44, -1) deleted
8/3 10:04:25 (pid:29722) Error: Unable to chown '/home/condor/hosts/dc09/spool/cluster44.proc0.subproc0' from vetter to condor.condor
8/3 10:04:25 (pid:29722) (44.0) Failed to chown /home/condor/hosts/dc09/spool/cluster44.proc0.subproc0 from vetter to condor.condor. User may run into permissions problems when fetching sandbox.




--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796