
Re: [Condor-users] Shadow processes not ending



On Thu, 7 Dec 2006, Todd Tannenbaum wrote:

> Re the below - what version and platform are you on? I will guess v6.8.x and Linux, but if I guessed wrong please tell me.

Yup, you guessed right - 6.8.1 on Linux.

> Does the below only happen when you have lots of running jobs, or even with just a few, or even with just one?

There are generally a few tens of jobs running on my pool, so I can't say right now what the behaviour is with just one or two jobs running. I'll try to investigate that further when a convenient opportunity presents itself. In case it helps, I've also noticed that the following error sometimes pops up in the ShadowLog at the same time as the FileLock errors:

12/12 00:13:35 (201.18) (22856): ERROR "Can no longer talk to condor_starter <172.24.89.152:9625>" at line 123 in file NTreceivers.C
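
To see how often the two errors coincide, I can pull both out of the shadow log with something like the following (the ShadowLog path is just where it lives in my install, and the pattern assumes the lock errors actually contain the string "FileLock", which mine do - adjust as needed):

  # list both error types, with their timestamps, so the FileLock
  # failures can be lined up against the lost-starter errors
  grep -E 'FileLock|Can no longer talk to condor_starter' /opt/condor/local/log/ShadowLog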

> If the above does not help, or you cannot configure that way because of diskless nodes, you could get rid of ShadowLog locking altogether by having each shadow write into its own log file instead of sharing one. To do this, remove (or comment out) SHADOW_LOCK and then change SHADOW_LOG to be something like
>   SHADOW_LOG=/somewhere/shadowlog.$(pid)

All log and lock files are on a local disk, with the exception of the job log files (i.e. the "Log" file in the submit file), which are on NFS. Basically, our setup is that Condor itself is installed locally on each machine, whilst all the users' files are on NFS (which thus includes things like the submitted executable and the input/output files for each job). The behaviour seems to be the same for both standard and vanilla jobs, which are all we run. Could it be the log files for the individual jobs that are causing the problem? I've tried your SHADOW_LOG ... $(pid) suggestion, but that didn't seem to change anything.
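
For reference, this is roughly what I put in the local condor_config when I tried it - the $(LOG)/$(LOCK) macros and the exact default value of SHADOW_LOCK are from memory, so treat it as a sketch rather than gospel:

  # let each shadow write its own log instead of sharing one,
  # so no shared lock file is needed
  # (the commented-out default below is my best guess at its usual value)
  #SHADOW_LOCK = $(LOCK)/ShadowLock
  SHADOW_LOG = $(LOG)/ShadowLog.$(PID)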

Thanks for the suggestions.

Adam