
Re: [Condor-users] Shadow processes not ending



On Thu, 7 Dec 2006, Todd Tannenbaum wrote:

> Re the below - what version and platform are you on? I will guess v6.8.x and Linux, but if I guessed wrong please tell me.

Yup, you guessed right - 6.8.1 on Linux.

> Does the below only happen when you have lots of running jobs, or even with just a few, or even with just one?

There are generally a few tens of jobs running on my pool, so I can't say right now what the behaviour is with just one or two jobs running. I'll try to investigate that further when a convenient opportunity presents itself. In case it helps, I've also noticed that the following error sometimes pops up in the ShadowLog at the same time as the FileLock errors:

12/12 00:13:35 (201.18) (22856): ERROR "Can no longer talk to condor_starter <172.24.89.152:9625>" at line 123 in file NTreceivers.C
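
To see how often the two errors coincide, I can pull both out of the shadow log with something like the following (the ShadowLog path is just where it lives in my install, and the pattern assumes the lock errors actually contain the string "FileLock", which mine do - adjust as needed):

  # list both error types, with their timestamps, so the FileLock
  # failures can be lined up against the lost-starter errors
  grep -E 'FileLock|Can no longer talk to condor_starter' /opt/condor/local/log/ShadowLog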

> If the above does not help, or you cannot configure that way because of diskless nodes, you could get rid of ShadowLog locking altogether by having each shadow write into its own log file instead of sharing one. To do this, remove (or comment out) SHADOW_LOCK and then change SHADOW_LOG to be something like
>   SHADOW_LOG=/somewhere/shadowlog.$(pid)

All log and lock files are on a local disk, with the exception of the job log files (i.e. the "Log" file in the submit file), which are on NFS. Basically, our setup is that Condor itself is installed locally on each machine, whilst all the users' files are on NFS (which thus includes things like the submitted executable and the input/output files for each job). The behaviour seems to be the same for both standard and vanilla jobs, which are all we run. Could it be the log files for the individual jobs that are causing the problem? I've tried your SHADOW_LOG ... $(pid) suggestion, but that didn't seem to change anything.
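
For reference, this is roughly what I put in the local condor_config when I tried it - the $(LOG)/$(LOCK) macros and the exact default value of SHADOW_LOCK are from memory, so treat it as a sketch rather than gospel:

  # let each shadow write its own log instead of sharing one,
  # so no shared lock file is needed
  # (the commented-out default below is my best guess at its usual value)
  #SHADOW_LOCK = $(LOCK)/ShadowLock
  SHADOW_LOG = $(LOG)/ShadowLog.$(PID)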

Thanks for the suggestions.

Adam