
Re: [Condor-users] Shadow processes not ending



Re the below:

It could absolutely be a problem with locking (over NFS) the job's log file in the user's home directory.

You can tell Condor not to bother with file locking by putting the following into condor_config:

  ENABLE_USERLOG_LOCKING = FALSE

There is no downside to doing this, assuming you do not have multiple jobs logging to the same job log file.
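If you want to double-check that a node has picked the change up, something along these lines should work (assuming a standard install with the Condor tools in your path):

  # after editing condor_config on the node
  condor_reconfig
  condor_config_val ENABLE_USERLOG_LOCKING
  # should now print FALSE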

Of course, even if the above works for Condor, you still have a messed-up rpc.lockd or some such that could cause problems for other applications as well.   'Course, you could always use a Hawkeye plugin to monitor whether NFS locking is broken on a given node, and then have Condor not run jobs that want a log file on said broken nodes...  ;-)
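Just to give the flavour of it (this is a sketch, not something we ship -- the test path, timeout, and attribute name below are all made up), such a probe could be a small Python script run periodically via Hawkeye / the startd cron mechanism, with its printed attribute folded into the machine ad:

  #!/usr/bin/env python
  # Rough sketch of an NFS-locking probe.  Everything here (path, timeout,
  # attribute name) is illustrative only -- adapt to your site.
  import fcntl, os, signal

  TEST_FILE = "/home/condor/.nfs_lock_probe"   # assumed to live on the NFS mount
  TIMEOUT   = 10                               # seconds before we give up

  def on_timeout(signum, frame):
      # A broken rpc.lockd typically shows up as a lock request that never returns.
      print("NfsLockingOk = False")
      os._exit(0)

  signal.signal(signal.SIGALRM, on_timeout)
  signal.alarm(TIMEOUT)

  try:
      fd = os.open(TEST_FILE, os.O_CREAT | os.O_RDWR)
      fcntl.lockf(fd, fcntl.LOCK_EX)    # the call that hangs when lockd is wedged
      fcntl.lockf(fd, fcntl.LOCK_UN)
      os.close(fd)
      signal.alarm(0)
      print("NfsLockingOk = True")
  except (IOError, OSError):
      print("NfsLockingOk = False")

The startd on those nodes could then be told to turn away jobs that want a user log, with something roughly like START = $(START) && (NfsLockingOk =!= False || TARGET.UserLog =?= UNDEFINED) -- but do double-check the attribute names before trusting that expression.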

Hope this helps
Todd

---
Todd Tannenbaum
Dept of Computer Sciences
University of Wisconsin-Madison
..Sent from a Palm Treo 680...

-----Original Message-----

From:  Adam Thorn <alt36@xxxxxxxxxxxxxxxx>
Subj:  Re: [Condor-users] Shadow processes not ending
Date:  Tue Dec 12, 2006 5:38 am
Size:  2K
To:  Condor-Users Mail List <condor-users@xxxxxxxxxxx>

On Thu, 7 Dec 2006, Todd Tannenbaum wrote:

> Re the below - what version and platform are you on?  I will guess v6.8.x 
> and Linux, but if I guessed wrong please tell me.

Yup, you guessed right - 6.8.1 on Linux.

> Does the below only happen when you have lots of running jobs, or even 
> with just a few, or even with just one?

There are generally a few tens of jobs running on my pool, so I can't say 
right now what the behaviour is with just one or two jobs running. I'll 
try to investigate that further when a convenient opportunity presents 
itself. I've also noticed that the following error sometimes pops up in the 
ShadowLog at the same time as the FileLock errors, in case it helps:

12/12 00:13:35 (201.18) (22856): ERROR "Can no longer talk to 
condor_starter <172.24.89.152:9625>" at line 123 in file NTreceivers.C

> If the above does not help, or you cannot configure that way because of 
> diskless nodes, you could get rid of shadowlog locking altogether by 
> having each shadow write into its own log file instead of sharing one. 
> To do this, remove (or comment out) SHADOW_LOCK and then change 
> SHADOW_LOG to be something like
>   SHADOW_LOG=/somewhere/shadowlog.$(pid)

All log and lock files are on a local disk, with the exception of the job 
log files (i.e. the "Log" file named in the submit file), which are on NFS. 
Basically, our setup is that Condor itself is installed locally on each 
machine whilst all users' files are on NFS (which thus includes things 
like the submitted executable and the input/output files for each job). The 
behaviour seems to be the same for both standard and vanilla jobs, which 
are all we run. Could it be the log files for the individual jobs that are 
causing the problem? I've tried your ShadowLog.$(pid) suggestion, but that 
didn't seem to change anything.
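For reference, the change I made was roughly the following (I'm quoting the 
macro names from memory, so treat the exact spelling as approximate):

  # comment out the shared lock so the per-shadow logs are not locked
  #SHADOW_LOCK = $(LOCK)/ShadowLock
  SHADOW_LOG = $(LOG)/ShadowLog.$(pid)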

Thanks for the suggestions.

Adam
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR