
Re: [Condor-users] Shadow processes not ending



Hi Adam -

Regarding the message below: what version and platform are you on? I will guess v6.8.x and Linux, but if I guessed wrong, please tell me.

Does the problem happen only when you have lots of running jobs, or also with just a few, or even with just one?

After reading your message, my first thought is file locking on NFS, which is notoriously unreliable, especially on anything other than Solaris. Is your ShadowLog file going onto NFS? More importantly, where is Condor placing its lock files? Look at the value of LOCK_DIR in the config file and make certain it points to a local filesystem and not someplace on NFS (setting LOCK_DIR=/tmp is often a reasonable choice). Also make certain that SHADOW_LOCK is defined in terms of LOCK_DIR.
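
One rough way to check whether a directory sits on NFS on Linux is to look it up in /proc/mounts. The sketch below is only an illustration and is not part of Condor: the function names fs_type_for and is_nfs are made up here, it assumes a Linux-style mount table, and it does not resolve symlinks or handle mount points containing spaces.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount holding `path`, or None.

    `mounts_text` is the text of a Linux mount table such as /proc/mounts,
    where each line is: device mount_point fstype options dump pass.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Match whole path components only, so /home does not match /homework.
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties, the later entry wins.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type


def is_nfs(path, mounts_text):
    """True if `path` appears to live on an NFS mount (nfs, nfs4, ...)."""
    fstype = fs_type_for(path, mounts_text)
    return fstype is not None and fstype.startswith("nfs")


# Example use on a Linux machine (the path is whatever LOCK_DIR is set to):
#     with open("/proc/mounts") as f:
#         print(is_nfs("/tmp", f.read()))
```

If is_nfs reports True for the directory that LOCK_DIR points to, that is a hint the lock files are on NFS and the setting should probably be moved to a local disk.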

If the above does not help, or you cannot configure things that way because of diskless nodes, you could get rid of ShadowLog locking altogether by having each shadow write into its own log file instead of sharing one. To do this, remove (or comment out) SHADOW_LOCK and then change SHADOW_LOG to be something like
   SHADOW_LOG=/somewhere/shadowlog.$(pid)

You will probably want to set up a cron job to periodically clean out old files from /somewhere.
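
A minimal sketch of such a cleanup, written in Python so it can be called from cron: it deletes files matching shadowlog.* that have not been modified for a given number of days. The function name remove_old_logs, the 7-day default, and the file pattern are assumptions chosen for illustration; adjust them to match whatever SHADOW_LOG is actually set to.

```python
import glob
import os
import time


def remove_old_logs(directory, max_age_days=7, pattern="shadowlog.*", now=None):
    """Delete files in `directory` matching `pattern` that are older than
    `max_age_days` (by modification time). Returns the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        try:
            # Only touch regular files whose last write predates the cutoff.
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
        except OSError:
            # The file vanished or could not be removed; skip it.
            continue
    return removed
```

Because the check uses modification time, a log that a still-running shadow has written to recently is left alone. A shadow that has been silent for longer than the cutoff could still lose its log, so pick max_age_days larger than your longest expected job.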

Hope this helps,
Todd

---
Todd Tannenbaum
Dept of Computer Sciences
University of Wisconsin-Madison
..Sent from a Palm Treo 680...

-----Original Message-----

From:  Adam Thorn <alt36@xxxxxxxxxxxxxxxx>
Subj:  [Condor-users] Shadow processes not ending
Date:  Thu Dec 7, 2006 4:54 am
Size:  1K
To:  Condor-Users Mail List <condor-users@xxxxxxxxxxx>

I've noticed that often, after job termination the condor_shadow processes 
hang around even though the jobs they were shadowing finished hours 
previously. My ShadowLog has lots of the following:

12/7 04:36:12 (173.16) (5050): Job 173.16 terminated: exited with status 0
12/7 04:36:12 (173.16) (5050): FileLock::obtain(1) failed - errno 37 (No locks available)
12/7 04:36:12 (173.16) (5050): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

Similar FileLock errors often appear immediately after the job log shows 
that a job begins to execute, and they also seem to coincide with many 
other events in the job log - for example, when the image size is updated, 
or the job is evicted, there is often a FileLock error in the ShadowLog at 
the same time. Any idea what's going on?

Adam