Re: [Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.



On Jan 20, 2011, at 1:51 PM, Shrader, Joshua H. wrote:

> We’ve had condor running well for the past year or so without any major issues, but after some recent network changes, condor_submit has begun hanging while “Logging submit event(s).”  It will hang for about 1 minute, and then seemingly work.
> 
> ShadowLog and SchedLog are both reporting 
> FileLock::obtain(1) failed – errno 5 (input/output error) 
> And
> FileLock::obtain(1) failed – errno 37 (No locks available)
> 
> We have a dedicated server that serves as a single submit node.  All users log into this submit node to submit jobs.  Previously, the directories to which the output, error, and log files were written were physically located on this machine.  They are now located on a SAN, and are accessed via NFS.


Josh,
	One thing that might help with NFS file locking is to "just say no". If you are willing to try the development branch of Condor, take a look at NEW_LOCKING and CREATE_LOCKS_ON_LOCAL_DISK in the 7.5 release notes:
http://www.cs.wisc.edu/condor/manual/v7.5/8_2Development_Release.html

Alternatively, perhaps take a look at increasing the concurrency of lock requests on your NFS server, e.g., on Solaris increase LOCKD_SERVERS in /etc/default/nfs.
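
To check whether lock requests on the NFS-mounted directory are what is slow or failing, a small probe along these lines may help. It is only a sketch: it assumes fcntl-style advisory locks (how Condor itself takes the lock is not shown here), and the function name probe_lock is made up for illustration. It takes and releases a write lock on a scratch file and reports how long that took and which errno, if any, came back (e.g. EIO or ENOLCK, matching the errno 5 and errno 37 in the logs on Linux).

```python
import errno
import fcntl
import os
import tempfile
import time


def probe_lock(directory):
    """Take and release a write lock on a scratch file in `directory`.

    Returns (ok, elapsed_seconds, errno_name_or_None).
    probe_lock is a hypothetical helper, not part of Condor.
    """
    fd, path = tempfile.mkstemp(prefix="lockprobe.", dir=directory)
    start = time.monotonic()
    try:
        try:
            # Request an exclusive advisory lock without waiting on contention.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return True, time.monotonic() - start, None
        except OSError as exc:
            # Map the numeric errno to its symbolic name, e.g. ENOLCK.
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return False, time.monotonic() - start, name
    finally:
        # Always clean up the scratch file, whether or not the lock worked.
        os.close(fd)
        os.unlink(path)
```

Calling probe_lock() on the directory that holds the job log files, and again on a local directory such as /tmp, gives a rough comparison; a long elapsed time or an error name only on the NFS path would point at the lock service rather than at Condor.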

--
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson