
Re: [Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.



Thanks Cathrin.  That explains a lot.  We are indeed running a 7.2 version,
and now looking forward to the 7.6 release :)

Josh


On 1/20/11 5:20 PM, "Cathrin Weiss" <cweiss@xxxxxxxxxxx> wrote:

> Josh,
> 
> which version of Condor are you running? (I assume one in the 7.4 or
> earlier series... )
> 
> We've put considerable effort in the current 7.5 development series into
> improving the locking and making it less problematic with NFS.
> 
> To cut a long story short: If your log file is located on NFS, locking
> is not guaranteed to work reliably and can lead to the errors you are
> seeing.
> 
> A workaround is to *not* have the log file on NFS (or any shared file
> system) but to use file transfer (for stdout and stderr it does not matter).
> 
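One way to check whether a given directory supports locking at all is to
try an advisory lock on a scratch file there. The following is only a
minimal sketch, assuming Python 3 on a POSIX host; it is not Condor's
FileLock code, and probe_lock is a hypothetical helper name. On Linux,
errno 5 and errno 37 from the logs correspond to EIO and ENOLCK.

```python
import errno
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try to take and release an exclusive advisory lock in `directory`.

    Returns (True, None) on success, or (False, errno_name) if the lock
    request fails (for example "ENOLCK" or "EIO").
    """
    # Create a scratch file in the directory under test (hypothetical prefix).
    fd, path = tempfile.mkstemp(prefix="lockprobe.", dir=directory)
    try:
        try:
            # Non-blocking exclusive lock using fcntl-style locking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        # The lock was granted, so release it again.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        # Always clean up the scratch file.
        os.close(fd)
        os.unlink(path)
```

Calling probe_lock() once on the NFS-mounted log directory and once on a
local directory such as /tmp should show whether lock requests fail only
on the shared volume. A single successful probe does not prove that
locking is reliable under load; it only rules out a complete failure.
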
> The 7.6 stable series will have improved locking, and those kinds of
> problems should go away.
> 
> The mentioned LOCK config variable is independent from the locking
> problem you are seeing; the directory defined by LOCK is for example
> used for the InstanceLock (which helps to guarantee that only one
> condor_master runs at a given time). It is OK, and even necessary for
> reliable locking, that the LOCK directory is on a local drive and not on
> a shared network volume (for the reason explained above).
> 
> Thanks,
> Cathrin
> 
> 
> On 01/20/2011 03:51 PM, Shrader, Joshua H. wrote:
>> We've had condor running well for the past year or so without any major
>> issues, but after some recent network changes, condor_submit has begun
>> hanging while "Logging submit event(s)."  It will hang for about 1 minute,
>> and then seemingly work.
>> 
>> ShadowLog and SchedLog are both reporting
>> FileLock::obtain(1) failed - errno 5 (input/output error)
>> and
>> FileLock::obtain(1) failed - errno 37 (No locks available)
>> 
>> We have a dedicated server that serves as a single submit node.  All users
>> log into this submit node to submit jobs.  Previously, the directories to
>> which the output, error, and log files were written were physically located
>> on this machine.  They are now located on a SAN, and are accessed via NFS.
>> 
>> Jobs are submitted to the Java universe, and we do not use
>> transfer_input_files or when_to_transfer (i.e. we are operating on a shared
>> filesystem).
>> 
>> /tmp (obviously) is not shared between the submitting and executing machines,
>> yet this is where the lock files reside (i.e. LOCK=/tmp/condorlock in the
>> condor config file).  Can this be a problem?  Is there a way to determine
>> which files are being accessed when these FileLock::obtain() errors
>> occur?  Is our network setup (output, error, and log files being written to
>> an NFS-mounted volume) feasible?
>> 
>> Another thing that we've noticed is that after condor_submit completes (i.e.
>> after hanging for about 1 minute while logging events), if we attempt a
>> condor_rm on the job, condor_schedd gets stuck with an effective UID of the
>> user executing the condor_rm.  condor_q becomes unresponsive.  Eventually,
>> condor_schedd switches back to the condor user, and everything seems to go
>> back to normal.
>> 
>> Any help is greatly appreciated.
>> 
>> Thanks,
> 