Re: [Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.
- Date: Thu, 20 Jan 2011 17:33:54 -0500
- From: "Shrader, Joshua H." <Joshua.Shrader@xxxxxxxxxx>
- Subject: Re: [Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.
Thanks Cathrin. That explains a lot. We are indeed running a 7.2 version,
and now looking forward to the 7.6 release :)
Josh
On 1/20/11 5:20 PM, "Cathrin Weiss" <cweiss@xxxxxxxxxxx> wrote:
> Josh,
>
> which version of Condor are you running? (I assume one in the 7.4 or
> earlier series... )
>
> We've spent quite some effort in the current 7.5 development series to
> improve on the locking and making it less problematic with NFS.
>
> To cut a long story short: If your log file is located on NFS, locking
> is not guaranteed to work reliably and can lead to the errors you are
> seeing.
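
One way to check whether a given directory supports locking at all is a small probe like the one below. It is only a sketch. It assumes the FileLock::obtain() failures come from POSIX advisory (fcntl-style) locks, which is a guess based on the errno values quoted further down and not something stated in this thread. `probe_lock` is a hypothetical helper name, not part of Condor.

```python
import errno
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try to take and release an exclusive advisory lock in `directory`.

    Returns None if the lock was obtained and released, otherwise the
    symbolic errno name of the failure (for example 'ENOLCK' or 'EIO').
    """
    # Create a scratch file so no real log file is touched.
    fd, path = tempfile.mkstemp(prefix="lockprobe_", dir=directory)
    try:
        try:
            # Non-blocking, so a broken lock service fails fast where possible.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return None
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
        os.unlink(path)
```

Running `probe_lock()` once against the NFS-mounted log directory and once against a local directory such as /tmp should show whether the two behave differently. A clean result does not prove that locking is reliable under load, only that it is not failing outright.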
>
> A workaround is to *not* have the log file on NFS (or any shared file
> system) but to use file transfer (for stdout and stderr it does not matter).
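
To confirm where a log file actually lives before applying this workaround, something along these lines may help. It is a rough sketch that assumes a Linux submit node, where /proc/mounts lists one mount per line as "device mount_point fs_type options ...". `filesystem_type` and `is_on_nfs` are hypothetical helper names, and mount points containing spaces (escaped in /proc/mounts) are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) of the deepest mount containing `path`.

    `mounts_text` is the content of a mounts table in /proc/mounts format.
    """
    path = os.path.abspath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /data does not match /database.
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later entry for the same mount point win.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


def is_on_nfs(path, mounts_text):
    """True if the deepest mount containing `path` is nfs, nfs4, etc."""
    return filesystem_type(path, mounts_text)[1].startswith("nfs")
```

On the submit node this could be called as `is_on_nfs(log_path, open("/proc/mounts").read())`, where `log_path` is whatever the job's log setting points to.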
>
> The 7.6 stable series will have improved locking, and those kinds of
> problems should go away.
>
> The LOCK config variable you mention is independent of the locking
> problem you are seeing; the directory defined by LOCK is used, for
> example, for the InstanceLock (which helps to guarantee that only one
> condor_master runs at a given time). It is OK, and in fact necessary for
> reliable locking, that the LOCK dir is on a local drive and not on a
> shared network volume (for the reasons explained above).
>
> Thanks,
> Cathrin
>
>
> On 01/20/2011 03:51 PM, Shrader, Joshua H. wrote:
>> We've had condor running well for the past year or so without any major
>> issues, but after some recent network changes, condor_submit has begun
>> hanging while "Logging submit event(s)." It will hang for about 1 minute,
>> and then seemingly work.
>>
>> ShadowLog and SchedLog are both reporting
>> FileLock::obtain(1) failed errno 5 (input/output error)
>> And
>> FileLock::obtain(1) failed errno 37 (No locks available)
>>
>> We have a dedicated server that serves as a single submit node. All users
>> log into this submit node to submit jobs. Previously, the directories to
>> which the output, error, and log files were written were physically located
>> on this machine. They are now located on a SAN, and are accessed via NFS.
>>
>> Jobs are submitted to the Java universe, and we do not use
>> transfer_input_files and when_to_transfer (i.e., we are operating on a shared
>> filesystem).
>>
>> /tmp (obviously) is not shared between the submitting and executing machine,
>> yet this is where the lock files reside (i.e. LOCK=/tmp/condorlock in the
>> condor config file). Can this be a problem? Is there a way to determine
>> what files are trying to be accessed when these FileLock::obtain() errors
>> occur? Is our network setup (output, error, and log files being written to
>> an NFS-mounted volume) feasible?
>>
>> Another thing that we've noticed is that after condor_submit completes (i.e.
>> after hanging for about 1 minute while logging events), if we attempt a
>> condor_rm on the job, condor_schedd gets stuck with an effective UID of the
>> user executing the condor_rm. condor_q becomes unresponsive. Eventually,
>> condor_schedd switches back to the condor user, and everything seems to go
>> back to normal.
>>
>> Any help is greatly appreciated.
>>
>> Thanks,