
Re: [Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.



Josh,

which version of Condor are you running? (I assume one in the 7.4 or earlier series... )

We've spent quite some effort in the current 7.5 development series improving the locking and making it less problematic with NFS.

To cut a long story short: if your log file is located on NFS, locking is not guaranteed to work reliably, and that can lead to exactly the errors you are seeing.
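If you want to check whether advisory locking actually works on a given mount, a small probe like the following can help. This is just a diagnostic sketch (not Condor's actual FileLock code) that takes the same kind of POSIX fcntl lock; run it once against a file on the NFS volume and once against a local path such as /var/tmp.

```python
import fcntl

def try_lock(path):
    """Try to take an exclusive, non-blocking POSIX advisory lock on
    `path`, roughly what Condor's FileLock::obtain() does underneath.
    Returns None on success, or the errno on failure. On an NFS mount
    without a reachable lock daemon this is typically where
    errno 37 (ENOLCK, "No locks available") comes from."""
    with open(path, "a") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(f, fcntl.LOCK_UN)
            return None
        except OSError as e:
            return e.errno

if __name__ == "__main__":
    # On a healthy local filesystem this should print None.
    print(try_lock("/tmp/condor_lock_probe"))
```

If the probe succeeds locally but fails on the NFS path, the problem is in the NFS locking setup rather than in Condor itself.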

A workaround is to *not* have the log file on NFS (or any shared file system) but to use file transfer instead (for stdout and stderr it does not matter).
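In submit-file terms, that workaround looks roughly like the sketch below. All file names and paths here are hypothetical placeholders; the point is only that `log` stays on a local disk while `should_transfer_files` handles the rest.

```
# Hypothetical submit file: keep the user log on local disk and let
# Condor's file transfer move the job's files, instead of writing
# everything through the NFS mount.
universe                = java
executable              = MyJob.class
arguments               = MyJob
log                     = /var/tmp/condor/myjob.log
output                  = myjob.out
error                   = myjob.err
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```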

The 7.6 stable series will have improved locking, and these kinds of problems should go away.

The LOCK config variable you mention is independent of the locking problem you are seeing. The directory defined by LOCK is used, for example, for the InstanceLock (which helps guarantee that only one condor_master runs at a given time). It is fine, and in fact necessary for reliable locking, that the LOCK directory is on a local drive rather than a shared network volume (see the explanation above).
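So a setting along these lines in the condor_config of each node is perfectly reasonable; the exact path is just an example, the only requirement being that it lives on a local filesystem:

```
# Keep Condor's own lock files (InstanceLock etc.) on local disk,
# never on NFS. /tmp/condorlock as in your config is fine too.
LOCK = /var/lock/condor
```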

Thanks,
Cathrin


On 01/20/2011 03:51 PM, Shrader, Joshua H. wrote:
We’ve had condor running well for the past year or so without any major issues, but after some recent network changes, condor_submit has begun hanging while “Logging submit event(s).”  It will hang for about 1 minute, and then seemingly work.

ShadowLog and SchedLog are both reporting:
FileLock::obtain(1) failed – errno 5 (input/output error)
and
FileLock::obtain(1) failed – errno 37 (No locks available)

We have a dedicated server that serves as a single submit node.  All users log into this submit node to submit jobs.  Previously, the directories to which the output, error, and log files were written were physically located on this machine.  They are now located on a SAN, and are accessed via NFS.

Jobs are submitted to the Java universe, and we do not use transfer_input_files and when_to_transfer (i.e., we are operating on a shared filesystem).

/tmp (obviously) is not shared between the submitting and executing machine, yet this is where the lock files reside (i.e. LOCK=/tmp/condorlock in the condor config file).  Can this be a problem?  Is there a way to determine what files are trying to be accessed when these FileLock::obtain() errors occur?  Is our network setup (output, error, and log files being written to an NFS-mounted volume) feasible?

Another thing that we’ve noticed is that after condor_submit completes (i.e. after hanging for about 1 minute while logging events), if we attempt a condor_rm on the job, condor_schedd gets stuck with an effective UID of the user executing the condor_rm.  condor_q becomes unresponsive.  Eventually, condor_schedd switches back to the condor user, and everything seems to go back to normal.

Any help is greatly appreciated.

Thanks,