
[Condor-users] Condor_submit stalls on "Logging submit event(s)." FileLock::obtain(1) errors in log files.



We’ve had Condor running well for the past year or so without any major issues, but after some recent network changes, condor_submit has begun hanging at “Logging submit event(s).”  It hangs for about one minute and then appears to complete normally.

ShadowLog and SchedLog are both reporting:
FileLock::obtain(1) failed – errno 5 (input/output error)
and
FileLock::obtain(1) failed – errno 37 (No locks available)

We have a dedicated server that serves as a single submit node.  All users log into this submit node to submit jobs.  Previously, the directories to which the output, error, and log files were written were physically located on this machine.  They are now located on a SAN, and are accessed via NFS.

Jobs are submitted to the Java universe, and we use neither transfer_input_files nor when_to_transfer_output (i.e., we rely on a shared filesystem).

/tmp is (obviously) not shared between the submitting and executing machines, yet this is where the lock files reside (i.e. LOCK=/tmp/condorlock in the condor config file).  Can this be a problem?  Is there a way to determine which files are being accessed when these FileLock::obtain() errors occur?  Is our setup (output, error, and log files written to an NFS-mounted volume) workable?
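
One way to narrow this down outside of Condor is to check whether plain file locking works at all in the directories involved. Below is a minimal Python sketch, assuming the failing locks are POSIX fcntl-style locks (which errno 37, ENOLCK on Linux, suggests but does not prove). The function name probe_lock and the scratch-file prefix are made up for illustration; nothing here is part of Condor.

```python
import errno
import fcntl
import os
import sys
import tempfile
import time


def probe_lock(directory):
    """Take and release an exclusive fcntl lock on a scratch file in `directory`.

    Returns (ok, errno_name, seconds): ok is True if the lock was obtained,
    errno_name is None on success or a symbolic name such as 'ENOLCK' or 'EIO'
    on failure, and seconds is how long the attempt took.
    """
    # Hypothetical scratch file; it is removed again before returning.
    fd, path = tempfile.mkstemp(prefix="lockprobe_", dir=directory)
    start = time.monotonic()
    try:
        try:
            # Non-blocking exclusive lock over the whole file, then unlock.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return True, None, time.monotonic() - start
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return False, name, time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Directories to test are given on the command line; default is the temp dir.
    for d in sys.argv[1:] or [tempfile.gettempdir()]:
        ok, err, secs = probe_lock(d)
        status = "lock OK" if ok else "lock FAILED ({})".format(err)
        print("{}: {} after {:.2f}s".format(d, status, secs))
```

Running it on the submit node against both the local lock directory and the NFS-mounted directory that holds the job log files would show whether the lock failure (or a long delay) is reproducible on the NFS mount alone, independent of Condor.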

Another thing that we’ve noticed is that after condor_submit completes (i.e. after hanging for about 1 minute while logging events), if we attempt a condor_rm on the job, condor_schedd gets stuck with an effective UID of the user executing the condor_rm.  condor_q becomes unresponsive.  Eventually, condor_schedd switches back to the condor user, and everything seems to go back to normal.

Any help is greatly appreciated.

Thanks,

Josh