
[Condor-users] ERROR “Assertion ERROR on (m_lock->isLocked())” at line 1312 in file read_user_log.cpp



One of the users of my cluster is having DAGMan job failures.  The failures happen consistently with certain job clusters, while others run to completion.  He gets the following error in dagman.log:
…Job was evicted.
(0)    Job was not checkpointed.

And in dagman.out:
… ERROR “Assertion ERROR on (m_lock->isLocked())” at line 1312 in file read_user_log.cpp

I checked the SchedLog for when he ran the job, and found this:
5/25 07:35:02 (pid:7843) FileLock::obtain(1) failed - errno 121 (Remote I/O error)

I'm assuming this is some sort of NFS file locking problem, but I'm not clear on which file it's failing to lock; the error message doesn't say.  I had him try moving his log files to a local filesystem, but it didn't help.  Can anyone point me in the right direction?
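
One way to narrow down which filesystem is refusing locks is to try a plain lock in each directory the job touches (log directory, submit directory, and so on). The script below is only a rough sketch: it assumes the failure can be reproduced with an ordinary POSIX fcntl lock taken from Python on the submit host, which may not match exactly what Condor's FileLock does internally. The function name try_lock and the scratch file prefix are made up for illustration.

```python
import errno
import fcntl
import os
import sys
import tempfile


def try_lock(directory):
    """Take and release an exclusive fcntl lock on a scratch file in `directory`.

    Returns (True, None) if the lock worked, or (False, errno_name) if the
    lock call failed. The scratch file is always removed afterwards.
    """
    # mkstemp opens the file read/write, which an exclusive lock requires.
    fd, path = tempfile.mkstemp(prefix="locktest.", dir=directory)
    try:
        try:
            # Non-blocking, so a broken lock daemon fails fast instead of hanging.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Usage: python locktest.py DIR [DIR ...]
    for d in sys.argv[1:]:
        ok, err = try_lock(d)
        status = "lock OK" if ok else "lock FAILED (%s)" % err
        print("%s: %s" % (d, status))
```

Running it against each candidate directory should show whether any of them returns an error on the lock call. On Linux, errno 121 is EREMOTEIO, so a directory reporting that name would be the likely match for the SchedLog entry.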

-- 

David Brodbeck
System Administrator, Linguistics
University of Washington