Re: [Condor-users] FileLock::obtain error for all jobs
- Date: Thu, 06 Dec 2007 09:02:25 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] FileLock::obtain error for all jobs
Ashutosh Mahajan wrote:
We are running condor-6.8.2 on nearly 500 cores (< 200 machines) managed by
a central manager. /home is shared across all nodes over NFS. The Condor binaries
are also on NFS, but LOCAL_DIR is not (so log, spool, and execute are not
on NFS). Today it appears that ALL jobs (vanilla, standard, parallel) were
getting FileLock::obtain(1) or FileLock::obtain(2) errors. The ShadowLog and
ShadowLog.old are full of lines like:
12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
It is now back to normal, and we don't know if or when this will happen again.
Around the same time, dmesg shows a lot of segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first time it has
happened: last week a user complained that all her parallel jobs were getting
disconnected and restarting for no apparent reason. I have saved logs from
some machines and the central manager after this event, and I can post them
on the web if need be.
Any suggestions would be very helpful. Thanks in advance.
1) When you submit jobs, if you specify a job event log (i.e. log =
/some/path ), put the log file someplace local and not on NFS.
2) On submit machines where users do not use DAGMan or other facilities
that rely upon the job event log ( log = /path/... ) in the submit file, set
ENABLE_USERLOG_LOCKING = False
in the condor_config file.
3) Upgrade to the current stable release -- many bugs have been fixed and
improvements made between v6.8.2 and v6.8.7. Note that you could upgrade
just your submit machine(s) if that makes it easier; all versions of
Condor within a given stable series are compatible over the network with
each other. So a submit machine running v6.8.7 would have no problems
working with v6.8.2 machines in the rest of your pool.
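As a rough sketch of item 1, a submit file that keeps the job event log on local disk might look something like the following. The executable name and all paths here are made up for illustration, and it assumes /var/tmp is on local disk on your submit machine; adjust to your site.

```
# Hypothetical submit description file; names and paths are placeholders.
universe   = vanilla
executable = /home/user/my_job
output     = my_job.out
error      = my_job.err
# Keep the job event log on the submit machine's local disk,
# not under the NFS-mounted /home.
log        = /var/tmp/user/my_job.log
queue
```

The executable and the output/error files can stay on NFS; it is the file named by log that Condor locks.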
We have a patch almost ready that will place lock files for the job
event log on local disk automatically, which will render obsolete
workarounds like the above that deal with NFS's poor file locking.
But for now, the above items come to mind.