
Re: [Condor-users] FileLock::obtain error for all jobs



Ashutosh Mahajan wrote:
Hello everyone,
  We are running condor-6.8.2 on nearly 500 cores (< 200 machines) managed by
a central manager. /home is shared across all nodes over NFS. The Condor
binaries are also on NFS, but LOCAL_DIR is not (so log, spool, and execute are
not on NFS). Today it appears that ALL jobs (vanilla, standard, parallel) hit
FileLock::obtain(1) or FileLock::obtain(2) errors. The ShadowLog and
ShadowLog.old are full of lines like:

12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
 ....

It is now back to normal, and we don't know if or when this will happen again.

Around the same time, dmesg shows a lot of segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first time it has
happened, since last week a user complained that all her parallel jobs were
disconnecting and restarting for no apparent reason. I have saved logs from
some machines and from the central manager after this event, and I can post
them on the web if need be.

Any suggestions will be very helpful. Thanks in advance.


Some ideas:

1) When you submit jobs, if you specify a job event log (i.e. log = /some/path ), put the log file someplace on local disk and not on NFS.

2) On submit machines where users do not use DAGMan or other facilities that rely upon the job event log (the log = /path/... setting in the submit file), put
   ENABLE_USERLOG_LOCKING = False
in the condor_config file.

3) Upgrade to the current stable release. Many bugs were fixed and improvements made between v6.8.2 and v6.8.7. Note that you could upgrade just your submit machine(s) if that makes it easier; all versions of Condor within a given stable series are compatible with each other over the network, so a submit machine running v6.8.7 would have no problems working with v6.8.2 machines in the rest of your pool.
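
As a rough sketch of items 1 and 2, a submit description file might look something like the following. The executable name and all paths here are hypothetical placeholders, and /var/tmp is only assumed to be on local disk on your submit machine; the point is just that the log = line names a file that is not under the NFS-mounted /home.

```
# Submit description file (illustrative only; names and paths are made up)
universe   = vanilla
executable = my_job
output     = my_job.out
error      = my_job.err
# Job event log on local disk rather than under NFS-mounted /home
log        = /var/tmp/my_job.log
queue
```

And for item 2, on a submit machine where nothing depends on the job event log, the local configuration would carry the one setting mentioned above:

```
# condor_config (or the local config file) on the submit machine
ENABLE_USERLOG_LOCKING = False
```

The daemons on that machine need to re-read their configuration before the change takes effect.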

We have a patch almost ready that will automatically place lock files for the job event log on local disk, which will make workarounds like the above for NFS's poor file locking obsolete. But for now, the above items come to mind.

best,
Todd