Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] FileLock::obtain error for all jobs

Date: Thu, 06 Dec 2007 09:02:25 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] FileLock::obtain error for all jobs

Ashutosh Mahajan wrote:

Hello everyone,
  We are running condor-6.8.2 on nearly 500 cores (< 200 machines) managed by
a central manager. /home is shared across all nodes over NFS. The Condor binaries
are also on NFS, but LOCAL_DIR is not (so log, spool, and execute are not
on NFS). Today it appears that ALL jobs (vanilla, standard, parallel) were
getting FileLock::obtain(1) or FileLock::obtain(2) errors. The ShadowLog and
ShadowLog.old are full of lines like:

12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
 ....

It is now back to normal, and we don't know if or when this will happen again.

Around the same time, dmesg shows a lot of segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first time it has
happened, since last week a user complained of all her parallel jobs getting
disconnected and restarting for no apparent reason. I have saved logs from
some machines and the central manager after this event. I can post them on
the web if need be.

Any suggestions will be very helpful. Thanks in advance.


Some ideas:

1) When you submit jobs, if you specify a job event log (i.e. log = /some/path ), put the log file someplace local and not on NFS.

2) On submit machines where users do not use DAGMan or other facilities that rely upon the job event log ( log = /path/... ) in the submit file, put

   ENABLE_USERLOG_LOCKING = False

in the condor_config file.

3) Upgrade to the current stable release -- many bugs have been fixed and improvements made between v6.8.2 and v6.8.7. Note that you could just upgrade your submit machine(s) if that makes it easier; all versions of Condor within a given stable series are compatible over the network with each other. So a submit machine running v6.8.7 would have no problems working with v6.8.2 machines in the rest of your pool.
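
To illustrate idea 1, here is a minimal sketch of a submit description file that keeps the job event log on local disk. The executable name, the output/error file names, and the /var/tmp path are placeholders made up for this example; use whatever local (non-NFS) directory exists on your submit machine.

   # Hypothetical submit file; all names and paths below are placeholders.
   universe   = vanilla
   executable = my_job
   output     = my_job.out
   error      = my_job.err
   # Job event log on local disk rather than under the NFS-mounted /home
   log        = /var/tmp/my_job.log
   queue

Only the log line matters here; the job's other files can stay where they are.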

We have a patch almost ready that will place lock files for the job event log on local disk automatically, which will render obsolete workarounds like the above that deal with NFS's poor file locking.... but for now, the above items come to mind.


best,
Todd
