Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] FileLock::obtain error for all jobs

Date: Thu, 6 Dec 2007 00:47:27 -0500
From: Ashutosh Mahajan <asm4@xxxxxxxxxx>
Subject: [Condor-users] FileLock::obtain error for all jobs

Hello everyone,
  We are running condor-6.8.2 on nearly 500 cores (fewer than 200 machines)
managed by a central manager. /home is shared across all nodes over NFS, and
the Condor binaries are also on NFS, but LOCAL_DIR is not (so log, spool, and
execute are local). Today it appears that ALL jobs (vanilla, standard,
parallel) got FileLock::obtain(1) or FileLock::obtain(2) errors. The ShadowLog
and ShadowLog.old are full of lines like:

12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
 ....
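
To line these failures up against the dmesg timestamps, the ShadowLog lines could be tallied per minute. The following is only a minimal sketch, assuming each log entry sits on a single line in the layout shown above; the regular expression and the function name tally_lock_failures are illustrative and not part of Condor.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
#   MM/DD HH:MM:SS (cluster.proc) (pid): FileLock::obtain(N) failed - errno E (...)
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}):\d{2} \((\d+\.\d+)\) \((\d+)\):\s*"
    r"FileLock::obtain\((\d+)\) failed - errno (\d+)"
)

def tally_lock_failures(lines):
    """Count FileLock::obtain failures per minute and collect affected job ids."""
    per_minute = Counter()
    jobs = set()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            minute, job_id, _pid, _lock_type, _errno = m.groups()
            per_minute[minute] += 1
            jobs.add(job_id)
    return per_minute, jobs

# Sample input: the log lines quoted in this message, one entry per line.
sample = [
    "12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file descriptor)",
    "12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********",
    "12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0",
    "12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)",
    "12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)",
]

per_minute, jobs = tally_lock_failures(sample)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```

On a real ShadowLog, the same function could be fed an open file object instead of the sample list.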

It is now back to normal, and we don't know whether or when this will happen
again.

Around the same time, dmesg shows many segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first occurrence: last
week a user complained that all her parallel jobs were disconnected and
restarted for no apparent reason. I have saved logs from some machines and the
central manager after this event, and I can post them on the web if need be.

Any suggestions would be very helpful. Thanks in advance.

--
regards
Ashutosh Mahajan
http://www.lehigh.edu/~asm4

Follow-Ups:
- Re: [Condor-users] FileLock::obtain error for all jobs
  - From: Todd Tannenbaum
