[Condor-users] FileLock::obtain error for all jobs



Hello everyone,
  We are running condor-6.8.2 on nearly 500 cores (< 200 machines) managed by
a central manager. /home is shared across all nodes over NFS, and the condor
binaries are also on NFS, but LOCAL_DIR is not (so log, spool, and execute are
not on NFS). Today it looks like ALL jobs (vanilla, standard, parallel) were
getting FileLock::obtain(1) or FileLock::obtain(2) errors. The ShadowLog and
ShadowLog.old are full of lines like:

12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
 ....
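
To get a rough idea of how many jobs were hit, the failures in a ShadowLog
can be tallied per job. The following is only a minimal sketch: the regular
expression is an assumption based solely on the few log lines quoted above,
and the function name count_lock_failures is made up for illustration (it is
not part of condor).

```python
import re
from collections import Counter

# Pattern assumed from the excerpt above:
#   "<date> <time> (cluster.proc) (pid):FileLock::obtain(N) failed - errno E"
# Other condor versions or log settings may format these lines differently.
LOCK_FAILURE = re.compile(
    r"\((?P<job>\d+\.\d+)\)\s*\((?P<pid>\d+)\):\s*"
    r"FileLock::obtain\((?P<lock_type>\d+)\) failed - errno (?P<errno>\d+)"
)


def count_lock_failures(lines):
    """Return a Counter mapping job id -> number of FileLock::obtain failures."""
    failures = Counter()
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            failures[match.group("job")] += 1
    return failures


# Sample input copied from the log excerpt above.
sample = [
    "12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file",
    "12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********",
    "12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0",
    "12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file",
    "12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file",
]

for job, count in sorted(count_lock_failures(sample).items()):
    print(job, count)
```

To run it on a real log, pass an open file object for the ShadowLog instead
of the sample list; the function only iterates over lines, so it does not
load the whole file into memory.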

It is now back to normal, but we don't know if or when this will happen again.

Around the same time, dmesg shows a lot of segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first time it has
happened: last week a user complained that all her parallel jobs were getting
disconnected and restarting for no apparent reason. I have saved logs from
some machines and the central manager after this event, and I can post them
on the web if need be.

Any suggestions would be very helpful. Thanks in advance.

--
regards
Ashutosh Mahajan
http://www.lehigh.edu/~asm4