[Condor-users] FileLock::obtain error for all jobs

Hello everyone,
  We are running condor-6.8.2 on nearly 500 cores (fewer than 200 machines)
managed by a central manager. /home is shared across all nodes over NFS, and
the Condor binaries are also on NFS, but LOCAL_DIR is not (so log, spool, and
execute are on local disk). Today it appears that ALL jobs (vanilla, standard,
parallel) hit FileLock::obtain(1) or FileLock::obtain(2) failures. The
ShadowLog and ShadowLog.old are full of lines like:

12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file
12/5 22:51:22 (14206.0) (3010): **** condor_shadow (condor_SHADOW) EXITING
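
To see how many jobs were affected and when, the failures in ShadowLog could
be tallied with a small script. The following is only a rough sketch: the
regular expression is modeled on the lines quoted above and may need adjusting
for other log formats, and the names summarize_lock_failures and SAMPLE are
made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the ShadowLog lines quoted in this post (assumption:
# other Condor versions or configurations may format these lines differently).
LINE_RE = re.compile(
    r"^(?P<date>\d+/\d+)\s+(?P<time>\d+:\d+:\d+)\s+"
    r"\((?P<job>\d+\.\d+)\)\s+\((?P<pid>\d+)\):\s*"
    r"FileLock::obtain\((?P<mode>\d+)\) failed - errno (?P<errno>\d+)"
)

# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
12/5 22:51:21 (13867.0) (3294):FileLock::obtain(2) failed - errno 9 (Bad file
12/5 22:51:21 (13867.0) (3294):********** Shadow Exiting(107) **********
12/5 22:51:22 (14206.0) (3010): Job 14206.0 terminated: exited with status 0
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(1) failed - errno 9 (Bad file
12/5 22:51:22 (14206.0) (3010): FileLock::obtain(2) failed - errno 9 (Bad file
"""


def summarize_lock_failures(lines):
    """Count FileLock::obtain failures per job ID and per minute."""
    per_job = Counter()
    per_minute = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a lock-failure line
        per_job[m.group("job")] += 1
        # Keep only HH:MM so failures are bucketed by minute.
        per_minute[m.group("date") + " " + m.group("time")[:5]] += 1
    return per_job, per_minute


if __name__ == "__main__":
    jobs, minutes = summarize_lock_failures(SAMPLE.splitlines())
    for job, count in sorted(jobs.items()):
        print(job, count)
    for minute, count in sorted(minutes.items()):
        print(minute, count)
```

To run it against a real log, the lines of the ShadowLog file (opened with the
ordinary open() call) can be passed in place of SAMPLE.splitlines().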

Things are back to normal now, but we don't know whether or when this will
happen again.

Around the same time, dmesg shows a lot of segmentation faults, RPC/portmap
errors, and several call traces. This may not be the first occurrence: last
week a user complained that all her parallel jobs were disconnected and
restarted for no apparent reason. I have saved logs from some machines and
the central manager after this event, and I can post them on the web if need
be.

Any suggestions would be very helpful. Thanks in advance.

Ashutosh Mahajan