Re: [Condor-users] Condor 6.9.2 hung schedd
- Date: Mon, 11 Jun 2007 10:21:15 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor 6.9.2 hung schedd
> For the n-th time, I found one of the schedds in my Condor pool dead.
> (The machine is the "pool master", but otherwise all submit machines
> = cluster headnodes are configured the same.)
> A look at the process table shows that the corresponding condor_schedd
> process is owned not by condor (as on all other submit machines) but by
> the user who submitted a job cluster before the problem showed up.
Practically the only time the schedd switches its effective UID to that
of the submitting user is when it is writing to the user's job log.
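That privilege-switch pattern can be sketched roughly as follows (a hypothetical illustration in Python, not the schedd's actual code; the `as_user` helper and the UID value are assumptions). The key point is that the daemon only drops to the user's UID for the duration of the log write:

```python
import contextlib
import os

@contextlib.contextmanager
def as_user(uid):
    """Temporarily assume the given effective UID, then restore the original.

    Only root may change its effective UID, so this is a no-op when the
    process is unprivileged (which also makes the sketch safe to run).
    """
    saved = os.geteuid()
    can_switch = (saved == 0)
    if can_switch:
        os.seteuid(uid)
    try:
        yield
    finally:
        if can_switch:
            os.seteuid(saved)

# Hypothetical usage: write the user log as the submitting user.
submitter_uid = os.geteuid()  # illustrative; the schedd would use the job owner's UID
with as_user(submitter_uid):
    pass  # append the event to the user log here
```

If the write (or the file lock guarding it) blocks, the process is stuck while still wearing the user's UID, which matches the process-table symptom described above.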
My guess: did the user in question submit job(s) with
log = /some/path
in the submit file, where "/some/path" is sitting on an NFS server?
If so, welcome to the pathetic world of NFS file locks (especially on Linux).
A quick workaround would be to have the user place the log file on a
local disk volume, or not specify a log file at all.
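For example, a submit description along these lines (executable name and paths are purely illustrative) keeps the log, and therefore the lock traffic, off NFS:

```
# Hypothetical submit file; point "log" at a local filesystem,
# not an NFS mount, to avoid remote file-locking on the user log.
universe   = vanilla
executable = my_job
log        = /var/tmp/my_job.log
queue
```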
We need to improve this at some point. Since most users only need
locking across the processes on one machine (i.e. the schedd, shadow,
gridmanager, and dagman all run on the same box), perhaps we could
replace the file lock with a kernel mutex. What do folks think?
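To make the failure mode concrete, here is a minimal sketch (not Condor code) of POSIX record locking of the kind used on the user log. A blocking `fcntl` lock on an NFS mount must talk to the remote lock manager, so if that is wedged the call can hang indefinitely; a non-blocking attempt, as shown, at least fails fast:

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Attempt a non-blocking exclusive lock; return the fd, or None if the lock is held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB issues F_SETLK (non-blocking) rather than F_SETLKW,
        # so a wedged NFS lock manager cannot hang us forever.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None

path = os.path.join(tempfile.mkdtemp(), "user.log")
fd = try_lock(path)
print("locked" if fd is not None else "busy")
```

A machine-local mechanism (e.g. a mutex shared only by the daemons on one box) would sidestep the remote lock manager entirely, which is the thrust of the suggestion above.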
Hope this helps,