Re: [Condor-users] Condor Daemons Fail to run on node



Ian Chesal wrote:
On Thu, Dec 2, 2010 at 1:13 PM, Xenia Fave <xfave2008@xxxxxxxxxx> wrote:

    Do you mean just rebooting the one node or the entire cluster?


Just the one node where Condor won't start.

See the other email from James Burnash about fsck'ing the file system -- in order to do this you'll have to unmount it from *all* your machines.

That only applies if it's mounted on other machines; it looks like everyone has a local /scratch.

As I recall (I haven't seen it in a while), this error can happen when the disk develops too many bad sectors too fast. The filesystem then gets remounted read-only at a lower level than mtab, so mount still shows it as "rw". If that is the case, smartctl and/or dmesg (or /var/log/messages) should have something to say about it. Also, if this is the cause of the problem, don't bother with fsck; replace the disk.
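
A quick way to check for this case is to ask the kernel directly instead of trusting mount's output. Here is a rough Python sketch, assuming a Linux box where /proc/mounts reflects the kernel's current view while /etc/mtab may be stale (on newer systems mtab is often just a symlink into /proc, in which case the two agree). The function names are made up for illustration, and /scratch is only used because it is the mount point mentioned in this thread.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not listed.

    mounts_text is the content of /proc/mounts (or /etc/mtab); both use the
    layout: device mount_point fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry: later mounts shadow earlier ones.
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mount_point):
    """True if the mount table lists mount_point with the 'ro' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "ro" in opts


if __name__ == "__main__":
    # Assumption: /scratch is the suspect filesystem, as in this thread.
    with open("/proc/mounts") as f:
        kernel_view = f.read()
    print("/scratch read-only per kernel:", is_read_only(kernel_view, "/scratch"))
```

If this reports read-only while mount still says "rw", that points at the kernel having flipped the filesystem underneath you, and smartctl and dmesg are the next things to look at.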

Dimitri
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu