Re: [Condor-users] Condor 6.9.2 hung schedd



On Wed, Jun 13, 2007 at 10:50:38AM -0500, Dan Bradley wrote:
> 
> 
> Steffen Grunewald wrote:
> 
> >On Mon, Jun 11, 2007 at 09:51:03AM -0500, Dan Bradley wrote:
> >  
> >
> >>It is normal for the schedd to temporarily show up as the user id of one 
> >>of the users with jobs in the queue, because the schedd switches user 
> >>ids in order to do some operations on the user's behalf.
> >>
> >>However, it is not normal for the schedd to get stuck in this state.  To 
> >>find out what is going on, I would suggest using 'gdb' to see the schedd 
> >>stack when it is in this state.  Example:
> >>
> >>$ gdb -p <pid of schedd>
> >>(gdb) where
> >>...
> >>(gdb) quit
> >>    
> >>
> >
> >(gdb) where
> >#0  0x00002b46f6b2b69a in fcntl () from /lib/libc.so.6
> >#1  0x000000000058ee53 in flock ()
> >#2  0x0000000000665261 in lock_file ()
> >#3  0x000000000060d7e9 in FileLock::obtain ()
> >#4  0x00000000005c6685 in UserLog::writeEvent ()
> >  
> >
> 
> Yes.  It's the file locking problem that Todd referred to.  Is the user 
> log for this user's job on NFS?
> 
> >If I now find out how to remove the bad guys from the queue (I cannot 
> >while condor_schedd hangs, and if there are bad guys, condor_schedd will hang 
> >immediately again)...
> >  
> >
> 
> You could use condor_qedit to change the value of UserLog for the 
> problematic jobs and then remove them.
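
For the condor_qedit route, something like the sketch below might help.
It only prints the two commands for one job so they can be checked
before being run by hand. The function name, the job id and the
replacement log path are placeholders of mine, not anything from this
thread, and I am assuming the usual "condor_qedit <job id> <attribute>
<value>" form, where a string value has to carry its own double quotes.

```shell
# Hypothetical helper: print (do not run) the commands for one job.
#   $1 = job id as cluster.proc, e.g. 1234.0 (placeholder)
#   $2 = new user log path, ideally on local disk (placeholder)
print_fix_commands() {
    job_id="$1"
    new_log="$2"
    # UserLog is a string attribute, so the value is wrapped in double
    # quotes, and those are wrapped in single quotes for the shell.
    printf 'condor_qedit %s UserLog '\''"%s"'\''\n' "$job_id" "$new_log"
    # Once the log no longer points at the problem file, remove the job.
    printf 'condor_rm %s\n' "$job_id"
}
```

Calling "print_fix_commands 1234.0 /tmp/job.log" prints one
condor_qedit line and one condor_rm line to review and paste.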


You can also run lsof on the hung schedd processes to find the offending
file, move it aside, and restart the schedd. That has worked for us in
the past.
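
A rough sketch of the lsof step, in case it is useful. The filter
assumes the default lsof column layout (FD in column 4, TYPE in column
5, NAME starting in column 9) and paths without spaces; the function
name is made up for illustration. It lists every regular file the
process has open for writing, so you still have to pick out the user
log yourself.

```shell
# Hypothetical filter: reads "lsof -p <pid>" output on stdin and prints
# the regular files that are open for writing (FD like 7w or 7u).
writable_regular_files() {
    awk '$5 == "REG" && $4 ~ /^[0-9]+[wu]/ { print $9 }'
}

# Intended use (not run here):
#   lsof -p <pid of schedd> | writable_regular_files
#   mv <offending log> <offending log>.aside
#   ...then restart the schedd however your site normally does it.
```

Read-only descriptors, directories and the header line are skipped, and
for NFS files only the path is printed, not the "(server:/export)" note
that lsof appends to the name.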

-- 
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson