Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Condor 6.9.2 hung schedd

Date: Wed, 13 Jun 2007 08:59:59 -0700
From: Stuart Anderson <anderson@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Condor 6.9.2 hung schedd

On Wed, Jun 13, 2007 at 10:50:38AM -0500, Dan Bradley wrote:
> 
> 
> Steffen Grunewald wrote:
> 
> >On Mon, Jun 11, 2007 at 09:51:03AM -0500, Dan Bradley wrote:
> >  
> >
> >>It is normal for the schedd to temporarily show up as the user id of one 
> >>of the users with jobs in the queue, because the schedd switches user 
> >>ids in order to do some operations on the user's behalf.
> >>
> >>However, it is not normal for the schedd to get stuck in this state.  To 
> >>find out what is going on, I would suggest using 'gdb' to see the schedd 
> >>stack when it is in this state.  Example:
> >>
> >>$ gdb -p <pid of schedd>
> >>(gdb) where
> >>...
> >>(gdb) quit
> >>    
> >>
> >
> >(gdb) where
> >#0  0x00002b46f6b2b69a in fcntl () from /lib/libc.so.6
> >#1  0x000000000058ee53 in flock ()
> >#2  0x0000000000665261 in lock_file ()
> >#3  0x000000000060d7e9 in FileLock::obtain ()
> >#4  0x00000000005c6685 in UserLog::writeEvent ()
> >  
> >
> 
> Yes.  It's the file locking problem that Todd referred to.  Is the user 
> log for this user's job on NFS?
> 
> >If I now find out how to remove the bad guys from the queue (I cannot 
> >while condor_schedd hangs, and if there are bad guys, condor_schedd will hang 
> >immediately again)...
> >  
> >
> 
> You could use condor_qedit to change the value of UserLog for the 
> problematic jobs and then remove them.


You can also run lsof on the hung schedd process to find the offending
file, move it aside, and restart the schedd. That has worked for us in
the past.
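
A rough sketch of the lsof step, in case it helps. The helper name
open_regular_files and the SCHEDD_PID placeholder are made up for
illustration; the only assumption about lsof is its -F field output with
the f (descriptor), t (type), and n (name) fields. The list it prints is
only a set of candidates: the user log that the schedd is blocked on
should be among them, but lsof alone does not say which one holds the lock.

```shell
# Hypothetical helper: read `lsof -F ftn` output on stdin and print the
# names of regular files held open on numeric file descriptors.
# Entries such as cwd, txt, and mem are skipped because their
# descriptor field is not a number.
open_regular_files() {
    awk '
        /^f/ { fd = substr($0, 2); type = "" }   # start of a new file entry
        /^t/ { type = substr($0, 2) }            # file type, e.g. REG, DIR, IPv4
        /^n/ { if (fd ~ /^[0-9]+$/ && type == "REG") print substr($0, 2) }
    '
}

# Usage (SCHEDD_PID is a placeholder for the pid of the hung schedd):
#   lsof -p "$SCHEDD_PID" -F ftn | open_regular_files
```

Once the user log shows up in that list, rename it with mv and restart
the schedd as described above.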

-- 
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson
