Re: [Condor-users] strange schedd problem



On 11/11/07, Maxim kovgan <maxim.kvg@xxxxxxxxx> wrote:
> Hello, list.
>
> We are using "stock" condor 6.8.6 version.
>
> The 2 central managers are a HAD pair:
> one is SL3, the other Debian Etch, each running the correct version for its
> arch and distribution.
>
> We have 2 schedd's.
> One is used for production, another is only a backup.
>
> We are currently experiencing the following problem:
>
> the main schedd has started running as the user who uses the pool
> the most, i.e. the one with the highest value of "Effective Priority"
>
<...>

This has nothing to do with effective priority; that's just a
coincidence.

>
>
> The interesting thing is that in the mailing list archives I've found a
> very similar problem to mine:
> https://lists.cs.wisc.edu/archive/condor-users/2007-January/msg00201.shtml
>
> Unfortunately it refers to yet another similar report, without a
> satisfying solution:
>
> The "solution" was:
> "flushing the spool directory for the central manger"
>
>
> Is there a solution for such symptom, except deleting the spool ?
>
>

Do you have NFS involved somewhere? The schedd temporarily switches
to a different user to write log files or copy data around. If
something blocks it mid-operation, it could get stuck running as that
user while it waits for the operation to finish. Can you shut the
schedd down? If you start it back up, does the problem go away?

You need to tell us more about how you're observing this. Are you
looking at top output, ps output, or the log files? How long has it
been happening?

What does
ps axo cmd,pid,ppid,args,euid,ruid,fuid

print out?
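
On Linux, the same IDs can also be read straight from /proc, which may
be easier to compare than wide ps columns. This is only a rough sketch:
it assumes a Linux /proc filesystem and that pgrep is installed, and the
function name show_uids is made up for illustration.

```shell
# Print the real, effective, saved, and filesystem UIDs of one process,
# taken from the "Uid:" line of /proc/<pid>/status (Linux only).
show_uids() {
    awk '/^Uid:/ { print "ruid=" $2, "euid=" $3, "suid=" $4, "fuid=" $5 }' "/proc/$1/status"
}

# Inspect every running condor_schedd (assumes pgrep is available).
for pid in $(pgrep -x condor_schedd 2>/dev/null); do
    echo "pid=$pid $(show_uids "$pid")"
done
```

If euid or fuid differs from ruid for a long stretch, the process is
probably sitting in one of those temporary user switches.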

Log files would be helpful here. Shut the schedd down, start it back
up, and send or post the log file somewhere. It's important that the
log file covers the time when the schedd starts.

-Erik