Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Condor 6.9.2 hung schedd

Date: Mon, 11 Jun 2007 09:51:03 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Condor 6.9.2 hung schedd

It is normal for the schedd to temporarily show up under the user id of one of the users with jobs in the queue, because the schedd switches user ids in order to perform some operations on that user's behalf.
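To see which identity the schedd currently holds, it can help to print the real and the effective user side by side. This is only a sketch: it assumes a Linux procps-style `ps`, and `show_owner` is a hypothetical helper name, not part of Condor. If the daemon switches only its effective uid, the two columns will differ while it is acting on a user's behalf.

```shell
#!/bin/sh
# Sketch: list the real and effective owner of every process with a given
# command name. Assumes procps "ps" (Linux); show_owner is a made-up helper.
show_owner() {
    # pid   = process id
    # ruser = real user, euser = effective user
    # comm  = command name; the trailing "=" suppresses the column headers
    ps -o pid=,ruser=,euser=,comm= -C "$1"
}

# Example (uncomment to run):
# show_owner condor_schedd
```

Running it a few times in a row should show whether the effective user goes back to condor or stays on the submitting user.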

However, it is not normal for the schedd to get stuck in this state. To find out what is going on, I would suggest using 'gdb' to look at the schedd's stack while it is in this state. Example:


$ gdb -p <pid of schedd>
(gdb) where
...
(gdb) quit
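The same stack can also be captured without an interactive session, which is convenient when the hang shows up unattended. A minimal sketch, assuming gdb is installed and you have permission to attach (usually root); `dump_stack` and the output path are hypothetical names chosen for illustration:

```shell
#!/bin/sh
# Sketch: write the stack of every thread of a running process to a file.
# Assumes gdb is installed; dump_stack is a made-up helper name.
dump_stack() {
    pid="$1"
    out="$2"
    # -p attaches to the process, -batch makes gdb exit after running the
    # -ex commands; on exit gdb detaches and the process keeps running.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}

# Example (uncomment to run; the output path is a placeholder):
# dump_stack "$(pidof condor_schedd)" /tmp/schedd-stack.txt
```

The resulting file can then be attached to a follow-up message on the list.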

--Dan

Steffen Grunewald wrote:

Hi,

for the n-th time, I found one of the schedds in my Condor pool dead.
(The machine is the "pool master" but otherwise all submit machines
= cluster headnodes are configured the same.)

A local condor_q fails (with the usual "Failed to fetch ads" message,
and yes, the port number is the same I checked with netstat -tlp).

The last lines in the SchedLog (kept locally) are:

6/8 22:14:54 (pid:16309) ZKM: setting default map to xxxx@xxxxxxxxxxxxxxxxxx
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/8 23:15:04 (pid:18901) ** /opt/condor/sbin/condor_schedd
6/8 23:15:04 (pid:18901) ** $CondorVersion: 6.9.2 Apr  9 2007 $
6/8 23:15:04 (pid:18901) ** $CondorPlatform: X86_64-LINUX_RHEL3 $
6/8 23:15:04 (pid:18901) ** PID = 18901
6/8 23:15:04 (pid:18901) ** Log last touched 6/8 22:14:54
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) Using config source: /etc/condor/condor_config
6/8 23:15:04 (pid:18901) Using local config sources:
6/8 23:15:04 (pid:18901)    /opt/condor/etc/condor_config.LINUX.X86_64
6/8 23:15:04 (pid:18901)    /home/condor/etc/xxxxxxx.local
6/8 23:15:04 (pid:18901) DaemonCore: Command Socket at <10.100.200.91:51074>
6/8 23:15:04 (pid:18901) History file rotation is enabled.
6/8 23:15:04 (pid:18901)   Maximum history file size is: 10000000 bytes
6/8 23:15:04 (pid:18901)   Number of rotated history files is: 10
6/8 23:15:04 (pid:18901) "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring

condor_restart -subs schedd produces no change in the log or in behaviour.

A look at the process table shows that the corresponding condor_schedd
process is not owned by condor (as on all other submit machines) but by the
user who submitted a job cluster before the problem showed up.

How can this happen (and why doesn't it happen on other machines)?
Is there a config option I overlooked (but all machines are configured the
same way, except the "pool master" so I'd expect this behaviour on all
of them)?
A bug in 6.9.2?

Suggestions welcome - how to proceed, how to gather debugging information...

Ah, temporary fix: kill -6 $(pidof condor_schedd). Thereafter, a new
condor_schedd will start up (and for a short time be owned by the user
who still has some jobs in the queue, then go back to condor as it should
be). I don't know why -6 ... -9 probably would have done the same (in a
brute-force manner)...



Steffen
