
Re: [Condor-users] Condor 6.9.2 hung schedd




It is normal for the schedd to temporarily show up as the user id of one of the users with jobs in the queue, because the schedd switches user ids in order to do some operations on the user's behalf.

However, it is not normal for the schedd to get stuck in this state. To find out what is going on, I would suggest using 'gdb' to see the schedd stack when it is in this state. Example:

$ gdb -p <pid of schedd>
(gdb) where
...
(gdb) quit
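
If the hang is intermittent, it may be easier to capture the stack non-interactively so the output can be attached to a mail. The following is only a sketch: dump_stack and the default output path are hypothetical names, and it assumes a gdb recent enough to support -batch and -ex, plus permission to attach to the process (usually root).

```shell
#!/bin/sh
# Sketch: write the stack of every thread of a running process to a file.
# dump_stack is a hypothetical helper, not part of Condor or gdb.
dump_stack() {
    pid="$1"
    out="${2:-/tmp/stack.$pid.txt}"   # assumed default location

    # Refuse anything that is not a plain numeric pid.
    case "$pid" in
        ''|*[!0-9]*) echo "usage: dump_stack <pid> [outfile]" >&2; return 1 ;;
    esac

    # -batch makes gdb exit once its commands have run;
    # -ex runs the given gdb command after attaching with -p.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}

# Example (not run here):
#   dump_stack "$(pidof condor_schedd)" /tmp/schedd-stack.txt
```

Attaching pauses the process while gdb collects the backtrace, and the process continues once gdb detaches.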

--Dan

Steffen Grunewald wrote:

Hi,

for the n-th time, I found one of the schedds in my Condor pool dead.
(The machine is the "pool master", but otherwise all submit machines,
i.e. the cluster headnodes, are configured the same.)

A local condor_q fails (with the usual "Failed to fetch ads" message,
and yes, the port number matches the one I checked with netstat -tlp).

The last lines in the SchedLog (kept locally) are:

6/8 22:14:54 (pid:16309) ZKM: setting default map to xxxx@xxxxxxxxxxxxxxxxxx
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/8 23:15:04 (pid:18901) ** /opt/condor/sbin/condor_schedd
6/8 23:15:04 (pid:18901) ** $CondorVersion: 6.9.2 Apr  9 2007 $
6/8 23:15:04 (pid:18901) ** $CondorPlatform: X86_64-LINUX_RHEL3 $
6/8 23:15:04 (pid:18901) ** PID = 18901
6/8 23:15:04 (pid:18901) ** Log last touched 6/8 22:14:54
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) Using config source: /etc/condor/condor_config
6/8 23:15:04 (pid:18901) Using local config sources:
6/8 23:15:04 (pid:18901)    /opt/condor/etc/condor_config.LINUX.X86_64
6/8 23:15:04 (pid:18901)    /home/condor/etc/xxxxxxx.local
6/8 23:15:04 (pid:18901) DaemonCore: Command Socket at <10.100.200.91:51074>
6/8 23:15:04 (pid:18901) History file rotation is enabled.
6/8 23:15:04 (pid:18901)   Maximum history file size is: 10000000 bytes
6/8 23:15:04 (pid:18901)   Number of rotated history files is: 10
6/8 23:15:04 (pid:18901) "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring

condor_restart -subs schedd produces no change in the log or in behaviour.

A look at the process table shows that the corresponding condor_schedd
process is not owned by condor (as on all other submit machines) but by the
user who submitted a job cluster before the problem showed up.
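
A check like this can be scripted. The sketch below assumes Linux with /proc mounted; proc_euid is a hypothetical helper name, and "condor" as the expected account name is taken from the description above.

```shell
#!/bin/sh
# Sketch: print the effective uid a process is currently running under.
# Linux only; proc_euid is a hypothetical helper name.
proc_euid() {
    # The "Uid:" line of /proc/<pid>/status lists the real, effective,
    # saved and filesystem uids, in that order; field 3 is the effective uid.
    awk '/^Uid:/ { print $3 }' "/proc/$1/status"
}

# Example (not run here): warn if the schedd is not running as condor.
#   [ "$(proc_euid "$(pidof condor_schedd)")" = "$(id -u condor)" ] \
#       || echo "condor_schedd is running under another uid"
```

Since the schedd switches user ids temporarily, a single mismatch proves little; only a mismatch that persists over repeated checks points to the stuck state.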

How can this happen (and why doesn't it happen on other machines)?
Is there a config option I overlooked? (All machines are configured the
same way, apart from the "pool master" role, so I'd expect this behaviour
on all of them.)
A bug in 6.9.2?

Suggestions welcome - how to proceed, how to gather debugging information...

Ah, a temporary fix: kill -6 $(pidof condor_schedd). After that, a new
condor_schedd starts up (for a short time owned by the user who still has
some jobs in the queue, then going back to condor as it should).
Not sure why -6; -9 would probably have done the same (in a brute-force
manner)...
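
For reference, on Linux signal 6 is SIGABRT, which a process may catch and which by default ends it with a core dump, while signal 9 is SIGKILL, which cannot be caught. A small sketch to look the names up (sig_name is a hypothetical helper around the standard kill -l):

```shell
#!/bin/sh
# Sketch: translate a signal number into its name using the shell's kill -l.
# sig_name is a hypothetical helper name.
sig_name() {
    kill -l "$1"
}

# Example:
#   sig_name 6
#   sig_name 9
```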


Steffen