
[Condor-users] Condor 6.9.2 hung schedd



Hi,

for the n-th time, I have found one of the schedds in my Condor pool dead.
(The machine is the "pool master", but otherwise all submit machines, which
double as cluster head nodes, are configured identically.)

A local condor_q fails with the usual "Failed to fetch ads" message (and
yes, the port it tries to contact is the one the schedd is listening on;
I checked with netstat -tlp).
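For anyone who wants to reproduce that check (netstat needs root to show
process names, the bracketed grep pattern keeps grep from matching its own
command line, and SCHEDD_LOG is the standard config knob for the log path):

   grep 'Command Socket' "$(condor_config_val SCHEDD_LOG)"
   netstat -tlp | grep '[c]ondor_schedd'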

The last lines in the SchedLog (kept locally) are:

6/8 22:14:54 (pid:16309) ZKM: setting default map to xxxx@xxxxxxxxxxxxxxxxxx
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/8 23:15:04 (pid:18901) ** /opt/condor/sbin/condor_schedd
6/8 23:15:04 (pid:18901) ** $CondorVersion: 6.9.2 Apr  9 2007 $
6/8 23:15:04 (pid:18901) ** $CondorPlatform: X86_64-LINUX_RHEL3 $
6/8 23:15:04 (pid:18901) ** PID = 18901
6/8 23:15:04 (pid:18901) ** Log last touched 6/8 22:14:54
6/8 23:15:04 (pid:18901) ******************************************************
6/8 23:15:04 (pid:18901) Using config source: /etc/condor/condor_config
6/8 23:15:04 (pid:18901) Using local config sources:
6/8 23:15:04 (pid:18901)    /opt/condor/etc/condor_config.LINUX.X86_64
6/8 23:15:04 (pid:18901)    /home/condor/etc/xxxxxxx.local
6/8 23:15:04 (pid:18901) DaemonCore: Command Socket at <10.100.200.91:51074>
6/8 23:15:04 (pid:18901) History file rotation is enabled.
6/8 23:15:04 (pid:18901)   Maximum history file size is: 10000000 bytes
6/8 23:15:04 (pid:18901)   Number of rotated history files is: 10
6/8 23:15:04 (pid:18901) "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring

condor_restart -subsystem schedd produces no change, neither in the log
nor in behaviour.
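To see whether the master even acted on the restart request, its own log
should show the attempt (MASTER_LOG being the usual knob for its location):

   tail "$(condor_config_val MASTER_LOG)"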

A look at the process table shows that the corresponding condor_schedd
process is not owned by condor (as it is on all the other submit machines)
but by the user who submitted a job cluster before the problem showed up.
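Something like this should make the ownership and start time visible (the
bracketed pattern again stops grep from matching itself):

   ps -eo user,pid,lstart,cmd | grep '[c]ondor_schedd'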

How can this happen (and why doesn't it happen on the other machines)?
Is there a config option I have overlooked? (All machines are configured
the same way apart from the "pool master" role, so I would expect this
behaviour on all of them.)
Or is it a bug in 6.9.2?

Suggestions are welcome: how to proceed, and how to gather useful
debugging information.
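One thing I plan to try next time before killing the daemon, as a sketch
(assuming gdb and strace are installed and the binary has usable symbols):

   pid=$(pidof condor_schedd)
   gdb -p "$pid" -batch -ex 'thread apply all bt' > /tmp/schedd-bt.txt
   strace -f -tt -p "$pid" -o /tmp/schedd.strace   # interrupt after a few seconds

Raising the verbosity with SCHEDD_DEBUG = D_FULLDEBUG in the local config
(followed by condor_reconfig) might also capture more context around the
next hang.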

Ah, a temporary fix: kill -6 $(pidof condor_schedd). Afterwards a new
condor_schedd starts up (for a short time owned by the user who still has
jobs in the queue, then reverting to condor as it should). I don't know
why I reached for -6; -9 would probably have done the same, just more
brutally.
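(Digging a little: -6 is SIGABRT, which, unlike -9/SIGKILL, lets the
process drop a core file if the core limit allows it, so the hang could be
inspected post mortem. Roughly, with the core file name and location
depending on the kernel's core_pattern:)

   ulimit -c unlimited                        # must be in effect when the daemon starts
   kill -6 "$(pidof condor_schedd)"
   gdb /opt/condor/sbin/condor_schedd core    # adjust the core file name/path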


Steffen


-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html