
[Condor-users] Submit host has stopped working on production system

Dear All,

I've been trying to make a small config change to
our Condor submit host, and it looks as though this
has brought the whole pool down and killed off all
of the jobs running on it, after it had been working
fine for months.

After the condor_master and condor_schedd
were restarted, the schedd seems to have gone
berserk and is taking more than 98% of the CPU.
If I try to run condor_q it just freezes.
condor_status is OK.

The schedd log reports:

6/24 11:51:54 ******************************************************
6/24 11:51:54 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/24 11:51:54 ** /opt1/condor/sbin/condor_schedd
6/24 11:51:54 ** $CondorVersion: 6.6.7 Oct 11 2004 $
6/24 11:51:54 ** $CondorPlatform: SUN4X-SOLARIS29 $
6/24 11:51:54 ** PID = 1121
6/24 11:51:54 ******************************************************
6/24 11:51:54 Using config file: /etc/condor/condor_config
6/24 11:51:54 Using local config files: /opt1/condor/home/condor_config.local
6/24 11:51:54 DaemonCore: Command Socket at <>
6/24 11:51:54 "/opt1/condor/sbin/condor_shadow.v63 -classad" did not produce any output, ignoring

but nothing else. To me, this doesn't look like something that
should be ignored. Any idea what has gone wrong here?

Important aside: how can I stop/start the daemons on the submit host
without killing all the jobs in the pool (some of which may have
been running for days or weeks)?

thanks in advance,


Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department,
Room 4.09, Chadwick Tower
Tel: ++44 (0)151 794 3745
e-mail: i.c.smith@xxxxxxxxxxxxxxx