
Re: [Condor-users] Submit host has stopped working on production system

It's all working again now (phew!). It looks like
the submit host was just overloaded: the schedd
is now taking only about 25% of the CPU, and all
of the jobs seem to be running again.


--On 24 June 2005 12:01 +0100 "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx> wrote:

Dear All,

I've been trying to make a small config change to
our Condor submit host, and it looks as though it
has brought the whole pool down and killed off all
of the jobs running on it, after it had been working
fine for months.

After the condor_master and condor_schedd
were restarted, the schedd seems to have gone
berserk and is taking > 98% of the CPU. If I
try to run condor_q, it just freezes. condor_status
is OK.

The schedd log reports:

6/24 11:51:54 ******************************************************
6/24 11:51:54 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/24 11:51:54 ** /opt1/condor/sbin/condor_schedd
6/24 11:51:54 ** $CondorVersion: 6.6.7 Oct 11 2004 $
6/24 11:51:54 ** $CondorPlatform: SUN4X-SOLARIS29 $
6/24 11:51:54 ** PID = 1121
6/24 11:51:54 ******************************************************
6/24 11:51:54 Using config file: /etc/condor/condor_config
6/24 11:51:54 Using local config files:
6/24 11:51:54 DaemonCore: Command Socket at <>
6/24 11:51:54 "/opt1/condor/sbin/condor_shadow.v63 -classad" did not produce any output, ignoring

but nothing else. To me, this doesn't look like something that
should be ignored! Any idea what has gone wrong here?
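
For anyone wanting to check a schedd log like the one above automatically, here is a minimal sketch in Python. It assumes only the line layout visible in the excerpt (a "M/D HH:MM:SS" prefix, banner lines starting with "**", and wrapped lines carrying no timestamp). The function name parse_schedd_log and the returned dictionary keys are hypothetical, chosen for illustration; nothing here is part of Condor itself.

```python
import re

# One log line as seen in the excerpt: "M/D HH:MM:SS message".
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def parse_schedd_log(text):
    """Split a schedd log excerpt into startup-banner fields and messages.

    Returns (banner, messages). `banner` may hold 'version', 'platform'
    and 'pid'; `messages` is the list of non-banner message strings.
    """
    banner = {}
    messages = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            # No timestamp: assume it continues the previous message
            # (e.g. a line wrapped by the mail client).
            if messages and raw.strip():
                messages[-1] += " " + raw.strip()
            continue
        msg = match.group(3)
        if msg.startswith("**"):
            # Banner line: pick out the fields we recognise, skip the rest.
            version = re.search(r"\$CondorVersion: (.*?) \$", msg)
            platform = re.search(r"\$CondorPlatform: (.*?) \$", msg)
            pid = re.search(r"PID = (\d+)", msg)
            if version:
                banner["version"] = version.group(1)
            if platform:
                banner["platform"] = platform.group(1)
            if pid:
                banner["pid"] = int(pid.group(1))
            continue
        messages.append(msg)
    return banner, messages


def suspicious(messages):
    """Return messages that report a helper producing no output."""
    return [m for m in messages if "did not produce any output" in m]
```

Passing the log text to parse_schedd_log and then the resulting message list to suspicious would single out the condor_shadow.v63 line, which is the only message in the excerpt that hints at a problem.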

Important aside: how can I stop/start the daemons on the submit host
without killing all the jobs in the pool (some of which may have
been running for days or weeks)?

thanks in advance,


Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department,
Room 4.09, Chadwick Tower
Tel: ++44 (0)151 794 3745
e-mail: i.c.smith@xxxxxxxxxxxxxxx
