Re: [HTCondor-users] Diagnosing the Queue

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Christoph,

Thanks for the response! From the MasterLog on the scheduler node, condor_startd appears to be constantly dying and restarting:

[15591333768] Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 14248

[15591333774] DefaultReaper unexpectedly called on pid 14248, status 1024.

[15591333774] The STARTD (pid 14248) exited with status 4

[15591333774] restarting /usr/sbin/condor_startd in 3600 seconds

repeat ad infinitum
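
For what it's worth, the two status numbers in that excerpt look consistent with each other. A minimal sketch, assuming the reaper line prints the raw POSIX wait status (which on Linux carries the exit code in bits 8 to 15); the helper name is only for illustration and is not part of HTCondor:

```python
import re

# The two relevant lines, copied from the MasterLog excerpt above.
LOG = """\
[15591333774] DefaultReaper unexpectedly called on pid 14248, status 1024.
[15591333774] The STARTD (pid 14248) exited with status 4
"""


def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a raw wait status (assumes a normal exit on Linux)."""
    return (status >> 8) & 0xFF


# Pull the raw status from the reaper line and the exit code from the STARTD line.
reaper = re.search(r"called on pid (\d+), status (\d+)", LOG)
exited = re.search(r"\(pid (\d+)\) exited with status (\d+)", LOG)

raw_status = int(reaper.group(2))
exit_code = int(exited.group(2))

# True if both lines describe the same exit of the same process.
same_pid = reaper.group(1) == exited.group(1)
print(same_pid and exit_code_from_wait_status(raw_status) == exit_code)
```

So under that assumption both lines report the same thing: the startd is exiting on its own with code 4 rather than being killed by a signal.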

Do you think that could be attributed to the scheduler DB as you mention?

Date: Tue, 28 May 2019 21:01:50 +0200 (CEST)
From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
Subject: Re: [HTCondor-users] Diagnosing the Queue

Hi Eric,

Removing 31k jobs comes at a price, I guess. We sometimes see similar things: when a lot of 'single' jobs change state, e.g. from idle to held or removed, the scheduler becomes somewhat unresponsive to other tasks. You might want to put the scheduler DB on an SSD, which makes these operations a lot faster, or split the scheduler load across two machines.

Scripted 'condor_q' requests can be a nuisance too, by the way ;)

Best

Christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
