
Re: [HTCondor-users] Diagnosing the Queue



Christoph,

 

Thanks for the response!  From the MasterLog on the scheduler node, condor_startd appears to be constantly dying and restarting:

 

[15591333768] Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 14248

[15591333774] DefaultReaper unexpectedly called on pid 14248, status 1024.

[15591333774] The STARTD (pid 14248) exited with status 4

[15591333774] restarting /usr/sbin/condor_startd in 3600 seconds

.

.

.

repeat ad infinitum
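
For what it's worth, the "status 1024" and "exited with status 4" lines above look like the same event reported twice: on Linux a raw wait status normally carries the exit code in its high byte and the terminating signal in its low bits. A minimal sketch of that decoding, assuming the traditional Linux encoding (the helper name decode_wait_status is made up for illustration and is not part of HTCondor):

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw wait status into exit code and terminating signal.

    Assumes the traditional Linux layout: exit code in bits 8-15,
    terminating signal in the low 7 bits.
    """
    return {
        "exit_code": (status >> 8) & 0xFF,  # high byte: value passed to exit()
        "signal": status & 0x7F,            # low 7 bits: 0 means no signal
    }


# Raw status taken from the MasterLog line above.
info = decode_wait_status(1024)
print(info)  # {'exit_code': 4, 'signal': 0}
```

So the startd seems to be exiting on its own with code 4 rather than being killed by a signal, which suggests the StartLog is the next place to look.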

 

Do you think that could be attributed to the scheduler DB, as you mentioned?

 

 

 


  • Date: Tue, 28 May 2019 21:01:50 +0200 (CEST)
  • From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
  • Subject: Re: [HTCondor-users] Diagnosing the Queue

Hi Eric,

 

The 'remove' of 31k jobs comes at a price, I guess. We sometimes see similar things when a lot of 'single' jobs change state, e.g. from idle to held or removed: the scheduler becomes more or less unresponsive to other tasks. You might want to put the scheduler DB on an SSD, which makes these operations a lot faster, or split the scheduler load across two machines.

 

Scripted 'condor_q' requests can be a nuisance too, by the way ;)

 

Best

Christoph

 


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx