Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] [CondorLIGO] defrag killing schedd?

Date: Thu, 21 Aug 2014 09:48:06 +0200
From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
Subject: Re: [HTCondor-users] [CondorLIGO] defrag killing schedd?

On Wed, Aug 20, 2014 at 02:57:07PM +0200, Steffen Grunewald wrote:
> After running Condor for a couple of weeks on a medium-sized pool,
> I started to receive complaints that users became unable to run their
> jobs.
> It turned out that I had defined partitionable slots but forgot to
> add DEFRAG to the DAEMON_LIST on the CM machine.
> After fixing that, while keeping "draining" disabled (the pool policy
> is to never evict any running job), the large number of "claimed"
> slots went down ...
[...]
> Or is there something else I should watch out for?
> 
> (To me, the manual is a bit vague where DEFRAG should be run; am
> I mistaken assuming that the CM would be enough?)

Answer 1: Running defrag on the CM is sufficient.
Observation 2: There was a second condor_collector process running on
the CM (sort of: it got a new PID every few minutes). After restarting
Condor, that one had vanished - and with it the "communication
errors"... strange. (I had only done a condor_reconfig after adding
DEFRAG to DAEMON_LIST - the extra collector must have started at that
point. But I'm not eager to reproduce that right now :))
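
For readers following along, the setup described above might look roughly like the fragment below in the CM's local configuration. This is only a sketch and not the poster's actual file: the use of DEFRAG_DRAINING_MACHINES_PER_HOUR to keep draining off is an assumption about how "draining disabled" was achieved (0 is also that setting's documented default).

```
# Central manager (CM) only: append the defrag daemon to the existing list.
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Assumption: keep draining off by leaving the drain rate at 0 machines
# per hour, so condor_defrag never asks a machine to drain and no
# running job is evicted on its behalf.
DEFRAG_DRAINING_MACHINES_PER_HOUR = 0
```

Given the observation above, a full restart of Condor on the CM after such a change may be the safer choice than condor_reconfig alone.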

I'm planning to upgrade to 8.2.2 today, so this mystery will hopefully
disappear. (There are others yet to be unveiled.)

- S

References:
- [HTCondor-users] defrag killing schedd?
  - From: Steffen Grunewald
