
Re: [HTCondor-users] [CondorLIGO] defrag killing schedd?

On Wed, Aug 20, 2014 at 02:57:07PM +0200, Steffen Grunewald wrote:
> After running Condor for a couple of weeks on a medium-sized pool,
> I started to receive complaints that users became unable to run their
> jobs.
> It turned out that I had defined partitionable slots but forgot to
> add DEFRAG to the DAEMON_LIST on the CM machine.
> After fixing that, while keeping "draining" disabled (the pool policy
> is to never evict any running job), the large number of "claimed"
> slots went down ...
> Or is there something else I should watch out for?
> (To me, the manual is a bit vague where DEFRAG should be run; am
> I mistaken assuming that the CM would be enough?)

Answer 1: Running defrag on the CM is sufficient.
Observation 2: A second condor_collector process was running on the CM
(sort of: it came back with a new PID every few minutes). After
restarting Condor, that one had vanished, and with it the
"communication errors"... strange. (I had only done a condor_reconfig
after adding DEFRAG to DAEMON_LIST, so it must have started at that
point. But I'm not eager to reproduce that right now :))
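
For reference, a minimal sketch of what the CM-side configuration
described above might look like. The knob names are standard HTCondor
ones, but the value shown is an assumption for illustration, not
necessarily the setting used on this pool:

    # On the central manager only: have the master start condor_defrag
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG

    # Assumed way of keeping draining disabled: drain zero machines
    # per hour (as far as I know, 0 is also the default)
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 0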

I'm planning to upgrade to 8.2.2 today, so this mystery will hopefully
disappear. (There are others yet to be unveiled.)

- S