
[HTCondor-users] defrag killing schedd?



After running Condor for a couple of weeks on a medium-sized pool,
I started to receive complaints that users became unable to run their
jobs.
It turned out that I had defined partitionable slots but had forgotten
to add DEFRAG to the DAEMON_LIST on the CM machine.
After fixing that, while keeping "draining" disabled (the pool policy
is to never evict any running job), the large number of "claimed"
slots went down, but I started to see messages like these on the
second submit machine:

14-08-20_14:43:10 (pid:7573) WARNING: claim id not found for new dynamic slot slot1_4@xxxxxxxxxx -- ignoring this resource
14-08-20_14:43:10 (pid:7573) WARNING: claim id not found for new dynamic slot slot1_4@xxxxxxxxxx -- ignoring this resource
14-08-20_14:43:14 (pid:7573) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
14-08-20_14:43:14 (pid:7573) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-08-20_14:43:14 (pid:7573) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
14-08-20_14:43:14 (pid:7573) Sent ad to central manager for user1@pool
14-08-20_14:43:14 (pid:7573) Sent ad to 1 collectors for user1@pool
14-08-20_14:43:14 (pid:7573) Sent ad to central manager for user2@pool
14-08-20_14:43:14 (pid:7573) Sent ad to 1 collectors for user2@pool
14-08-20_14:43:14 (pid:7573) Delaying scheduling of parallel jobs because startd query time is long (1) seconds

... and shortly thereafter, the schedd had vanished (without any
additional message).

I still tend to blame this on
# condor_version
$CondorVersion: 8.1.5 Apr 03 2014 BuildID: 241118 $
$CondorPlatform: x86_64_Debian7 $

but I haven't had the opportunity to upgrade to 8.2.2 (or 8.3.0)
yet. Is this a side effect of the new kernel and the issues that
have been discussed before?
# uname -a
Linux submit2.pool 3.11-2-amd64 #1 SMP Debian 3.11.8-1 (2013-11-13) x86_64 GNU/Linux

Or is there something else I should watch out for?

(To me, the manual is a bit vague about where DEFRAG should be run;
am I mistaken in assuming that running it on the CM is enough?)
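
For concreteness, the CM-side change described above amounts to
roughly the following. This is only a sketch, not a copy of the
running configuration: the use of DEFRAG_DRAINING_MACHINES_PER_HOUR
to keep draining switched off is an assumption about how "draining
disabled" would be expressed.

```
# Local config on the central manager (sketch, values are assumptions)

# Start condor_defrag in addition to the daemons already listed
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Assumed way of keeping draining off: a rate of 0 machines per hour
# means condor_defrag never asks a startd to drain, so no running
# job is evicted
DEFRAG_DRAINING_MACHINES_PER_HOUR = 0
```

After a condor_reconfig on the CM, the condor_master there should
start the additional daemon.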

- S

--
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}