
[HTCondor-users] Collector killed by OOM killer



Hi,

We had a problem today where the collector on one of our 2 central managers began using so much memory that the machine started swapping and the CPU load average reached almost 20. The collector was also killed twice by the OOM killer:

Jul  3 13:43:51 condor01 kernel: condor_collecto invoked oom-killer
Jul  3 13:55:59 condor01 kernel: condor_collecto invoked oom-killer

At the same time the other central manager had a high CPU load, but didn't get to the point of anything being killed.

It seemed to be triggered by rebooting around 10 worker nodes. In the CollectorLog (for the collector that was killed) the number of active workers suddenly increased to the maximum of 16 (normally there seem to be at most 1 or 2):

07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 11
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 Number of Active Workers 13
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 12
07/03/14 13:36:53 Number of Active Workers 14
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 13
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 (Sending 10275 ads in response to query)
07/03/14 13:36:53 Number of Active Workers 15
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 Number of Active Workers 14
07/03/14 13:36:53 Number of Active Workers 16
07/03/14 13:36:53 Got QUERY_STARTD_ADS
07/03/14 13:36:53 ForkWork: not forking because reached max workers 16
07/03/14 13:36:53 Number of Active Workers 16
07/03/14 13:36:53 Number of Active Workers 15
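
The "reached max workers 16" message presumably corresponds to the collector's limit on forked query workers. My understanding (from the manual -- please correct me if I'm wrong) is that this is controlled by COLLECTOR_QUERY_WORKERS, which we have not changed, so something like this should be in effect:

    # condor_config on the central managers (sketch only; we have not
    # set this knob explicitly, so 16 is presumably the default here)
    COLLECTOR_QUERY_WORKERS = 16

Since each forked worker answering a ~10275-ad query is a copy-on-write child of the collector, having all 16 busy at once seems like it could plausibly account for the memory blow-up -- but that's only a guess on my part.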

There was then a 20-minute gap in the CollectorLog. After the collector had been killed twice by the OOM killer, there were failed condor_write attempts for the worker nodes which were down:

07/03/14 13:56:35 Buf::write(): condor_write() failed
07/03/14 13:56:35 Error sending query result to client -- aborting
07/03/14 13:56:35 condor_write(): Socket closed when trying to write 4096 bytes to <aaa.bbb.ccc.ddd:48596>, fd is 7

Then it seemed that every single ClassAd was removed:

07/03/14 13:56:52 Housekeeper:  Ready to clean old ads
07/03/14 13:56:52       Cleaning StartdAds ...
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
... (many, many similar lines)
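
My guess is that, with the collector stalled and swapping for ~20 minutes, no startd updates were being processed, so by the time the housekeeper next ran every ad had exceeded its lifetime. If I understand the manual correctly, that lifetime is CLASSAD_LIFETIME, with a default of 900 seconds -- i.e. 15 minutes, less than the gap we saw:

    # condor_config sketch; 900 is my understanding of the default,
    # we have not set this explicitly
    CLASSAD_LIFETIME = 900

That would explain why every single ad was considered stale at once, though I'd welcome confirmation.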

The negotiator had trouble contacting both collectors after this (*), and both were blacklisted. Things eventually returned to normal.

Does anyone know what happened? We are using HTCondor 8.0.6. I can provide full log files off-list if necessary.

Many Thanks,
Andrew.

(*)
07/03/14 13:56:24 ---------- Started Negotiation Cycle ----------
07/03/14 13:56:24 Phase 1:  Obtaining ads from collector ...
07/03/14 13:56:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
07/03/14 13:56:24   Getting startd private ads ...
07/03/14 13:57:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
07/03/14 13:57:24 IO: Failed to read packet header
07/03/14 13:57:24 Will avoid querying collector condor01.domain <a.b.c.d:9618> for 3540s if an alternative succeeds.
07/03/14 13:58:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
07/03/14 13:58:24 IO: Failed to read packet header
07/03/14 13:58:24 Will avoid querying collector condor02.domain <a.b.c.d:9618> for 3541s if an alternative succeeds.
07/03/14 13:58:24 Couldn't fetch ads: communication error
07/03/14 13:58:24 Aborting negotiation cycle
07/03/14 13:58:24 ---------- Started Negotiation Cycle ----------
07/03/14 13:58:24 Phase 1:  Obtaining ads from collector ...
07/03/14 13:58:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
07/03/14 13:58:24   Getting startd private ads ...
07/03/14 13:58:24 Collector condor01.domain blacklisted; skipping
07/03/14 13:58:24 Collector condor02.domain blacklisted; skipping
07/03/14 13:58:24 Couldn't fetch ads: communication error
07/03/14 13:58:24 Aborting negotiation cycle
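
The "for 3540s" / "for 3541s" avoidance times above look consistent with the negotiator's dead-collector avoidance, which I believe is capped by DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (default 3600 seconds, per my reading of the manual):

    # condor_config sketch on the negotiator host; knob name and the
    # 3600-second default are my reading of the docs, not something we set
    DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 3600

If that's right, the reported 3540s would just be 3600 minus the time already spent in the failed query -- again, a guess.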

