
Re: [HTCondor-users] Collector killed by OOM killer



Hi Andrew,

I'm stumped too.

TJ - is there anything "unique" to PERSISTENT_CONFIG_DIR that might be causing an issue on reconfig?

Brian

On Jul 4, 2014, at 8:00 AM, andrew.lahiff@xxxxxxxxxx wrote:

> Hi,
> 
> The same problem happened a few times this morning, but we've been able to narrow down what seems to be causing this. We have an attribute StartJobs included in STARTD_ATTRS, and our START expression contains "(StartJobs =?= True)". We also have:
> 
> STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
> ENABLE_PERSISTENT_CONFIG = TRUE
> PERSISTENT_CONFIG_DIR = /etc/condor/ral
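> 
> For completeness, the startd side is wired up roughly like this (simplified; our real START expression contains more terms than just StartJobs):
> 
> StartJobs = True
> STARTD_ATTRS = $(STARTD_ATTRS) StartJobs
> START = ($(START)) && (StartJobs =?= True)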
> 
> The persistent config settings are there so that we can use condor_config_val to change the value of StartJobs. We've found that:
> 
> * changing the value of StartJobs in the config file and running condor_reconfig for lots of worker nodes at the same time (e.g. 60) is fine
> 
> * changing the value of StartJobs using condor_config_val and running condor_reconfig for lots of worker nodes at the same time is also fine
> 
> * someone wrote a 'clever' script which, instead of running condor_config_val, just writes the appropriate files into PERSISTENT_CONFIG_DIR and then runs condor_reconfig. When this is run for many worker nodes at the same time, it puts an enormous load on the collectors (high memory usage, CPU load and I/O wait) and causes lots of communication problems, e.g.
> 
> 07/04/14 13:34:45 condor_write(): Socket closed when trying to write 294 bytes to <aaa.bbb.ccc.ddd.eee:48342>, fd is 6
> 07/04/14 13:34:45 Buf::write(): condor_write() failed
> 07/04/14 13:34:45 SECMAN: Error sending response classad to <aaa.bbb.ccc.ddd.eee:48342>!
> 
> So it seems that it's our own fault, and of course we'll stop using this script :-) I'm still curious, though, why doing this puts such a high load on the collectors...
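> 
> For reference, the supported route we'll go back to is roughly the following, run per worker node (the hostname and value below are just examples):
> 
> condor_config_val -name wn0001.example.com -startd -set "StartJobs = False"
> condor_reconfig -startd -name wn0001.example.com
> 
> (As I understand it, the -set value is written into PERSISTENT_CONFIG_DIR by the startd itself, rather than by us.)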
> 
> Regards,
> Andrew.
> 
> ________________________________________
> From: andrew.lahiff@xxxxxxxxxx [andrew.lahiff@xxxxxxxxxx]
> Sent: Thursday, July 03, 2014 7:44 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] Collector killed by OOM killer
> 
> Hi,
> 
> We had a problem today where the collector on one of our two central managers used so much memory that the machine started swapping and the CPU load average almost reached 20. The collector was also killed twice by the OOM killer:
> 
> Jul  3 13:43:51 condor01 kernel: condor_collecto invoked oom-killer
> Jul  3 13:55:59 condor01 kernel: condor_collecto invoked oom-killer
> 
> At the same time the other central manager had a high CPU load but didn’t get to the point of anything being killed.
> 
> It seemed to be triggered by rebooting around 10 or so worker nodes. In the CollectorLog of the collector that was killed, the number of active workers suddenly increased to the maximum of 16 (normally there seem to be at most 1 or 2):
> 
> 07/03/14 13:36:53 Got QUERY_STARTD_ADS
> 07/03/14 13:36:53 Number of Active Workers 11
> 07/03/14 13:36:53 (Sending 10275 ads in response to query)
> 07/03/14 13:36:53 (Sending 10275 ads in response to query)
> 07/03/14 13:36:53 Number of Active Workers 13
> 07/03/14 13:36:53 Got QUERY_STARTD_ADS
> 07/03/14 13:36:53 Number of Active Workers 12
> 07/03/14 13:36:53 Number of Active Workers 14
> 07/03/14 13:36:53 Got QUERY_STARTD_ADS
> 07/03/14 13:36:53 Number of Active Workers 13
> 07/03/14 13:36:53 (Sending 10275 ads in response to query)
> 07/03/14 13:36:53 (Sending 10275 ads in response to query)
> 07/03/14 13:36:53 Number of Active Workers 15
> 07/03/14 13:36:53 Got QUERY_STARTD_ADS
> 07/03/14 13:36:53 Number of Active Workers 14
> 07/03/14 13:36:53 Number of Active Workers 16
> 07/03/14 13:36:53 Got QUERY_STARTD_ADS
> 07/03/14 13:36:53 ForkWork: not forking because reached max workers 16
> 07/03/14 13:36:53 Number of Active Workers 16
> 07/03/14 13:36:53 Number of Active Workers 15
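> 
> (For context, I believe the 16-worker cap in the messages above is the collector's limit on forked query workers, i.e. the effective value of COLLECTOR_QUERY_WORKERS, which here is evidently
> 
> COLLECTOR_QUERY_WORKERS = 16
> 
> so each of those queries was being handled by a separately forked worker at the same time.)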
> 
> There was then a 20-minute gap in the CollectorLog. After the collector had been killed twice by the OOM killer, there were failed condor_write attempts for the worker nodes which were down:
> 
> 07/03/14 13:56:35 Buf::write(): condor_write() failed
> 07/03/14 13:56:35 Error sending query result to client -- aborting
> 07/03/14 13:56:35 condor_write(): Socket closed when trying to write 4096 bytes to <aaa.bbb.ccc.ddd:48596>, fd is 7
> 
> Then it seemed that every single ClassAd was removed:
> 
> 07/03/14 13:56:52 Housekeeper:  Ready to clean old ads
> 07/03/14 13:56:52       Cleaning StartdAds ...
> 07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
> 07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
> 07/03/14 13:56:52               **** Removing stale ad: "< slot1@xxxxxxxxxxxxxx , a.b.c.d >"
> ... (many, many similar lines)
> 
> The negotiator had trouble contacting both collectors after this (*), and they were both blacklisted. Things eventually returned to normal.
> 
> Does anyone know what happened? We are using HTCondor 8.0.6. I can provide full log files off-list if necessary.
> 
> Many Thanks,
> Andrew.
> 
> (*)
> 07/03/14 13:56:24 ---------- Started Negotiation Cycle ----------
> 07/03/14 13:56:24 Phase 1:  Obtaining ads from collector ...
> 07/03/14 13:56:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
> 07/03/14 13:56:24   Getting startd private ads ...
> 07/03/14 13:57:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
> 07/03/14 13:57:24 IO: Failed to read packet header
> 07/03/14 13:57:24 Will avoid querying collector condor01.domain <a.b.c.d:9618> for 3540s if an alternative succeeds.
> 07/03/14 13:58:24 condor_read(): timeout reading 21 bytes from collector at <a.b.c.d:9618>.
> 07/03/14 13:58:24 IO: Failed to read packet header
> 07/03/14 13:58:24 Will avoid querying collector condor02.domain <a.b.c.d:9618> for 3541s if an alternative succeeds.
> 07/03/14 13:58:24 Couldn't fetch ads: communication error
> 07/03/14 13:58:24 Aborting negotiation cycle
> 07/03/14 13:58:24 ---------- Started Negotiation Cycle ----------
> 07/03/14 13:58:24 Phase 1:  Obtaining ads from collector ...
> 07/03/14 13:58:24 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
> 07/03/14 13:58:24   Getting startd private ads ...
> 07/03/14 13:58:24 Collector condor01.domain blacklisted; skipping
> 07/03/14 13:58:24 Collector condor02.domain blacklisted; skipping
> 07/03/14 13:58:24 Couldn't fetch ads: communication error
> 07/03/14 13:58:24 Aborting negotiation cycle
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/