
[HTCondor-users] Unresponsive to condor_status and/or condor_q



Starting with HTCondor 8.4.4 (we previously ran 8.2.10, which did not exhibit this issue) and continuing in 8.4.6 (we skipped 8.4.5), we have an odd situation. Every 5 minutes, from 2 separate hosts (but not on the same minute), I run a condor_status and condor_q pair to dump information from our gatekeepers. The main gatekeeper typically has ~4200 jobs running on ~7600 cores (a mix of single-core and multi-core jobs). After approximately 1 week, that gatekeeper begins to have problems responding to these queries, whether they come from itself or from the other host. See the attached image, and note the spikes driving down towards zero. This can be resolved by a "service condor restart" on the main gatekeeper, until another week or so passes, at which point the problem reasserts itself.

Has anyone else seen this issue? Any suggestions? It seems perhaps like a memory leak, or something similar.

The gatekeeper is a VM with 16GB of RAM, 4 cores, and access to a shared pair of 10Gb NICs. There was no noticeable change in Ganglia load_one around the time of the HTCondor restart, nor did any other metric seem "off".
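
For anyone wanting to reproduce the measurement, the periodic condor_status / condor_q check described above could be instrumented roughly as follows. This is only a minimal sketch, not the script actually in use here: the helper name timed_run and the 60-second timeout are assumptions made for illustration, and the two commands are run with no arguments. It records how long each query takes and whether it completed, so that a slow drift over the week becomes visible before the queries stop answering altogether.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s=60.0):
    """Run cmd and return (elapsed_seconds, status).

    status is one of "ok", "failed" (non-zero exit), "timeout",
    or "missing" (executable not found). timed_run is a hypothetical
    helper, not part of HTCondor.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return time.monotonic() - start, status


if __name__ == "__main__":
    # Assumption: both tools are on PATH and query the local pool by default.
    for cmd in (["condor_status"], ["condor_q"]):
        elapsed, status = timed_run(cmd)
        line = "%s %s %.2fs %s" % (
            time.strftime("%Y-%m-%d %H:%M:%S"), cmd[0], elapsed, status)
        # Anything other than a clean run goes to stderr so cron can mail it.
        print(line, file=sys.stdout if status == "ok" else sys.stderr)
```

Run from cron every 5 minutes with the output appended to a file, this would give a per-query latency history that can be lined up against the restart times.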

Thanks,
bob