
Re: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit



Hi John,

 

Thanks for the detailed response!

 

It looks like we will move to a Linux-based CM indeed, but the suggestion re: child collectors sounds very promising and will hopefully at least tide us over until the migration is complete.

 

Is there a rough order of magnitude for how many connections a Linux-based CM can support?

 

Kind regards,

 

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

 

From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: 01 June 2022 16:11
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit

 

This is a known limitation of the Windows collector: it has a maximum of 1014 connections (1024-10).

By default, each execute node will use 2 connections,  one for the condor_master daemon, and one for the condor_startd daemon.  

 

You can work around this by adding more collectors. In the simplest case, you can send the condor_master ads to a different collector than the condor_startd ads. Each collector then sees only one connection per execute node, which will allow your pool to grow to about 1000 execute nodes.
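
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function name and the constant names are hypothetical and used only for illustration; the figures (1024, 10, and 2 connections per execute node) are the ones quoted in this message.

```python
# Figures taken from the message above; the names are illustrative only.
FD_SET_SIZE = 1024        # the "1024" in (1024-10)
RESERVED = 10             # the "10" in (1024-10)
CONNECTION_LIMIT = FD_SET_SIZE - RESERVED  # 1014


def max_execute_nodes(connections_per_node, limit=CONNECTION_LIMIT):
    """Upper bound on execute nodes one collector can accept,
    given how many connections each node opens to that collector."""
    return limit // connections_per_node


# Default: condor_master and condor_startd both report to the same collector.
default_capacity = max_execute_nodes(2)

# Split: master ads and startd ads go to different collectors,
# so each collector sees one connection per node.
split_capacity = max_execute_nodes(1)

print(default_capacity, split_capacity)
```

Under these figures the default setup tops out at roughly 500 execute nodes, which is consistent with the pool size at which the crashes were reported, and the split setup at roughly 1000.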

 

The more general case is a tree of collectors, with child collectors forwarding ads to a top-level collector. There are instructions on how to configure that here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors
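
For sizing such a tree, a back-of-the-envelope sketch in Python might look like the following. It assumes (this is not stated above) that every child collector runs on Windows with the same 1014-connection limit, that execute nodes are spread evenly across the children, and that each node keeps 2 connections to its child collector. The function name is hypothetical.

```python
import math

CONNECTION_LIMIT = 1014   # Windows collector limit quoted above
CONNECTIONS_PER_NODE = 2  # condor_master + condor_startd


def child_collectors_needed(execute_nodes,
                            per_node=CONNECTIONS_PER_NODE,
                            limit=CONNECTION_LIMIT):
    """Estimate how many child collectors are needed so that no single
    child exceeds the connection limit (assumes an even spread of nodes)."""
    total_connections = execute_nodes * per_node
    return math.ceil(total_connections / limit)


print(child_collectors_needed(500))
print(child_collectors_needed(2000))
```

This is only an estimate of the lower bound; in practice some headroom per child collector would be sensible, since tools and other daemons also open connections.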

 

Or you can switch to using a Linux collector/negotiator, which has a much higher connection limit.

 

The

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peet Whittaker
Sent: Tuesday, May 31, 2022 2:53 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit

 

Hi,

 

We’re running a vanilla universe Condor pool on AWS that automatically scales up and down based on the job queue.

 

The pool consists of a Windows-based central manager (running the schedd, collector, negotiator and credd) and Windows-based execute nodes.

 

Generally everything works well. However, once the number of nodes exceeds ~500 (~3000 slots), the collector daemon starts repeatedly crashing every 10 mins (it’s quite regular).

 

...

04/28/22 21:34:00 Got QUERY_SCHEDD_ADS

04/28/22 21:34:00 (Sending 1 ads in response to query)

04/28/22 21:34:00 Query info: matched=1; skipped=0; query_time=0.000041; send_time=0.000105; type=Scheduler; requirements={((stricmp(Name,"ABC.XYZ.com") == 0))}; locate=1; limit=0; from=TOOL; peer=<10.0.0.252:51634>; projection={MyAddress AddressV1 CondorVersion CondorPlatform Name Machine}

04/28/22 21:34:01 MasterAd     : Inserting ** "< EC2AMAZ-IO96AHI.XYZ.com >"

04/28/22 21:34:01 WARNING: cannot register TCP update socket from <10.1.1.238:50279>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5364

04/28/22 21:34:12 StartdAd     : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 StartdPvtAd  : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 WARNING: cannot register TCP update socket from <10.1.1.238:50291>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5220

04/28/22 21:34:20 MasterAd     : Inserting ** "< EC2AMAZ-KOU1A4V.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.4.192:59370>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1015,  fd 5368

04/28/22 21:34:20 MasterAd     : Inserting ** "< EC2AMAZ-G6I727N.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.0.50:56786>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1016,  fd 5348

04/28/22 21:34:20 ERROR "Selector::add_fd(): read fd_set is full" at line 261 in file C:\condor\execute\dir_6408\sources\src\condor_utils\selector.cpp

04/28/22 21:34:30 ******************************************************

04/28/22 21:34:30 ** condor_collector.exe (CONDOR_COLLECTOR) STARTING UP

...
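
To see how close the collector is to the limit over time, the warnings in a log like the one above can be extracted with a short script. This is a minimal Python sketch; the regular expression is written against the exact wording of the WARNING lines shown in this excerpt, and the function name is hypothetical.

```python
import re

# Matches the "file descriptor safety level exceeded" warnings shown above.
FD_WARNING_RE = re.compile(
    r"file descriptor safety level exceeded:\s+"
    r"limit (\d+),\s+registered socket count (\d+),\s+fd (\d+)"
)


def parse_fd_warnings(lines):
    """Return one dict per warning line with the limit, the registered
    socket count, and the fd number, all as integers."""
    results = []
    for line in lines:
        match = FD_WARNING_RE.search(line)
        if match:
            limit, registered, fd = (int(g) for g in match.groups())
            results.append({"limit": limit, "registered": registered, "fd": fd})
    return results


sample = [
    '04/28/22 21:34:20 MasterAd     : Inserting ** "< EC2AMAZ-KOU1A4V.XYZ.com >"',
    "04/28/22 21:34:20 WARNING: cannot register TCP update socket from "
    "<10.1.4.192:59370>: file descriptor safety level exceeded:  "
    "limit 1014,  registered socket count 1015,  fd 5368",
]
warnings = parse_fd_warnings(sample)
print(warnings)
```

Lines that are not warnings are skipped, so the whole collector log can be passed in as-is.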

 

Restarting the central manager doesn’t help. The central manager also doesn’t seem to be under any particular memory or CPU pressure.

 

Any pointers/ideas on how to fix this would be greatly appreciated!

 

Relevant Condor version info:

 

$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 $

$CondorPlatform: x86_64_Windows10 $

 

Kind regards,

 

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

 

JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD. Telephone: +441756699500

