
[HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit



Hi,

 

We’re running a vanilla universe Condor pool on AWS that automatically scales up and down based on the job queue.

 

The pool consists of a Windows-based central manager (running the schedd, collector, negotiator and credd) and Windows-based execute nodes.

 

Generally everything works well. However, once the pool exceeds ~500 nodes (~3000 slots), the collector daemon starts crashing roughly every 10 minutes (the interval is quite regular).

 

...

04/28/22 21:34:00 Got QUERY_SCHEDD_ADS

04/28/22 21:34:00 (Sending 1 ads in response to query)

04/28/22 21:34:00 Query info: matched=1; skipped=0; query_time=0.000041; send_time=0.000105; type=Scheduler; requirements={((stricmp(Name,"ABC.XYZ.com") == 0))}; locate=1; limit=0; from=TOOL; peer=<10.0.0.252:51634>; projection={MyAddress AddressV1 CondorVersion CondorPlatform Name Machine}

04/28/22 21:34:01 MasterAd     : Inserting ** "< EC2AMAZ-IO96AHI.XYZ.com >"

04/28/22 21:34:01 WARNING: cannot register TCP update socket from <10.1.1.238:50279>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5364

04/28/22 21:34:12 StartdAd     : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 StartdPvtAd  : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 WARNING: cannot register TCP update socket from <10.1.1.238:50291>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5220

04/28/22 21:34:20 MasterAd     : Inserting ** "< EC2AMAZ-KOU1A4V.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.4.192:59370>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1015,  fd 5368

04/28/22 21:34:20 MasterAd     : Inserting ** "< EC2AMAZ-G6I727N.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.0.50:56786>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1016,  fd 5348

04/28/22 21:34:20 ERROR "Selector::add_fd(): read fd_set is full" at line 261 in file C:\condor\execute\dir_6408\sources\src\condor_utils\selector.cpp

04/28/22 21:34:30 ******************************************************

04/28/22 21:34:30 ** condor_collector.exe (CONDOR_COLLECTOR) STARTING UP

...
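
To quantify how often the collector hits the limit before each crash, the warnings can be tallied from the log. The following is only a minimal sketch: the regular expression is modelled on the WARNING lines in the excerpt above, and the function name summarize_fd_warnings and the SAMPLE list are hypothetical, not part of HTCondor.

```python
import re
from collections import Counter

# Pattern modelled on the WARNING lines in the log excerpt above
# (assumption: other HTCondor versions may word this message differently).
WARNING_RE = re.compile(
    r"cannot register TCP update socket from <(?P<ip>[\d.]+):(?P<port>\d+)>: "
    r"file descriptor safety level exceeded:\s+limit (?P<limit>\d+),\s+"
    r"registered socket count (?P<count>\d+),\s+fd (?P<fd>\d+)"
)


def summarize_fd_warnings(lines):
    """Tally fd-limit warnings per peer IP and track the highest socket count seen."""
    per_peer = Counter()
    limit = None
    max_registered = 0
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue  # not an fd-limit warning
        per_peer[match["ip"]] += 1
        limit = int(match["limit"])
        max_registered = max(max_registered, int(match["count"]))
    return {
        "limit": limit,
        "max_registered": max_registered,
        "warnings_per_peer": dict(per_peer),
    }


# Hypothetical sample: the four WARNING lines copied from the excerpt above.
SAMPLE = [
    "04/28/22 21:34:01 WARNING: cannot register TCP update socket from <10.1.1.238:50279>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5364",
    "04/28/22 21:34:12 WARNING: cannot register TCP update socket from <10.1.1.238:50291>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1014,  fd 5220",
    "04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.4.192:59370>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1015,  fd 5368",
    "04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.0.50:56786>: file descriptor safety level exceeded:  limit 1014,  registered socket count 1016,  fd 5348",
]

summary = summarize_fd_warnings(SAMPLE)
print(summary)
```

In practice the function could be fed the lines of the full collector log (for example via open(path) with the real log path) instead of SAMPLE; the excerpt already shows the registered socket count climbing past the limit of 1014 immediately before the fd_set error.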

 

Restarting the central manager doesn’t help. The central manager also doesn’t seem to be under any particular memory or CPU pressure.

 

Any pointers/ideas on how to fix this would be greatly appreciated!

 

Relevant Condor version info:

 

$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 $

$CondorPlatform: x86_64_Windows10 $

 

Kind regards,

 

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

 

JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD. Telephone: +441756699500
