
[Condor-users] condor_master stuck in DaemonCore Timers loop



The condor_master daemon on the Central Manager machine in my cluster was continuously using 25% of the CPU, so I turned logging up to D_ALL to see what was going on.  With that setting, I get the following message over and over (it filled 40 MB of logs in about 20 seconds):

 

10/10 12:43:20 (fd:15) (pid:2345) In DaemonCore Timeout()
10/10 12:43:20 (fd:15) (pid:2345)
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> Timers
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> ~~~~~~
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 7, when = 1160502226, period = 300, handler_descrip=<Daemons::UpdateCollector()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 0, when = 1160502234, period = 300, handler_descrip=<check_session_cache>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 9, when = 1160502239, period = 300, handler_descrip=<Daemons::CheckForNewExecutable()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 4, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 5, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 6, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 2, when = 1160502294, period = 240, handler_descrip=<self_monitor>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 3, when = 1160502534, period = 0, handler_descrip=<DaemonCore::ReInit()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 1, when = 1160503142, period = 1801, handler_descrip=<handle_cookie_refresh>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 10, when = 1160505527, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 11, when = 1160505527, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 12, when = 1160505536, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 8, when = 1160534934, period = 86400, handler_descrip=<run_preen()>
10/10 12:43:20 (fd:15) (pid:2345)
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore Timeout() Complete, returning 26

 

 

The return value (26 in the dump above) seems to slowly go up from one repetition to the next, but everything else stays the same.
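
To compare repetitions of the dump, it may help to pull the timer entries out of the log and express each "when" field as an offset from a reference time. The following is only a minimal Python sketch, not anything shipped with Condor: the names parse_timers and seconds_until_next are made up for illustration, the "when" values are assumed to be Unix epoch seconds, and the value returned by Timeout() is assumed (unconfirmed) to be the number of seconds until the earliest timer is due.

```python
import re

# Matches the per-timer lines of the D_ALL dump, for example:
#   ... DaemonCore--> id = 7, when = 1160502226, period = 300, handler_descrip=<...>
TIMER_RE = re.compile(
    r"id = (?P<id>\d+), when = (?P<when>\d+), period = (?P<period>\d+), "
    r"handler_descrip=<(?P<handler>[^>]*)>"
)


def parse_timers(log_text):
    """Return one dict per timer line found in the log text (hypothetical helper)."""
    timers = []
    for line in log_text.splitlines():
        m = TIMER_RE.search(line)
        if m:
            timers.append({
                "id": int(m["id"]),
                "when": int(m["when"]),      # assumed to be Unix epoch seconds
                "period": int(m["period"]),
                "handler": m["handler"],
            })
    return timers


def seconds_until_next(timers, now):
    """Seconds from `now` (epoch seconds) until the earliest timer is due."""
    return min(t["when"] for t in timers) - now


if __name__ == "__main__":
    import sys
    import time

    # Usage: python timers.py MasterLog   (the log file name is an assumption)
    with open(sys.argv[1]) as f:
        timers = parse_timers(f.read())
    now = int(time.time())
    for t in sorted(timers, key=lambda t: t["when"]):
        print(t["id"], t["handler"], t["when"] - now)
```

Under those assumptions, a return value of 26 would put the daemon's clock at 1160502200 when the dump was written, since the earliest timer (id 7) is due at 1160502226.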

 

A Google search for "HungChildTimeout" or "DaemonCore Timers" turned up nothing, so I'm hoping someone on this list can offer some insight.

 

Thanks a lot

-Colin
