
Re: [HTCondor-users] Automatic restart of HTCondor service on central point



On Oct 7, 2013, at 4:09 AM, daniel popu <dpopu@xxxxxxxxx> wrote:

I have a pool of computers with HTCondor 8.0.2.
As you know, it is critical to have HTCondor working on the central point.
 
The problem is that Condor quite often gets blocked, and the central point then goes down.
 
For previous versions I used the DOS commands below for an unsupervised hard restart (the restart was done automatically once a day):
 
solution 1:
net stop condor
net start condor
 
solution 2:
taskkill /f /im condor_master.exe
net start condor /y
 
With Condor 8.0.2 the above solutions are no longer effective, because some of the daemons remain blocked and I cannot restart the Condor service.
 
Are there any other working solutions for when Condor is blocked?

Can you provide more details on Condor being "blocked"? Which daemons and/or commands are causing trouble? How often does this happen? Do the affected daemons continue to write to their log files?
If possible, we'd like to diagnose the underlying problem, whether it's a bug in HTCondor or a problem with a resource it's trying to use.
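
One rough way to answer the log-file question is to look at when each file in the HTCondor log directory was last modified. The Python sketch below is only an illustration: the function name `stale_logs`, the one-hour threshold and the directory you pass in are assumptions, not part of HTCondor.

```python
import os
import time


def stale_logs(log_dir, max_age_seconds, now=None):
    """Return (file name, age in seconds) for files not modified recently.

    log_dir is whatever directory your daemons log to (an assumption here;
    check your own configuration for the real location).
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            stale.append((name, age))
    return stale
```

Calling `stale_logs(your_log_dir, 3600)` while the pool looks blocked would list the files that have not been written to for more than an hour, which may hint at which daemon has stopped making progress. A quiet log is not proof of a hang, since an idle daemon may also write little.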

HTCondor has a mechanism to deal with "blocked" daemons. Each daemon sends an "alive" message to the condor_master on a regular basis. If the master doesn't receive any messages from a daemon for an hour, it will kill that daemon and restart it.
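
To make the idea concrete, here is a minimal Python sketch of a keepalive watchdog of the kind described above. It is not HTCondor's actual implementation: the class `Watchdog`, its methods and the `restart` callback are hypothetical names chosen for illustration, and only the one-hour timeout comes from the description.

```python
import time


class Watchdog:
    """Toy model of a master that restarts daemons which stop reporting in."""

    def __init__(self, timeout_seconds=3600, clock=time.monotonic):
        self.timeout = timeout_seconds   # one hour, as described above
        self.clock = clock               # injectable so the model can be tested
        self.last_alive = {}             # daemon name -> time of last "alive"

    def alive(self, daemon):
        """Record an "alive" message from a daemon."""
        self.last_alive[daemon] = self.clock()

    def hung(self):
        """Return the daemons that have been silent longer than the timeout."""
        now = self.clock()
        return [d for d, t in self.last_alive.items() if now - t > self.timeout]

    def sweep(self, restart):
        """Call restart(daemon) for each silent daemon and reset its timer."""
        for daemon in self.hung():
            restart(daemon)  # hypothetical callback: kill and relaunch the daemon
            self.last_alive[daemon] = self.clock()
```

In this model a daemon that keeps calling `alive()` is never touched, while one that goes silent for longer than the timeout is passed to `restart` on the next `sweep()`. That also means a blocked daemon can stay blocked for up to the full timeout before anything happens.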

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project