Re: [Condor-users] Condor 6.9.2 hung schedd

Stuart Anderson wrote:

On Wed, Jun 13, 2007 at 11:36:44AM -0500, Dan Bradley wrote:
And why doesn't 'condor_restart -sub schedd' work in this case?


Hmm. It worked for me when I tried it, but I'm running a pre-release of 6.9.3. The usual problem is that the security configuration doesn't allow condor_restart to operate from the machine where it is being run. The command-line tool does not know whether the operation was rejected, so the user sees no complaint. If you look in the schedd log, you will see a message indicating that it rejected the command.


One question is whether the schedd will honor a restart request when
it is blocked on a system call to obtain a file lock for a user log file?

Oops, good point. The answer is no, the schedd will not honor the restart request. When you specify 'condor_restart -sub schedd', the command goes directly to the schedd, which is hung and will therefore not process the command.

You could instead do 'condor_off -schedd' followed by 'condor_on -schedd', because these commands go to condor_master, which will then stop and start the schedd. I don't know off-hand how long the master will wait for the schedd to gracefully shut down in this case, but it will eventually resort to hard kill signals if it has to.
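
For anyone who wants to script the workaround, here is a minimal Python sketch of the two-step sequence described above. Only the two commands themselves come from this message. The function name `bounce_schedd`, its parameters, and the 30-second pause are hypothetical choices for illustration; the pause is an arbitrary guess, not a documented Condor timeout.

```python
import subprocess
import time

# Commands taken from the message above. Both are handled by condor_master,
# so they do not depend on the (possibly hung) schedd answering.
STOP_CMD = ["condor_off", "-schedd"]
START_CMD = ["condor_on", "-schedd"]


def bounce_schedd(runner=subprocess.run, pause_seconds=30, sleep=time.sleep):
    """Ask condor_master to stop, then start, the schedd.

    pause_seconds is an assumed grace period between the two commands,
    not a value taken from Condor. Returns the commands issued, in order.
    """
    issued = []
    for cmd in (STOP_CMD, START_CMD):
        # check=True raises if the tool exits non-zero. A zero exit status
        # may still not prove the request was accepted, since the tool
        # reportedly cannot always tell when a command was rejected.
        runner(cmd, check=True)
        issued.append(list(cmd))
        if cmd is STOP_CMD:
            # Give the master some time to shut the schedd down.
            sleep(pause_seconds)
    return issued
```

Calling `bounce_schedd()` on the submit machine would run both commands for real. Passing a different `runner` makes it possible to do a dry run first, and checking the master and schedd logs afterwards is still the safest way to confirm the restart happened.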

--Dan