On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> I am having an issue where, when a node is shut down forcefully (i.e. the
> power cable is yanked from the system), the master and scheduler continue
> to think the job is running on the node until it hits a timeout (which is a
> significant amount of time). Eventually Condor realizes the node is
> offline and it resubmits the job to another node. I've found if I reduce
> the timeout, then the node will time out on jobs that run longer than the
> timeout even if the node is online and operating properly.
> Is there a way to forcefully remove a node from the master (a node that has
> dropped offline but Condor still thinks is running) from the command line
> on the scheduler or master? I've tried condor_off, condor_vacate,
> condor_vacate_job, etc. but none of these work because they all try to
> reach out to the node (which is now offline). Is there a command to simply
> remove a node from the pool immediately and have the job start over on
> another node?