Re: [HTCondor-users] Removing a node ungracefully from the master.



condor_advertise INVALIDATE_STARTD_ADS ...

The condor_advertise man page covers this pretty well.
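
For instance, something along the lines of the sketch below, adapted from
the example in the man page. The hostname node01.example.com and the file
name invalidate.ad are placeholders, not anything from your pool, and the
exact attributes the collector expects in the query ad may differ between
HTCondor versions, so check the man page for the release you run. On
multi-slot machines each slot advertises its own Name (e.g.
slot1@node01.example.com), so the expression may need adjusting.

```shell
#!/bin/sh
# Placeholder: replace with the Name of the dead machine's startd ad.
NODE="node01.example.com"

# Placeholder file name for the query ad.
AD_FILE="invalidate.ad"

# Query ad telling the collector which startd ads to invalidate.
cat > "$AD_FILE" <<EOF
MyType = "Query"
TargetType = "Machine"
Requirements = Name == "$NODE"
EOF

# Send the invalidation only if the HTCondor tools are installed here.
if command -v condor_advertise >/dev/null 2>&1; then
    condor_advertise INVALIDATE_STARTD_ADS "$AD_FILE"
fi
```

Note that this only removes the machine's ad from the collector, so the
node stops being matched; it does not contact the node itself.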

Nathan Panike

On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> I am having an issue where, when a node is shut down forcefully (i.e. the
> power cable is yanked from the system), the master and scheduler continue
> to think the job is running on the node until it hits a timeout (which is a
> significant amount of time).  Eventually Condor realizes the node is
> offline and resubmits the job to another node.  I've found that if I reduce
> the timeout, then the node will time out on jobs that run longer than the
> timeout even if the node is online and operating properly.
> 
> Is there a way to forcefully remove a node from the master (a node that has
> dropped offline but that Condor still thinks is running) from the command
> line on the scheduler or master?  I've tried condor_off, condor_vacate,
> condor_vacate_job, etc., but none of these work because they all try to
> reach out to the node (which is now offline).  Is there a command to simply
> remove a node from the pool immediately and have the job start over on
> another node?
> 
> Thanks,