
[HTCondor-users] Removing a node ungracefully from the master.



I am having an issue where, when a node is shut down forcefully (i.e. the power cable is yanked from the system), the master and scheduler continue to think the job is running on the node until it hits a timeout (which is a significant amount of time).  Eventually Condor realizes the node is offline and resubmits the job to another node.  I've found that if I reduce the timeout, the node will time out on jobs that run longer than the timeout, even if the node is online and operating properly.

Is there a way to forcefully remove a node from the master (a node that has dropped offline but that Condor still thinks is running) from the command line on the scheduler or master?  I've tried condor_off, condor_vacate, condor_vacate_job, etc., but none of these work because they all try to reach out to the node (which is now offline).  Is there a command to simply remove a node from the pool immediately and have the job start over on another node?

Thanks,

--
Matt Wallington  |  CPUsage, Inc.  |  503-708-1919