Re: [HTCondor-users] Removing a node ungracefully from the master.
- Date: Thu, 21 Feb 2013 07:01:03 -0600
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Removing a node ungracefully from the master.
On Wed, Feb 20, 2013 at 04:58:45PM -0800, Matt Wallington wrote:
> So if I understand correctly, that's to prevent additional jobs from
> running on that machine, correct? But how do I have Condor immediately stop
> and resubmit the jobs that were running on that machine to the next
> available machine? I essentially want that machine to go away from Condor
> until it becomes available again, and any jobs that were running on it to go
> back into the queue to be matched on another machine.
That should happen automatically.
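
For reference, here is a minimal sketch of the condor_advertise invocation
quoted below, wrapped in Python. It assumes the invalidation ad has the
three-attribute "Query" form that the condor_advertise man page gives as its
example, and that the startd ad's Name is known (something like
slot1@node07.example.com; that host name is made up). The helper names are
hypothetical, so check the ad format against the man page for your version
before relying on it.

```python
import shutil
import subprocess
import tempfile


def build_invalidate_ad(startd_name: str) -> str:
    """Return the text of a query ad selecting the startd ad to invalidate.

    Assumption: the collector accepts a Query ad whose Requirements
    expression selects the ad by its Name attribute.
    """
    return (
        'MyType = "Query"\n'
        'TargetType = "Machine"\n'
        f'Requirements = Name == "{startd_name}"\n'
    )


def build_command(ad_path: str) -> list:
    """Return the condor_advertise argument list for a startd invalidation."""
    return ["condor_advertise", "INVALIDATE_STARTD_ADS", ad_path]


def invalidate_startd(startd_name: str) -> None:
    """Write the ad to a temporary file and hand it to condor_advertise."""
    if shutil.which("condor_advertise") is None:
        raise RuntimeError("condor_advertise not found on PATH")
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
        f.write(build_invalidate_ad(startd_name))
        ad_path = f.name
    # Run on a host that is allowed to send updates to the collector.
    subprocess.run(build_command(ad_path), check=True)


if __name__ == "__main__":
    # Hypothetical slot name; replace with the Name of the dead node's ad.
    print(build_invalidate_ad("slot1@node07.example.com"))
```

A node with several slots advertises one ad per slot, so each slot Name
would need its own invalidation.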
> Is this possible?
> On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> > condor_advertise INVALIDATE_STARTD_ADS ...
> > The condor_advertise man page covers this pretty well.
> > Nathan Panike
> > On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> > > I am having an issue where, when a node is shut down forcefully (i.e. the
> > > power cable is yanked from the system), the master and scheduler continue
> > > to think the job is running on the node until it hits a timeout (which is a
> > > significant amount of time). Eventually Condor realizes the node is
> > > offline and resubmits the job to another node. I've found that if I reduce
> > > the timeout, then the node will time out on jobs that run longer than the
> > > timeout, even if the node is online and operating properly.
> > >
> > > Is there a way to forcefully remove a node from the master (a node that has
> > > dropped offline but that Condor still thinks is running) from the command
> > > line on the scheduler or master? I've tried condor_off, condor_vacate,
> > > condor_vacate_job, etc., but none of these work because they all try to
> > > reach out to the node (which is now offline). Is there a command to simply
> > > remove a node from the pool immediately and have the job start over on
> > > another node?
> > >
> > > Thanks,