Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Removing a node ungracefully from the master.

Date: Thu, 21 Feb 2013 07:01:03 -0600
From: Nathan Panike <nwp@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Removing a node ungracefully from the master.

On Wed, Feb 20, 2013 at 04:58:45PM -0800, Matt Wallington wrote:
> So if I understand correctly, that's to prevent additional jobs from
> running on that machine, correct?  But how do I have Condor immediately stop
> and resubmit the jobs that were running on that machine to the next
> available machine?  I essentially want that machine to go away from Condor
> until it becomes available again, and any jobs that were running on it to go
> back into the queue to be matched on another machine.

That should happen automatically.

> 
> Is this possible?
> 
> On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> 
> > condor_advertise INVALIDATE_STARTD_ADS ...
> >
> > The condor_advertise man page covers this pretty well.
> >
> > Nathan Panike
> >
> > On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> > > > I am having an issue where, when a node is shut down forcefully (i.e. the
> > > > power cable is yanked from the system), the master and scheduler continue
> > > > to think the job is running on the node until it hits a timeout (which is
> > > > a significant amount of time).  Eventually Condor realizes the node is
> > > > offline and resubmits the job to another node.  I've found that if I
> > > > reduce the timeout, then the node will time out on jobs that run longer
> > > > than the timeout even if the node is online and operating properly.
> > >
> > > > Is there a way to forcefully remove a node from the master (a node that
> > > > has dropped offline but that Condor still thinks is running) from the
> > > > command line on the scheduler or master?  I've tried condor_off,
> > > > condor_vacate, condor_vacate_job, etc., but none of these work because
> > > > they all try to reach out to the node (which is now offline).  Is there a
> > > > command to simply remove a node from the pool immediately and have the
> > > > job start over on another node?
> > >
> > > Thanks,
