
Re: [HTCondor-users] Removing a node ungracefully from the master.



On Thu, Feb 21, 2013 at 01:47:16PM -0800, Matt Wallington wrote:
> Maybe I'm doing something wrong but I have two nodes (cnode1.cpusage.com and
> cnode2.cpusage.com) and I scheduled a job which is now running on
> cnode1.cpusage.com.  I created a file "ad_file" with the following:
> MyType = "Query"
> TargetType = "Machine"
> Requirements = Name == "cnode1.cpusage.com"
> 
> I then ran the following command from the schedd / master system:
> condor_advertise INVALIDATE_STARTD_ADS ad_file
> 
> It responded with "Sent 1 of 1 ad to clustermaster.cpusage.com" (the name
> of my master / schedd)

You sent a message to the *collector*.  The collector will not have any
startd ads until the startd on cnode1.cpusage.com sends another ad.
What you have not done is tell the schedd that you no longer want your
job to run on cnode1.cpusage.com.  To do that, you would run
condor_vacate_job.
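
For example, using the job id 49.0 that shows up in your SchedLog below (a
sketch, not something I have tested against your pool):

    # tell the schedd to release the claim and put job 49.0 back in the queue
    condor_vacate_job 49.0

condor_vacate_job also accepts a username or a -constraint expression if you
want to vacate everything a user has running.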

You have also not told the startd to stop running your job, or turned
off the power, or done any other action that will cause your job to stop
running.
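
If the machine were still reachable, something along these lines would ask
its startd to evict the job (again just a sketch, reusing the cnode1 name
from your setup):

    # ask the startd on cnode1 to kick off whatever is running there
    condor_vacate -name cnode1.cpusage.com

With the power cord pulled, that command has nothing to talk to, which is why
the schedd-side condor_vacate_job above is the one that matters here.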

> 
> Nothing happened.  The job kept on trucking along on cnode1.  I would have
> expected it to stop and restart on cnode2.cpusage.com.  I also tried
> yanking the power cord to cnode1 midway through the job and then ran the
> condor_advertise command and same thing, it kept the hung job on cnode1
> until it hit the timeout several minutes later.  (which is the same
> behavior it has without running the INVALIDATE_STARTD_ADS command).

This is the expected behavior.  The schedd cannot tell the difference
between a node that has lost power and a node that it is momentarily
unable to connect to.  So it optimistically tries to reconnect for a
period of time, until it gives up; then it tries to find a new machine
to run on.

It looks like ALIVE_INTERVAL may be the config knob to use if you want
to reduce the amount of time that the schedd takes to time out.
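
As a rough sketch (treat the exact value as an assumption to tune for your
pool; the stock default is 300 seconds), you would set it in the schedd's
configuration and reconfig:

    # condor_config.local on the schedd host
    ALIVE_INTERVAL = 60

    condor_reconfig

Keep in mind that a shorter interval means more keepalive traffic between the
schedd and the startds.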
> 
> Here's the last few lines of my SchedLog if that helps:
> 
> 02/21/13 13:39:37 (pid:793) Sent ad to central manager for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:39:37 (pid:793) Sent ad to 1 collectors for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:39:37 (pid:793) Completed REQUEST_CLAIM to startd
> slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt
> 02/21/13 13:39:37 (pid:793) Starting add_shadow_birthdate(49.0)
> 02/21/13 13:39:37 (pid:793) Started shadow for job 49.0 on
> slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt, (shadow pid = 14258)
> 02/21/13 13:39:41 (pid:793) Number of Active Workers 1
> 02/21/13 13:39:41 (pid:14266) Number of Active Workers 0
> 02/21/13 13:40:37 (pid:793) Activity on stashed negotiator socket: <
> 10.1.10.11:60412>
> 02/21/13 13:40:37 (pid:793) Using negotiation protocol: NEGOTIATE
> 02/21/13 13:40:37 (pid:793) Negotiating for owner:
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:40:37 (pid:793) Finished negotiating for matt in local pool: 0
> matched, 0 rejected
> 02/21/13 13:40:37 (pid:793) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 02/21/13 13:40:37 (pid:793) Sent ad to central manager for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:40:37 (pid:793) Sent ad to 1 collectors for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:42:54 (pid:793) Cleaning job queue...
> 02/21/13 13:42:54 (pid:793) About to rotate ClassAd log
> /var/lib/condor/spool/job_queue.log
> 02/21/13 13:44:14 (pid:793) Number of Active Workers 1
> 02/21/13 13:44:14 (pid:15112) Number of Active Workers 0
> 02/21/13 13:45:37 (pid:793) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 02/21/13 13:45:37 (pid:793) Sent ad to central manager for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 02/21/13 13:45:37 (pid:793) Sent ad to 1 collectors for
> matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> 
> 
> On Thu, Feb 21, 2013 at 5:01 AM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> 
> > On Wed, Feb 20, 2013 at 04:58:45PM -0800, Matt Wallington wrote:
> > > So if I understand correctly, that's to prevent additional jobs from
> > > running on that machine correct?  But how do I have Condor immediately
> > stop
> > > and resubmit the jobs that were running on that machine to the next
> > > available machine?  I essentially want that machine to go away from
> > condor
> > > until it becomes available again and any jobs that were running on it to
> > go
> > > back into the queue to be matched on another machine.
> >
> > That should happen automatically.

When I wrote "automatically" here, I did not mean "instantaneously".

> >
> > >
> > > Is this possible?
> > >
> > > On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> > >
> > > > condor_advertise INVALIDATE_STARTD_ADS ...
> > > >
> > > > The condor_advertise man page covers this pretty well.
> > > >
> > > > Nathan Panike
> > > >
> > > > On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> > > > > I am having an issue where when a node is shut down forcefully (i.e.
> > the
> > > > > power cable is yanked from the system), the master and scheduler
> > > > continues
> > > > > to think the job is running on the node until it hits a timeout
> > (which
> > > > is a
> > > > > significant amount of time).  Eventually condor realizes the node is
> > > > > offline and it resubmits the job to another node.  I've found if I
> > reduce
> > > > > the timeout, then the node will timeout on jobs that run longer than
> > the
> > > > > timeout even if the node is online and operating properly.
> > > > >
> > > > > Is there a way to forcefully remove a node from the master (a node
> > that
> > > > has
> > > > > dropped offline but Condor still thinks is running) from the command
> > line
> > > > > on the scheduler or master?  I've tried condor_off, condor_vacate,
> > > > > condor_vacate_job, etc. but none of these work because they all try
> > to
> > > > > reach out to the node (which is now offline).  Is there a command to simply
> > > > > remove a node from the pool immediately and have the job start over
> > on
> > > > > another node?
> > > > >
> > > > > Thanks,
> >
> 
> 
> 
> -- 
> Matt Wallington  |  CPUsage, Inc.  |  503-708-1919

-- 
Nathan Panike, nwp@xxxxxxxxxxx
UW-Madison Center for High Throughput Computing
Computer Sciences Department, Room 4280
1210 W. Dayton St.
Madison, WI 53706 USA
608.890.0032

Laboratory for Molecular and Computational Genomics
Biotechnology Center
425 Henry Mall, room 5445
Madison, WI 53706 USA
608.890.0086