
Re: [HTCondor-users] Removing a node ungracefully from the master.



Maybe I'm doing something wrong, but I have two nodes (cnode1.cpusage.com and cnode2.cpusage.com), and I scheduled a job that is now running on cnode1.cpusage.com.  I created a file "ad_file" with the following:
MyType = "Query"
TargetType = "Machine"
Requirements = Name == "cnode1.cpusage.com"

I then ran the following command from the schedd / master system:
condor_advertise INVALIDATE_STARTD_ADS ad_file

It responded with "Sent 1 of 1 ad to clustermaster.cpusage.com" (the name of my master / schedd).
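
As a sanity check, I believe the collector can be asked whether it still holds any ads for that host.  The SchedLog below shows slot names like "slot2@...", so I'm assuming the startd ads carry Name values such as "slot1@cnode1.cpusage.com" rather than the bare hostname, which is why this constraint matches on Machine rather than Name (just my guess at the right query):

# list any slot ads the collector still holds for cnode1
condor_status -long -constraint 'Machine == "cnode1.cpusage.com"'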

Nothing happened.  The job kept on trucking along on cnode1.  I would have expected it to stop and restart on cnode2.cpusage.com.  I also tried yanking the power cord to cnode1 midway through the job and then running the condor_advertise command, and the same thing happened: it kept the hung job on cnode1 until it hit the timeout several minutes later (which is the same behavior I see without running the INVALIDATE_STARTD_ADS command).
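
For what it's worth, I assume the timeout I'm hitting is the job lease.  My (possibly wrong) reading of the docs is that it can be shortened per job in the submit file, at the cost of less tolerance for brief network outages; a rough sketch (the executable name and the 300-second value are just placeholders):

# example submit file with a shorter job lease; values are illustrative only
executable         = my_job.sh
job_lease_duration = 300
queue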

Here are the last few lines of my SchedLog, if that helps:

02/21/13 13:39:37 (pid:793) Sent ad to central manager for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:39:37 (pid:793) Sent ad to 1 collectors for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:39:37 (pid:793) Completed REQUEST_CLAIM to startd slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt
02/21/13 13:39:37 (pid:793) Starting add_shadow_birthdate(49.0)
02/21/13 13:39:37 (pid:793) Started shadow for job 49.0 on slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt, (shadow pid = 14258)
02/21/13 13:39:41 (pid:793) Number of Active Workers 1
02/21/13 13:39:41 (pid:14266) Number of Active Workers 0
02/21/13 13:40:37 (pid:793) Activity on stashed negotiator socket: <10.1.10.11:60412>
02/21/13 13:40:37 (pid:793) Using negotiation protocol: NEGOTIATE
02/21/13 13:40:37 (pid:793) Negotiating for owner: matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:40:37 (pid:793) Finished negotiating for matt in local pool: 0 matched, 0 rejected
02/21/13 13:40:37 (pid:793) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/21/13 13:40:37 (pid:793) Sent ad to central manager for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:40:37 (pid:793) Sent ad to 1 collectors for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:42:54 (pid:793) Cleaning job queue...
02/21/13 13:42:54 (pid:793) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
02/21/13 13:44:14 (pid:793) Number of Active Workers 1
02/21/13 13:44:14 (pid:15112) Number of Active Workers 0
02/21/13 13:45:37 (pid:793) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/21/13 13:45:37 (pid:793) Sent ad to central manager for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
02/21/13 13:45:37 (pid:793) Sent ad to 1 collectors for matt@xxxxxxxxxxxxxxxxxxxxxxxxx
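
In case it helps, this is how I think you can see where the schedd still believes job 49.0 (the cluster.proc from the log above) is running:

# dump the job ad and pull out the execute host and status
condor_q -long 49.0 | grep -E 'RemoteHost|JobStatus'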


On Thu, Feb 21, 2013 at 5:01 AM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
On Wed, Feb 20, 2013 at 04:58:45PM -0800, Matt Wallington wrote:
> So if I understand correctly, that's to prevent additional jobs from
> running on that machine, correct?  But how do I have Condor immediately stop
> and resubmit the jobs that were running on that machine to the next
> available machine?  I essentially want that machine to go away from Condor
> until it becomes available again, and any jobs that were running on it to go
> back into the queue to be matched on another machine.

That should happen automatically.

>
> Is this possible?
>
> On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
>
> > condor_advertise INVALIDATE_STARTD_ADS ...
> >
> > The condor_advertise man page covers this pretty well.
> >
> > Nathan Panike
> >
> > On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> > > I am having an issue where, when a node is shut down forcefully (i.e. the
> > > power cable is yanked from the system), the master and scheduler continue
> > > to think the job is running on the node until they hit a timeout (which is
> > > a significant amount of time).  Eventually Condor realizes the node is
> > > offline and resubmits the job to another node.  I've found that if I reduce
> > > the timeout, then the node will time out on jobs that run longer than the
> > > timeout, even if the node is online and operating properly.
> > >
> > > Is there a way to forcefully remove a node from the master (a node that has
> > > dropped offline but Condor still thinks is running) from the command line
> > > on the scheduler or master?  I've tried condor_off, condor_vacate,
> > > condor_vacate_job, etc., but none of these work because they all try to
> > > reach out to the node (which is now offline).  Is there a command to simply
> > > remove a node from the pool immediately and have the job start over on
> > > another node?
> > >
> > > Thanks,



--
Matt Wallington  |  CPUsage, Inc.  |  503-708-1919