
Re: [HTCondor-users] Removing a node ungracefully from the master.



Very nice!  That seemed to get it to work.  I know this is a bit of a hack, but I have set up a daemon that receives ping requests from each of the nodes on an interval while they are running jobs.  If the schedd server (which is running my daemon) misses a couple of pings from a node, the daemon assumes that node has crashed: it runs condor_vacate_job -fast on whichever jobs are running on that system, then immediately runs condor_advertise INVALIDATE_STARTD_ADS for that node.  I then run a series of condor_reschedule commands at a periodic interval for the next couple of minutes to force the scheduler to "hurry up" and find a new device to run the job on once it switches back to idle.  I now have the process down to about a one-minute window from a job being hung on a crashed machine to running on a new machine.

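In rough outline, the eviction half of that watchdog boils down to something like the sketch below.  The heartbeat directory, the 60-second threshold, and the condor_q constraint are simplified placeholders rather than the real daemon, and the repeated condor_reschedule calls are collapsed into a single one:

#!/bin/bash
# Simplified watchdog pass -- assumes each node periodically touches a
# per-host heartbeat file under HEARTBEAT_DIR (path and threshold are
# placeholders, not the real setup).
HEARTBEAT_DIR=/var/run/condor_heartbeats
THRESHOLD=60   # seconds without a ping before a node is presumed dead

now=$(date +%s)
for stamp in "$HEARTBEAT_DIR"/*.stamp; do
    node=$(basename "$stamp" .stamp)
    age=$(( now - $(stat -c %Y "$stamp") ))
    [ "$age" -lt "$THRESHOLD" ] && continue

    # Hard-evict every running job whose RemoteHost points at the dead node.
    condor_q -constraint "JobStatus == 2 && regexp(\"$node\", RemoteHost)" \
             -format "%d." ClusterId -format "%d\n" ProcId |
    while read -r job; do
        condor_vacate_job -fast "$job"
    done

    # Flush the stale startd ad from the collector, using the same ad file
    # layout as in the quoted thread below.
    adfile=$(mktemp)
    {
        echo 'MyType = "Query"'
        echo 'TargetType = "Machine"'
        echo "Requirements = Name == \"$node\""
    } > "$adfile"
    condor_advertise INVALIDATE_STARTD_ADS "$adfile"
    rm -f "$adfile"

    # Nudge the schedd to ask the negotiator for a new match right away.
    condor_reschedule
done
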
Unless you see any glaring issues with this solution, I'm going to roll with it.

Thanks for all your help!

Matt

On Fri, Feb 22, 2013 at 2:39 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
On Fri, Feb 22, 2013 at 01:48:08PM -0800, Matt Wallington wrote:
> Thanks Nathan.  So just to be clear, when a startd crashes and the job is
> stuck in "running" state, there is no command that you can run on the
> schedd to immediately cancel the job and have it resubmit to another node?
>  Sorry to keep rehashing this, I am just getting confused on whether I have
> to wait for it to hit the ALIVE_INTERVAL or whether I can just submit a
> command to the schedd to have it immediately reschedule the job onto
> another machine without having to try to connect to the startd.

You can run condor_vacate_job to immediately cancel the job. The collector will
keep the startd ad for a period of time after the startd shuts down; the
condor_advertise command is what flushes the startd ad from the collector (so
that the matchmaker does not just match you up with the same old machine). It
will then take a moment for the schedd to get a new match from the negotiator,
connect to the new startd, and begin running your job.
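
Concretely, with the job id from your SchedLog and the ad_file you created earlier, that sequence is roughly:

  condor_vacate_job -fast 49.0                      # -fast forces a hard evict
  condor_advertise INVALIDATE_STARTD_ADS ad_file

and the negotiator's next cycle can then match the job to a different startd.
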
>
> I want it to just immediately assume the startd is gone and immediately
> reschedule onto another device.  If that's not possible I will have to
> write a custom solution to store the job submissions in a database and
> reschedule the job manually through an additional condor_submit instance
> when a node crashes.  I know I'm probably trying to make Condor do
> something it wasn't meant to do where I have a tightly controlled timeline
> for jobs to requeue when a node crashes.
>

Condor indeed might not be the best solution for your problem.

> Thanks again,
> Matt

Nathan Panike
>
> On Thu, Feb 21, 2013 at 5:39 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
>
> > On Thu, Feb 21, 2013 at 01:47:16PM -0800, Matt Wallington wrote:
> > > Maybe I'm doing something wrong, but I have two nodes (cnode1.cpusage.com
> > > and cnode2.cpusage.com) and I scheduled a job which is now running on
> > > cnode1.cpusage.com.  I created a file "ad_file" with the following:
> > > MyType = "Query"
> > > TargetType = "Machine"
> > > Requirements = Name == "cnode1.cpusage.com"
> > >
> > > I then ran the following command from the schedd / master system:
> > > condor_advertise INVALIDATE_STARTD_ADS ad_file
> > >
> > > It responded with "Sent 1 of 1 ad to clustermaster.cpusage.com" (the name
> > > of my master / schedd)
> >
> > You sent a message to the *collector*.  The collector will not have any
> > startd ads until the startd on cnode1.cpusage.com sends another ad.
> > What you have not done is told the schedd that you no longer want your
> > job to run on cnode1.cpusage.com.  To do that, you would run
> > condor_vacate_job.
> >
> > You have also not told the startd to stop running your job, or turned
> > off the power, or done any other action that will cause your job to stop
> > running.
> >
> > >
> > > Nothing happened.  The job kept on trucking along on cnode1.  I would have
> > > expected it to stop and restart on cnode2.cpusage.com.  I also tried
> > > yanking the power cord to cnode1 midway through the job and then ran the
> > > condor_advertise command and same thing, it kept the hung job on cnode1
> > > until it hit the timeout several minutes later.  (which is the same
> > > behavior it has without running the INVALIDATE_STARTD_ADS command).
> >
> > This is the expected behavior.  The schedd cannot tell the difference
> > between a node that has lost power and a node that it is momentarily
> > unable to connect to.  So it optimistically tries to reconnect for a
> > period of time, until it gives up; then it tries to find a new machine
> > to run on.
> >
> > It looks like ALIVE_INTERVAL may be the config knob to use if you want
> > to reduce the amount of time that the schedd takes to time out.
> > >
> > > Here's the last few lines of my SchedLog if that helps:
> > >
> > > 02/21/13 13:39:37 (pid:793) Sent ad to central manager for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:39:37 (pid:793) Sent ad to 1 collectors for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:39:37 (pid:793) Completed REQUEST_CLAIM to startd
> > > slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt
> > > 02/21/13 13:39:37 (pid:793) Starting add_shadow_birthdate(49.0)
> > > 02/21/13 13:39:37 (pid:793) Started shadow for job 49.0 on
> > > slot2@xxxxxxxxxxxxxxxxxx <10.1.10.12:39852> for matt, (shadow pid = 14258)
> > > 02/21/13 13:39:41 (pid:793) Number of Active Workers 1
> > > 02/21/13 13:39:41 (pid:14266) Number of Active Workers 0
> > > 02/21/13 13:40:37 (pid:793) Activity on stashed negotiator socket: <
> > > 10.1.10.11:60412>
> > > 02/21/13 13:40:37 (pid:793) Using negotiation protocol: NEGOTIATE
> > > 02/21/13 13:40:37 (pid:793) Negotiating for owner:
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:40:37 (pid:793) Finished negotiating for matt in local pool:
> > > 0 matched, 0 rejected
> > > 02/21/13 13:40:37 (pid:793) TransferQueueManager stats: active up=0/10
> > > down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> > > 02/21/13 13:40:37 (pid:793) Sent ad to central manager for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:40:37 (pid:793) Sent ad to 1 collectors for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:42:54 (pid:793) Cleaning job queue...
> > > 02/21/13 13:42:54 (pid:793) About to rotate ClassAd log
> > > /var/lib/condor/spool/job_queue.log
> > > 02/21/13 13:44:14 (pid:793) Number of Active Workers 1
> > > 02/21/13 13:44:14 (pid:15112) Number of Active Workers 0
> > > 02/21/13 13:45:37 (pid:793) TransferQueueManager stats: active up=0/10
> > > down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> > > 02/21/13 13:45:37 (pid:793) Sent ad to central manager for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 02/21/13 13:45:37 (pid:793) Sent ad to 1 collectors for
> > > matt@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > >
> > > On Thu, Feb 21, 2013 at 5:01 AM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> > >
> > > > On Wed, Feb 20, 2013 at 04:58:45PM -0800, Matt Wallington wrote:
> > > > > So if I understand correctly, that's to prevent additional jobs from
> > > > > running on that machine, correct?  But how do I have Condor immediately
> > > > > stop and resubmit the jobs that were running on that machine to the next
> > > > > available machine?  I essentially want that machine to go away from
> > > > > condor until it becomes available again, and any jobs that were running
> > > > > on it to go back into the queue to be matched on another machine.
> > > >
> > > > That should happen automatically.
> >
> > When I wrote automatically here, I did not say instantaneously.
> >
> > > >
> > > > >
> > > > > Is this possible?
> > > > >
> > > > > On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
> > > > >
> > > > > > condor_advertise INVALIDATE_STARTD_ADS ...
> > > > > >
> > > > > > The condor_advertise man page covers this pretty well.
> > > > > >
> > > > > > Nathan Panike
> > > > > >
> > > > > > On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> > > > > > > I am having an issue where, when a node is shut down forcefully
> > > > > > > (i.e. the power cable is yanked from the system), the master and
> > > > > > > scheduler continue to think the job is running on the node until it
> > > > > > > hits a timeout (which is a significant amount of time).  Eventually
> > > > > > > condor realizes the node is offline and it resubmits the job to
> > > > > > > another node.  I've found that if I reduce the timeout, then the node
> > > > > > > will time out on jobs that run longer than the timeout even if the
> > > > > > > node is online and operating properly.
> > > > > > >
> > > > > > > Is there a way to forcefully remove a node from the master (a node
> > > > > > > that has dropped offline but Condor still thinks is running) from the
> > > > > > > command line on the scheduler or master?  I've tried condor_off,
> > > > > > > condor_vacate, condor_vacate_job, etc. but none of these work because
> > > > > > > they all try to reach out to the node (which is now offline).  Is
> > > > > > > there a command to simply remove a node from the pool immediately and
> > > > > > > have the job start over on another node?
> > > > > > >
> > > > > > > Thanks,



--
Matt Wallington  |  CPUsage, Inc.  |  503-708-1919