
Re: [HTCondor-users] does condor_off -peaceful -daemon startd node; works for vanilla jobs?



Hi,

 

Here is some more information. One can see that everything happens within one second, and all jobs are gone (or restarted on another node).

 

I issue the command on the central manager:

 

Thu Aug 18 19:43:03 CEST 2016

Sent "Set-Peaceful-Shutdown" command to startd node

Sent "Kill-Daemon-Peacefully" command to master node

Thu Aug 18 19:43:03 CEST 2016

 

And I see in /var/log/condor/StartLog on the node:

08/18/16 19:43:03 Got SIGTERM. Performing graceful shutdown.

08/18/16 19:43:03 shutdown graceful

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 Killing job mips

08/18/16 19:43:03 Killing job kflops

08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring

08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired

08/18/16 19:43:03 slot1_1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating

08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason

08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.

08/18/16 19:43:03 Starter pid 6873 exited with status 0

08/18/16 19:43:03 slot1_1: State change: starter exited

08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning to owner

08/18/16 19:43:03 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle

08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false

08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed

08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete

08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting

08/18/16 19:43:03 Deleting cron job manager

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 CronJobList: Deleting all jobs

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 CronJobList: Deleting all jobs

08/18/16 19:43:03 Deleting benchmark job mgr

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 Killing job mips

08/18/16 19:43:03 Killing job kflops

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 Killing job mips

08/18/16 19:43:03 Killing job kflops

08/18/16 19:43:03 CronJobList: Deleting all jobs

08/18/16 19:43:03 CronJobList: Deleting job 'mips'

08/18/16 19:43:03 CronJob: Deleting job 'mips' (/usr/lib/condor/libexec/condor_mips), timer -1

08/18/16 19:43:03 CronJobList: Deleting job 'kflops'

08/18/16 19:43:03 CronJob: Deleting job 'kflops' (/usr/lib/condor/libexec/condor_kflops), timer -1

08/18/16 19:43:03 Cron: Killing all jobs

08/18/16 19:43:03 CronJobList: Deleting all jobs

08/18/16 19:43:03 All resources are free, exiting.

08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING WITH STATUS 0
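
To check that the whole sequence really fits into one second, the slot transitions can be pulled out of the StartLog with a small script. This is only a sketch: the regular expression is derived from the log lines shown above, not from any documented log format, and the names slot_transitions and elapsed_seconds are made up for illustration.

```python
import re
from datetime import datetime

# Matches StartLog lines of the form seen above, e.g.
# "08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed"
# (pattern inferred from this log excerpt only; an assumption, not a spec).
TRANSITION = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<slot>slot\S+): Changing (?P<what>state and activity|state|activity): "
    r"(?P<old>\S+) -> (?P<new>\S+)$"
)


def slot_transitions(lines):
    """Return (timestamp, slot, kind, old, new) for every transition line."""
    found = []
    for line in lines:
        m = TRANSITION.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%m/%d/%y %H:%M:%S")
            found.append((ts, m.group("slot"), m.group("what"),
                          m.group("old"), m.group("new")))
    return found


def elapsed_seconds(transitions):
    """Seconds between the first and the last transition (0.0 if fewer than two)."""
    if len(transitions) < 2:
        return 0.0
    return (transitions[-1][0] - transitions[0][0]).total_seconds()


# A few lines copied from the StartLog excerpt above.
SAMPLE = """\
08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring
08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired
08/18/16 19:43:03 slot1_1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
08/18/16 19:43:03 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed
08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete
""".splitlines()

if __name__ == "__main__":
    transitions = slot_transitions(SAMPLE)
    for ts, slot, kind, old, new in transitions:
        print(f"{ts:%H:%M:%S} {slot}: {old} -> {new}")
    print("elapsed seconds:", elapsed_seconds(transitions))
```

Run against the full StartLog instead of SAMPLE, this should make it easy to see whether the slot ever stays in Retiring for longer than the timestamp resolution.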

 

 

 

On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:

> @Bob: I also issue the command from the central manager.

>

> @Todd:

> I have no MaxJobRetirementTime defined (nothing with retire or time found
> in condor_config*, neither on the node, the scheduler, nor the central manager).

>

> condor_status| grep node

> slot1@node LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04

> slot1_1@node LINUX X86_64 Claimed Busy 0.000 1024 0+00:00:03

>

> after

> condor_off -peaceful -daemon startd node

> condor_status shows no node anymore (within 1 second, as fast as I can

> type).

>

> We use

>

> CLAIM_WORKLIFE = 120

> and

> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

>

> NUM_SLOTS = 1

> SLOT_TYPE_1 = 100%

> SLOT_TYPE_1_PARTITIONABLE = true

> NUM_SLOTS_TYPE_1 = 1

>

> Any help is welcome.

>

> Harald

>

> On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:

> > As another data point, it also seemed to work for me running a

> > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.

> > Behold the simple test below; note the node went from Claimed/Busy to

> > Claimed/Retiring, which is expected. "Retiring" activity is defined in
> > the Manual (from https://is.gd/mi7mVk ):
> >
> > Retiring
> >
> > When an active claim is about to be preempted for any reason, it enters
> > retirement, while it waits for the current job to finish. The
> > MaxJobRetirementTime expression determines how long to wait (counting
> > since the time the job started). Once the job finishes or the retirement
> > time expires, the Preempting state is entered.

> >

> > Perhaps you have a MaxJobRetirementTime defined ?

> >

> > regards,

> > Todd

> >

> > [tannenba@localhost test]$ condor_status

> > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> >
> > slot1@localhost LINUX X86_64 Claimed   Busy     0.000  330 0+00:00:04
> > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:05
> > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> >
> >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> >
> > X86_64/LINUX     3     0       1         2       0          0        0     0
> >
> >        Total     3     0       1         2       0          0        0     0

> >

> > [tannenba@localhost test]$ condor_off -peaceful -daemon startd

> > Sent "Set-Peaceful-Shutdown" command to local startd

> > Sent "Kill-Daemon-Peacefully" command to local master

> >

> > [tannenba@localhost test]$ condor_status

> > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> >
> > slot1@localhost LINUX X86_64 Claimed   Retiring 0.000  330 0+00:00:03
> > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:02:49
> > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> >
> >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> >
> > X86_64/LINUX     3     0       1         2       0          0        0     0
> >
> >        Total     3     0       1         2       0          0        0     0

> >

> > On 8/18/2016 11:11 AM, Bob Ball wrote:

> > > Just as a data point, I do, from our central manager machine,

> > > condor_off -peaceful -daemon startd -name $publicName

> > > and it runs just fine. All our jobs are vanilla. HTCondor is version

> > > 8.4.6 on Scientific Linux.

> > >

> > > bob

> > >

> > > On 8/18/2016 11:54 AM, Harald van Pee wrote:

> > >> Hi,

> > >>

> > >> I want to take a node that is running jobs offline, but only after all running

> > >> jobs have finished. Of course until then no new jobs should be

> > >> accepted on that node.

> > >>

> > >> I tried the command:

> > >>

> > >> condor_off -peaceful -daemon startd node

> > >>

> > >> and got the message:

> > >>

> > >> Sent "Set-Peaceful-Shutdown" command to startd node

> > >>

> > >> Sent "Kill-Daemon-Peacefully" command to master node

> > >>

> > >> On node I see in StartLog

> > >>

> > >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.

> > >>

> > >> 08/18/16 17:20:49 shutdown graceful

> > >>

> > >> And indeed all jobs running in the vanilla universe (we have no others)
> > >> are killed directly and restarted from the beginning. This is what a
> > >> graceful shutdown will do with vanilla jobs. But I want to have a
> > >> peaceful shutdown.

> > >>

> > >> Is a peaceful shutdown not possible for vanilla jobs?

> > >>

> > >> Do I have to change the configuration? We use:

> > >>

> > >> PREEMPT = FALSE

> > >>

> > >> PREEMPTION_REQUIREMENTS = False

> > >>

> > >> KILL = FALSE

> > >>

> > >> WANT_SUSPEND = false

> > >>

> > >> WANT_VACATE = false

> > >>

> > >> Or can I use just a different command?

> > >>

> > >> We use condor 8.4.8 on debian 8 (AMD64).

> > >>

> > >> Thanks

> > >>

> > >> Harald

> > >>

> > >>

> > >>

> > >> _______________________________________________

> > >> HTCondor-users mailing list

> > >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx

> > >> with a subject: Unsubscribe

> > >> You can also unsubscribe by visiting

> > >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

> > >>

> > >> The archives can be found at:

> > >> https://lists.cs.wisc.edu/archive/htcondor-users/

> > >


 

--

Harald van Pee

 

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn

Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505

mail: pee@xxxxxxxxxxxxxxxxx