
Re: [HTCondor-users] does condor_off -peaceful -daemon startd node; works for vanilla jobs?



@Bob: I also issue the command from the central manager.

 

@Todd:

I have no MaxJobRetirementTime defined: nothing containing "retire" or "time" shows up in

condor_config* on the node, the scheduler, or the central manager.
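
Searching the config files only shows what is written in them, so it may miss values that come from elsewhere. One possible cross-check, sketched here, is to ask the startd on the node directly with condor_config_val. `show_knob` is a hypothetical helper name; the sketch assumes the HTCondor command-line tools are on PATH and that the pool's security settings allow remote configuration queries.

```shell
# Hypothetical helper: ask the startd on a host which value it uses for a knob.
# Assumes condor_config_val is on PATH and remote queries are permitted.
show_knob() {
    host="$1"
    knob="$2"
    # -name selects the machine, -startd selects the daemon to query.
    condor_config_val -name "$host" -startd "$knob" 2>/dev/null \
        || echo "$knob: no value returned from $host (undefined or query failed)"
}

# Usage:
#   show_knob node MAXJOBRETIREMENTTIME
#   show_knob node CLAIM_WORKLIFE
```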

 

condor_status| grep node

slot1@node LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04

slot1_1@node LINUX X86_64 Claimed Busy 0.000 1024 0+00:00:03

 

after

condor_off -peaceful -daemon startd node

condor_status no longer lists the node at all (within 1 second, as fast as I can type).
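
To check whether the slots ever pass through Claimed/Retiring before they vanish, the State and Activity columns can be pulled out of plain condor_status output. This is only a rough sketch: it assumes the default column order seen in the output above (Name, OpSys, Arch, State, Activity, ...), and `slot_activity` is a made-up helper, not an HTCondor tool.

```shell
# Read plain condor_status output on stdin and print "name state activity"
# for every slot on the given host. Assumes the default column order
# Name OpSys Arch State Activity LoadAv Mem ActvtyTime.
slot_activity() {
    awk -v host="$1" '
        {
            at = index($1, "@")                 # slot names look like slot1@host
            if (at > 0 && substr($1, at + 1) == host)
                print $1, $4, $5                # Name, State, Activity
        }'
}

# Usage (run repeatedly right after condor_off to catch the transition):
#   condor_status | slot_activity node
```

Header and summary lines contain no `@` in the first column, so they are skipped.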

 

We use

 

CLAIM_WORKLIFE = 120

and

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

 

NUM_SLOTS = 1

SLOT_TYPE_1 = 100%

SLOT_TYPE_1_PARTITIONABLE = true

NUM_SLOTS_TYPE_1 = 1

 

Any help is welcome.

 

Harald

 

On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:

> As another data point, it also seemed to work for me running a

> pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.

> Behold the simple test below; note the node went from Claimed/Busy to

> Claimed/Retiring, which is expected. "Retiring" activity is defined in

> the Manual (from https://is.gd/mi7mVk ):

>

> Retiring

> When an active claim is about to be preempted for any reason, it enters

> retirement, while it waits for the current job to finish. The

> MaxJobRetirementTime expression determines how long to wait (counting

> since the time the job started). Once the job finishes or the retirement

> time expires, the Preempting state is entered.

>

> Perhaps you have a MaxJobRetirementTime defined ?

>

> regards,

> Todd

>

> [tannenba@localhost test]$ condor_status

> Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
>
> slot1@localhost LINUX X86_64 Claimed   Busy     0.000  330 0+00:00:04
> slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:05
> slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
>
>              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
>
> X86_64/LINUX     3     0       1         2       0          0        0     0
>
>        Total     3     0       1         2       0          0        0     0

>

> [tannenba@localhost test]$ condor_off -peaceful -daemon startd

> Sent "Set-Peaceful-Shutdown" command to local startd

> Sent "Kill-Daemon-Peacefully" command to local master

>

> [tannenba@localhost test]$ condor_status

> Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
>
> slot1@localhost LINUX X86_64 Claimed   Retiring 0.000  330 0+00:00:03
> slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:02:49
> slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
>
>              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
>
> X86_64/LINUX     3     0       1         2       0          0        0     0
>
>        Total     3     0       1         2       0          0        0     0

>

> On 8/18/2016 11:11 AM, Bob Ball wrote:

> > Just as a data point, I do, from our central manager machine,

> > condor_off -peaceful -daemon startd -name $publicName

> > and it runs just fine. All our jobs are vanilla. HTCondor is version

> > 8.4.6 on Scientific Linux.

> >

> > bob

> >

> > On 8/18/2016 11:54 AM, Harald van Pee wrote:

> >> Hi,

> >>

> >> I want to take a node that is running jobs offline, but only after all running

> >> jobs have finished. Of course until then no new jobs should be

> >> accepted on that node.

> >>

> >> I tried the command:

> >>

> >> condor_off -peaceful -daemon startd node

> >>

> >> and got the message:

> >>

> >> Sent "Set-Peaceful-Shutdown" command to startd node

> >>

> >> Sent "Kill-Daemon-Peacefully" command to master node

> >>

> >> On node I see in StartLog

> >>

> >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.

> >>

> >> 08/18/16 17:20:49 shutdown graceful

> >>

> >> And indeed all jobs running in the vanilla universe (we have no others)
> >>
> >> are killed immediately and restarted from the beginning. This is what a

> >>

> >> graceful shutdown will do with vanilla jobs. But I want to have a

> >> peaceful shutdown.

> >>

> >> Is a peaceful shutdown not possible for vanilla jobs?

> >>

> >> Do I have to change the configuration? We use:

> >>

> >> PREEMPT = FALSE

> >>

> >> PREEMPTION_REQUIREMENTS = False

> >>

> >> KILL = FALSE

> >>

> >> WANT_SUSPEND = false

> >>

> >> WANT_VACATE = false

> >>

> >> Or can I use just a different command?

> >>

> >> We use condor 8.4.8 on debian 8 (AMD64).

> >>

> >> Thanks

> >>

> >> Harald

> >>

> >>

> >>

> >> _______________________________________________

> >> HTCondor-users mailing list

> >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx

> >> with a subject: Unsubscribe

> >> You can also unsubscribe by visiting

> >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

> >>

> >> The archives can be found at:

> >> https://lists.cs.wisc.edu/archive/htcondor-users/

> >


 

--

Harald van Pee

 

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn

Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505

mail: pee@xxxxxxxxxxxxxxxxx