
Re: [HTCondor-users] does condor_off -peaceful -daemon startd node; works for vanilla jobs?



Hi,

 

Logged into the execute node that kicked off the job, I tried:

condor_config_val -startd -dump maxjobretirementtime

 

result:

# Parameters with names that match maxjobretirementtime:

UWCS_MaxJobRetirementTime = 0

# Contributing configuration file(s):

# <Default>

 

Could this cause the problem? It must be the default, because it is not set in any configuration file; and it cannot come from the submit description files either, because it also happens for my own jobs and I do not set any MaxJobRetirementTime.
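To double-check the job side as Todd suggested, something like the following should list the attribute for every job still in the queue (condor_q's autoformat prints "undefined" for jobs that do not set it):

condor_q -af ClusterId ProcId MaxJobRetirementTime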

 

What value would be best? We never want to stop any regular job.

UWCS_MaxJobRetirementTime = -1, or just a big number?
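For instance, would something like this in the execute nodes' local configuration be the right approach? This is just a sketch; the one-year value is an arbitrary stand-in for "effectively forever", since (per the manual text quoted below) the knob is an expression counted in seconds from the time the job started:

# Let running jobs retire undisturbed for up to a year before
# a peaceful or graceful shutdown is allowed to preempt them.
MaxJobRetirementTime = 365 * 24 * 3600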

 

Many thanks

Harald

 

 

On Thursday 18 August 2016 21:43:23 Todd Tannenbaum wrote:

> On 8/18/2016 12:50 PM, Harald van Pee wrote:

> > Hi,

> >

> > here is just some more information; one can see that everything happens
> > within a second and all the jobs are gone (or restarted on another node).

>

> [snip]

>

> > 08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring

> >

> > 08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired

>

> ^^^ This line is the smoking gun. This is saying that either the job or

> the slot has defined a MaxJobRetirementTime, and that the job has

> already been running for longer than this defined period, so the startd

> immediately leaves Claimed/Retiring state and goes to

> Preempting/Vacating (which sends a SIGTERM to the job).

>

> For example, I get the exact same results as you when I do a condor_off

> -peaceful, with the exact same messages in the StartLog, if I submit a

> job that looks like the following:

>

> Executable = /bin/sleep

> Arguments = 60000

> # Allow the job to be preempted if HTCondor wants

> # to shutdown or run a higher priority job if and

> # only if this job has already run for more than

> # one second.

> MaxJobRetirementTime = 1

> Queue

>

> In your tests, are you positive that neither the job(s) being preempted

> nor the execute node where the condor_startd is running define

> MaxJobRetirementTime ? Because it really looks that way to me. To

> check the job, use condor_q (or condor_history if the job left the

> queue) and pass "-af MaxJobRetirementTime" command-line arg. To check

> the condor_startd, if you are logged into the execute node that kicked

> off the job, try

>

> condor_config_val -startd -dump maxjobretirementtime

>

>

> regards,

> Todd

>

> > 08/18/16 19:43:03 slot1_1: Changing state and activity: Claimed/Retiring

> > -> Preempting/Vacating

> >

> > 08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from

> > host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level

> > DAEMON: reason: cached result for DAEMON; see first case for the full

> > reason

> >

> > 08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting

> > state, ignoring.

> >

> > 08/18/16 19:43:03 Starter pid 6873 exited with status 0

> >

> > 08/18/16 19:43:03 slot1_1: State change: starter exited

> >

> > 08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning

> > to owner

> >

> > 08/18/16 19:43:03 slot1_1: Changing state and activity:

> > Preempting/Vacating -> Owner/Idle

> >

> > 08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false

> >

> > 08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed

> >

> > 08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete

> >

> > 08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting

> >

> > 08/18/16 19:43:03 Deleting cron job manager

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> >

> > 08/18/16 19:43:03 Deleting benchmark job mgr

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 Killing job mips

> >

> > 08/18/16 19:43:03 Killing job kflops

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 Killing job mips

> >

> > 08/18/16 19:43:03 Killing job kflops

> >

> > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> >

> > 08/18/16 19:43:03 CronJobList: Deleting job 'mips'

> >

> > 08/18/16 19:43:03 CronJob: Deleting job 'mips'

> > (/usr/lib/condor/libexec/condor_mips), timer -1

> >

> > 08/18/16 19:43:03 CronJobList: Deleting job 'kflops'

> >

> > 08/18/16 19:43:03 CronJob: Deleting job 'kflops'

> > (/usr/lib/condor/libexec/condor_kflops), timer -1

> >

> > 08/18/16 19:43:03 Cron: Killing all jobs

> >

> > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> >

> > 08/18/16 19:43:03 All resources are free, exiting.

> >

> > 08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING

> > WITH STATUS 0

> >

> > On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:

> > > @Bob: I also issue the command from the central manager.

> > >

> > >

> > >

> > > @Todd:

> > >

> > > I have no MaxJobRetirementTime defined (nothing with "retire" or "time"
> > > is found in condor_config*), neither on the node, the scheduler, nor
> > > the central manager.

> > >

> > >

> > >

> > > condor_status | grep node

> > >

> > > slot1@node LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04

> > >

> > > slot1_1@node LINUX X86_64 Claimed Busy 0.000 1024 0+00:00:03

> > >

> > >

> > >

> > > after

> > >

> > > condor_off -peaceful -daemon startd node

> > >

> > > condor_status shows no node anymore (within 1 second, as fast as I can

> > >

> > > type).

> > >

> > >

> > >

> > > We use

> > >

> > >

> > >

> > > CLAIM_WORKLIFE = 120

> > >

> > > and

> > >

> > > STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

> > >

> > >

> > >

> > > NUM_SLOTS = 1

> > >

> > > SLOT_TYPE_1 = 100%

> > >

> > > SLOT_TYPE_1_PARTITIONABLE = true

> > >

> > > NUM_SLOTS_TYPE_1 = 1

> > >

> > >

> > >

> > > Any help is welcome.

> > >

> > >

> > >

> > > Harald

> > >

> > > On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:

> > > > As another data point, it also seemed to work for me running a

> > > >

> > > > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.

> > > >

> > > > Behold the simple test below; note the node went from Claimed/Busy to
> > > > Claimed/Retiring, which is expected. The "Retiring" activity is
> > > > defined in the Manual (from https://is.gd/mi7mVk ):

> > > >

> > > > Retiring
> > > >
> > > > When an active claim is about to be preempted for any reason, it
> > > > enters retirement, while it waits for the current job to finish. The
> > > > MaxJobRetirementTime expression determines how long to wait (counting
> > > > since the time the job started). Once the job finishes or the
> > > > retirement time expires, the Preempting state is entered.

> > > >

> > > >

> > > >

> > > > Perhaps you have a MaxJobRetirementTime defined?

> > > >

> > > >

> > > >

> > > > regards,

> > > >

> > > > Todd

> > > >

> > > >

> > > >

> > > > [tannenba@localhost test]$ condor_status
> > > > Name            OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime
> > > > slot1@localhost LINUX  X86_64 Claimed   Busy      0.000 330  0+00:00:04
> > > > slot2@localhost LINUX  X86_64 Unclaimed Idle      0.000 330  0+00:00:05
> > > > slot3@localhost LINUX  X86_64 Unclaimed Idle      0.000 330  0+00:00:06
> > > >
> > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > >        Total     3     0       1         2       0          0        0     0

> > > >

> > > >

> > > >

> > > > [tannenba@localhost test]$ condor_off -peaceful -daemon startd

> > > >

> > > > Sent "Set-Peaceful-Shutdown" command to local startd

> > > >

> > > > Sent "Kill-Daemon-Peacefully" command to local master

> > > >

> > > >

> > > >

> > > > [tannenba@localhost test]$ condor_status
> > > > Name            OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime
> > > > slot1@localhost LINUX  X86_64 Claimed   Retiring  0.000 330  0+00:00:03
> > > > slot2@localhost LINUX  X86_64 Unclaimed Idle      0.000 330  0+00:02:49
> > > > slot3@localhost LINUX  X86_64 Unclaimed Idle      0.000 330  0+00:00:06
> > > >
> > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > >        Total     3     0       1         2       0          0        0     0

> > > >

> > > > On 8/18/2016 11:11 AM, Bob Ball wrote:

> > > > > Just as a data point, I do, from our central manager machine,

> > > > >

> > > > > condor_off -peaceful -daemon startd -name $publicName

> > > > >

> > > > > and it runs just fine. All our jobs are vanilla. HTCondor is

> > > > > version

> > > > >

> > > > > 8.4.6 on Scientific Linux.

> > > > >

> > > > >

> > > > >

> > > > > bob

> > > > >

> > > > > On 8/18/2016 11:54 AM, Harald van Pee wrote:

> > > > >> Hi,

> > > > >>

> > > > >>

> > > > >>

> > > > >> I want to take a node that is running jobs offline, but only after
> > > > >> all running jobs have finished. Of course, until then no new jobs
> > > > >> should be accepted on that node.

> > > > >>

> > > > >>

> > > > >>

> > > > >> I tried the command:

> > > > >>

> > > > >>

> > > > >>

> > > > >> condor_off -peaceful -daemon startd node

> > > > >>

> > > > >>

> > > > >>

> > > > >> and got the message:

> > > > >>

> > > > >>

> > > > >>

> > > > >> Sent "Set-Peaceful-Shutdown" command to startd node

> > > > >>

> > > > >>

> > > > >>

> > > > >> Sent "Kill-Daemon-Peacefully" command to master node

> > > > >>

> > > > >>

> > > > >>

> > > > >> On node I see in StartLog

> > > > >>

> > > > >>

> > > > >>

> > > > >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.

> > > > >>

> > > > >>

> > > > >>

> > > > >> 08/18/16 17:20:49 shutdown graceful

> > > > >>

> > > > >>

> > > > >>

> > > > >> And indeed all jobs running in the vanilla universe (we have no
> > > > >> others) are killed immediately and restarted from the beginning.
> > > > >> This is what a graceful shutdown will do with vanilla jobs. But I
> > > > >> want a peaceful shutdown.

> > > > >>

> > > > >>

> > > > >>

> > > > >> Is a peaceful shutdown not possible for vanilla jobs?

> > > > >>

> > > > >>

> > > > >>

> > > > >> Do I have to change the configuration? We use:

> > > > >>

> > > > >>

> > > > >>

> > > > >> PREEMPT = FALSE

> > > > >>

> > > > >>

> > > > >>

> > > > >> PREEMPTION_REQUIREMENTS = False

> > > > >>

> > > > >>

> > > > >>

> > > > >> KILL = FALSE

> > > > >>

> > > > >>

> > > > >>

> > > > >> WANT_SUSPEND = false

> > > > >>

> > > > >>

> > > > >>

> > > > >> WANT_VACATE = false

> > > > >>

> > > > >>

> > > > >>

> > > > >> Or can I just use a different command?

> > > > >>

> > > > >>

> > > > >>

> > > > >> We use HTCondor 8.4.8 on Debian 8 (AMD64).

> > > > >>

> > > > >>

> > > > >>

> > > > >> Thanks

> > > > >>

> > > > >>

> > > > >>

> > > > >> Harald

> > > > >>

> > > > >>

> > > > >>

> > > > >>

> > > > >>

> > > > >>

> > > > >>



 

--

Harald van Pee

 

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn

Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505

mail: pee@xxxxxxxxxxxxxxxxx