
Re: [HTCondor-users] does condor_off -peaceful -daemon startd node; works for vanilla jobs?



On 8/18/2016 12:50 PM, Harald van Pee wrote:
Hi,

here is just some more information; one can see that everything happens within a
second and all jobs are gone (or restarted on another node).

[snip]

08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring
08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired


^^^ This line is the smoking gun. It says that either the job or the slot defines a MaxJobRetirementTime, and that the job has already been running longer than that defined period, so the startd immediately leaves the Claimed/Retiring state and goes to Preempting/Vacating (which sends a SIGTERM to the job).

For example, I get the exact same results as you when I do a condor_off -peaceful, with the exact same messages in the StartLog, if I submit a job that looks like the following:

  Executable = /bin/sleep
  Arguments = 60000
  # Allow the job to be preempted if HTCondor wants
  # to shut down or run a higher-priority job, if and
  # only if this job has already run for more than
  # one second.
  MaxJobRetirementTime = 1
  Queue
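
The same knob also exists on the machine side, in the startd configuration. As a sketch only (the one-day value below is an illustration, not something from this thread), a pool that wants retiring jobs to run to completion for up to a day could set:

  # Startd configuration sketch (illustrative value): allow a job on a
  # retiring claim to keep running for up to 24 hours, measured from
  # when the job started, before it is vacated.
  MAXJOBRETIREMENTTIME = 24 * 60 * 60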

In your tests, are you positive that neither the job(s) being preempted nor the execute node where the condor_startd is running defines MaxJobRetirementTime? It really looks that way to me. To check the job, use condor_q (or condor_history if the job has left the queue) and pass the "-af MaxJobRetirementTime" command-line argument. To check the condor_startd, if you are logged into the execute node that kicked off the job, try

  condor_config_val -startd -dump maxjobretirementtime
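
Spelled out, the job-side check would look like this (the 123.0 job id below is only a placeholder; substitute your own cluster.proc):

  # Print the job's MaxJobRetirementTime; "undefined" means the job
  # does not set it. 123.0 is a placeholder cluster.proc id.
  condor_q -af MaxJobRetirementTime 123.0

  # Same attribute, for a job that has already left the queue:
  condor_history -af MaxJobRetirementTime 123.0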


regards,
Todd


08/18/16 19:43:03 slot1_1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
08/18/16 19:43:03 Starter pid 6873 exited with status 0
08/18/16 19:43:03 slot1_1: State change: starter exited
08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning to owner
08/18/16 19:43:03 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false
08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed
08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete
08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting
08/18/16 19:43:03 Deleting cron job manager
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 Deleting benchmark job mgr
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Killing job mips
08/18/16 19:43:03 Killing job kflops
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Killing job mips
08/18/16 19:43:03 Killing job kflops
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 CronJobList: Deleting job 'mips'
08/18/16 19:43:03 CronJob: Deleting job 'mips' (/usr/lib/condor/libexec/condor_mips), timer -1
08/18/16 19:43:03 CronJobList: Deleting job 'kflops'
08/18/16 19:43:03 CronJob: Deleting job 'kflops' (/usr/lib/condor/libexec/condor_kflops), timer -1
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 All resources are free, exiting.
08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING WITH STATUS 0

On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:

 > @Bob: I also give the command from the central manager.
 >
 > @Todd:
 > I have no MaxJobRetirementTime defined (nothing with retire or time found
 > in condor_config*), not on the node, the scheduler, or the central manager.
 >
 > condor_status | grep node
 > slot1@node   LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04
 > slot1_1@node LINUX X86_64 Claimed   Busy 0.000  1024 0+00:00:03
 >
 > After
 >   condor_off -peaceful -daemon startd node
 > condor_status shows no node anymore (within 1 second, as fast as I can
 > type).
 >
 > We use
 >
 >   CLAIM_WORKLIFE = 120
 > and
 >   STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
 >
 >   NUM_SLOTS = 1
 >   SLOT_TYPE_1 = 100%
 >   SLOT_TYPE_1_PARTITIONABLE = true
 >   NUM_SLOTS_TYPE_1 = 1
 >
 > Any help is welcome.
 >
 > Harald
 >

 > On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:
 > > As another data point, it also seemed to work for me running a
 > > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.
 > > Behold the simple test below; note the node went from Claimed/Busy to
 > > Claimed/Retiring, which is expected. The "Retiring" activity is defined
 > > in the Manual (from https://is.gd/mi7mVk ):
 > >
 > >   Retiring
 > >     When an active claim is about to be preempted for any reason, it
 > >     enters retirement, while it waits for the current job to finish. The
 > >     MaxJobRetirementTime expression determines how long to wait (counting
 > >     since the time the job started). Once the job finishes or the
 > >     retirement time expires, the Preempting state is entered.
 > >
 > > Perhaps you have a MaxJobRetirementTime defined?
 > >
 > > regards,
 > > Todd
 > >
 > > [tannenba@localhost test]$ condor_status
 > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
 > > slot1@localhost LINUX X86_64 Claimed   Busy     0.000  330 0+00:00:04
 > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:05
 > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
 > >
 > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
 > > X86_64/LINUX     3     0       1         2       0          0        0     0
 > >        Total     3     0       1         2       0          0        0     0
 > >
 > > [tannenba@localhost test]$ condor_off -peaceful -daemon startd
 > > Sent "Set-Peaceful-Shutdown" command to local startd
 > > Sent "Kill-Daemon-Peacefully" command to local master
 > >
 > > [tannenba@localhost test]$ condor_status
 > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
 > > slot1@localhost LINUX X86_64 Claimed   Retiring 0.000  330 0+00:00:03
 > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:02:49
 > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
 > >
 > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
 > > X86_64/LINUX     3     0       1         2       0          0        0     0
 > >        Total     3     0       1         2       0          0        0     0
 > >

 > > On 8/18/2016 11:11 AM, Bob Ball wrote:
 > > > Just as a data point, I do, from our central manager machine,
 > > >   condor_off -peaceful -daemon startd -name $publicName
 > > > and it runs just fine. All our jobs are vanilla. HTCondor is version
 > > > 8.4.6 on Scientific Linux.
 > > >
 > > > bob
 > > >

 > > > On 8/18/2016 11:54 AM, Harald van Pee wrote:
 > > >> Hi,
 > > >>
 > > >> I want to set a job-running node offline, but only after all running
 > > >> jobs have finished. Of course, until then no new jobs should be
 > > >> accepted on that node.
 > > >>
 > > >> I tried the command:
 > > >>
 > > >>   condor_off -peaceful -daemon startd node
 > > >>
 > > >> and got the message:
 > > >>
 > > >>   Sent "Set-Peaceful-Shutdown" command to startd node
 > > >>   Sent "Kill-Daemon-Peacefully" command to master node
 > > >>
 > > >> On node I see in the StartLog:
 > > >>
 > > >>   08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.
 > > >>   08/18/16 17:20:49 shutdown graceful
 > > >>
 > > >> And indeed all jobs running in the vanilla universe (we have no
 > > >> others) are killed directly and restarted from the beginning. This is
 > > >> what a graceful shutdown will do with vanilla jobs. But I want to
 > > >> have a peaceful shutdown.
 > > >>
 > > >> Is a peaceful shutdown not possible for vanilla jobs?
 > > >>
 > > >> Do I have to change the configuration? We use:
 > > >>
 > > >>   PREEMPT = FALSE
 > > >>   PREEMPTION_REQUIREMENTS = False
 > > >>   KILL = FALSE
 > > >>   WANT_SUSPEND = false
 > > >>   WANT_VACATE = false
 > > >>
 > > >> Or can I just use a different command?
 > > >>
 > > >> We use condor 8.4.8 on debian 8 (AMD64).
 > > >>
 > > >> Thanks
 > > >> Harald


--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685