Hi,

here is some more information: as one can see, everything happens within a single second and all jobs are gone (or restarted on another node).
I issue the command on the central manager:
Thu Aug 18 19:43:03 CEST 2016
Sent "Set-Peaceful-Shutdown" command to startd node
Sent "Kill-Daemon-Peacefully" command to master node
Thu Aug 18 19:43:03 CEST 2016
And I see in /var/log/condor/StartLog on the node:

08/18/16 19:43:03 Got SIGTERM. Performing graceful shutdown.
08/18/16 19:43:03 shutdown graceful
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Killing job mips
08/18/16 19:43:03 Killing job kflops
08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring
08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired
08/18/16 19:43:03 slot1_1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
08/18/16 19:43:03 Starter pid 6873 exited with status 0
08/18/16 19:43:03 slot1_1: State change: starter exited
08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning to owner
08/18/16 19:43:03 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false
08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed
08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete
08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting
08/18/16 19:43:03 Deleting cron job manager
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 Deleting benchmark job mgr
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Killing job mips
08/18/16 19:43:03 Killing job kflops
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 Killing job mips
08/18/16 19:43:03 Killing job kflops
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 CronJobList: Deleting job 'mips'
08/18/16 19:43:03 CronJob: Deleting job 'mips' (/usr/lib/condor/libexec/condor_mips), timer -1
08/18/16 19:43:03 CronJobList: Deleting job 'kflops'
08/18/16 19:43:03 CronJob: Deleting job 'kflops' (/usr/lib/condor/libexec/condor_kflops), timer -1
08/18/16 19:43:03 Cron: Killing all jobs
08/18/16 19:43:03 CronJobList: Deleting all jobs
08/18/16 19:43:03 All resources are free, exiting.
08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING WITH STATUS 0
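For anyone comparing against their own setup: the retirement knob Todd asks about below lives in the startd configuration. A minimal sketch, with an illustrative value only (this thread reports the knob is not set at all, which leaves it at its default of 0):

```
# condor_config fragment -- sketch, value is illustrative.
# MaxJobRetirementTime is an expression evaluated in seconds since the
# job started; with the default of 0 there is no retirement period.
MAXJOBRETIREMENTTIME = 24 * 60 * 60

# What a given startd actually uses can be checked with, e.g.:
#   condor_config_val -name node -startd MAXJOBRETIREMENTTIME
```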
On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:
> @Bob: I also give the command from the central manager.
>
> @Todd:
> I have no MaxJobRetirementTime defined (nothing with retire or time
> found in condor_config*, not on node, scheduler or central manager).
>
> condor_status | grep node
> slot1@node   LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04
> slot1_1@node LINUX X86_64 Claimed   Busy 0.000  1024 0+00:00:03
>
> after
> condor_off -peaceful -daemon startd node
> condor_status shows no node anymore (within 1 second, as fast as I can
> type).
>
> We use
>
> CLAIM_WORKLIFE = 120
> and
> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>
> NUM_SLOTS = 1
> SLOT_TYPE_1 = 100%
> SLOT_TYPE_1_PARTITIONABLE = true
> NUM_SLOTS_TYPE_1 = 1
>
> Any help is welcome.
>
> Harald
>
> On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:
> > As another data point, it also seemed to work for me running a
> > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.
> > Behold the simple test below; note the node went from Claimed/Busy to
> > Claimed/Retiring, which is expected. "Retiring" activity is defined in
> > the Manual (from https://is.gd/mi7mVk ):
> >
> > Retiring
> >
> > When an active claim is about to be preempted for any reason, it
> > enters retirement, while it waits for the current job to finish. The
> > MaxJobRetirementTime expression determines how long to wait (counting
> > since the time the job started). Once the job finishes or the
> > retirement time expires, the Preempting state is entered.
> >
> > Perhaps you have a MaxJobRetirementTime defined?
> > regards,
> > Todd
> >
> > [tannenba@localhost test]$ condor_status
> > Name             OpSys  Arch   State     Activity LoadAv Mem ActvtyTime
> > slot1@localhost  LINUX  X86_64 Claimed   Busy     0.000  330 0+00:00:04
> > slot2@localhost  LINUX  X86_64 Unclaimed Idle     0.000  330 0+00:00:05
> > slot3@localhost  LINUX  X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> >
> >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > X86_64/LINUX     3     0       1         2       0          0        0     0
> >        Total     3     0       1         2       0          0        0     0
> >
> > [tannenba@localhost test]$ condor_off -peaceful -daemon startd
> > Sent "Set-Peaceful-Shutdown" command to local startd
> > Sent "Kill-Daemon-Peacefully" command to local master
> >
> > [tannenba@localhost test]$ condor_status
> > Name             OpSys  Arch   State     Activity LoadAv Mem ActvtyTime
> > slot1@localhost  LINUX  X86_64 Claimed   Retiring 0.000  330 0+00:00:03
> > slot2@localhost  LINUX  X86_64 Unclaimed Idle     0.000  330 0+00:02:49
> > slot3@localhost  LINUX  X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> >
> >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > X86_64/LINUX     3     0       1         2       0          0        0     0
> >        Total     3     0       1         2       0          0        0     0
> >
> > On 8/18/2016 11:11 AM, Bob Ball wrote:
> > > Just as a data point, I do, from our central manager machine,
> > > condor_off -peaceful -daemon startd -name $publicName
> > > and it runs just fine. All our jobs are vanilla. HTCondor is version
> > > 8.4.6 on Scientific Linux.
> > >
> > > bob
> > >
> > > On 8/18/2016 11:54 AM, Harald van Pee wrote:
> > >> Hi,
> > >>
> > >> I want to set a job running node offline, but only after all running
> > >> jobs have finished. Of course until then no new jobs should be
> > >> accepted on that node.
> > >> I tried the command:
> > >>
> > >> condor_off -peaceful -daemon startd node
> > >>
> > >> and got the message:
> > >>
> > >> Sent "Set-Peaceful-Shutdown" command to startd node
> > >> Sent "Kill-Daemon-Peacefully" command to master node
> > >>
> > >> On the node I see in StartLog:
> > >>
> > >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.
> > >> 08/18/16 17:20:49 shutdown graceful
> > >>
> > >> And indeed all jobs running in the vanilla universe (we have no
> > >> others) are killed directly and started from the beginning. This is
> > >> what a graceful shutdown will do with vanilla jobs. But I want to
> > >> have a peaceful shutdown.
> > >>
> > >> Is a peaceful shutdown not possible for vanilla jobs?
> > >>
> > >> Do I have to change the configuration? We use:
> > >>
> > >> PREEMPT = FALSE
> > >> PREEMPTION_REQUIREMENTS = False
> > >> KILL = FALSE
> > >> WANT_SUSPEND = false
> > >> WANT_VACATE = false
> > >>
> > >> Or can I use just a different command?
> > >>
> > >> We use condor 8.4.8 on Debian 8 (AMD64).
> > >>
> > >> Thanks
> > >>
> > >> Harald
> > >>
> > >> _______________________________________________
> > >> HTCondor-users mailing list
> > >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> > >> with a subject: Unsubscribe
> > >> You can also unsubscribe by visiting
> > >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > >>
> > >> The archives can be found at:
> > >> https://lists.cs.wisc.edu/archive/htcondor-users/
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx