
Re: [HTCondor-users] Jobs vacating on schedd restart



Michael/Todd:

The Puppet code you wrote runs "service condor stop" and "service condor start" (or their systemd equivalents) upon notification. What you want is something more like this:

 exec { 'update-condor-configuration':
   command     => '/usr/sbin/condor_reconfig',
   user        => 'root',
   refreshonly => true,
   subscribe   => File['/etc/condor'],
   logoutput   => 'on_failure',
   require     => Package['condor'],
 }

Here, I manage all files under /etc/condor using a single recursive File resource.
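A rough sketch of that File resource (the source module path, ownership, and purge setting below are illustrative, not my exact manifest):

 file { '/etc/condor':
   ensure  => directory,
   recurse => true,
   purge   => false,   # only set true if Puppet owns every file under /etc/condor
   owner   => 'root',
   group   => 'root',
   source  => 'puppet:///modules/condor/etc/condor',
   require => Package['condor'],
 }

Any change to a file under /etc/condor then triggers the condor_reconfig exec above through its subscribe.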

I deliberately don't ensure Condor is running with a Service resource, as I don't want Puppet to ever consider (re)starting Condor. I'm also not sure what "service condor status" returns for the various states of the master and its subdaemons.

Tom

--
Tom Downes
Senior Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

On Wed, Jan 18, 2017 at 10:43 AM, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
Hi Todd,

yes, that seems to be the difference:
- On SL6, the init.d script kills the master with SIGQUIT, i.e. fast [1].
- On SL7, the condor.service systemd definition uses
    ExecStart=/usr/sbin/condor_master -f
    ExecStop=/usr/sbin/condor_off -master

I will just patch in the "-fast" flag for now (a sketch of how follows the points below). Some thoughts on solving this cleanly:
- Startds will kill off their jobs immediately with -fast, correct? So it's probably not a good idea to roll this out to hosts without a schedd?
- Would setting SCHEDD.SHUTDOWN_GRACEFUL_TIMEOUT = 0 allow the schedd to always fast-terminate, while keeping the startd shutdown graceful? Or does it need some time to detect the timeout?
- If anyone reads this and cares just about Puppet (not manual restarts), have a look at [2]. Probably the cleanest solution in most cases.
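For the systemd case, the "-fast" patch could be carried as a drop-in override instead of editing the packaged unit. A sketch in Puppet (untested; the unit name, drop-in path, and the daemon-reload exec are my assumptions):

 file { '/etc/systemd/system/condor.service.d':
   ensure => directory,
 }

 file { '/etc/systemd/system/condor.service.d/fast-stop.conf':
   ensure  => file,
   # The first ExecStop= clears the packaged value; the second stops the
   # master fast, so running jobs are left alone and shadows reconnect.
   content => "[Service]\nExecStop=\nExecStop=/usr/sbin/condor_off -fast -master\n",
   notify  => Exec['condor-systemd-daemon-reload'],
 }

 exec { 'condor-systemd-daemon-reload':
   command     => '/usr/bin/systemctl daemon-reload',
   refreshonly => true,
 }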

Cheers,
Max

[1]
http://research.cs.wisc.edu/htcondor/manual/current/3_9DaemonCore.html#SECTION00491000000000000000

[2]
https://github.com/HEP-Puppet/htcondor/blob/master/manifests/config.pp#L435


> On 18.01.2017, at 13:32, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>
> Just did a quick scan of the below, and my first thought is that the schedd is being told to shut down "gracefully" (as evidenced by the SIGTERM message in [3] below).
>
> If on a schedd machine you do a
>   condor_off -fast
> or a
>   condor_restart -fast
> then the schedd will exit, leave all the jobs running on the execute nodes, and on subsequent restart the schedd will attempt to reconnect to the running jobs if the job lease has not expired. Sounds like you desire the -fast behavior.
>
> If you omit the "-fast" argument above, the schedd will attempt to vacate all claimed machines before exiting, thus moving all running jobs back to idle. Any jobs that can checkpoint will be allowed to do so.
>
> My guess from a quick read is that on SL6 the service interface scripts are using "-fast" with condor_off/restart, but the SL7 systemctl interface scripts are not?
>
> Personally I think the "-fast" behavior should be the default for off/restart, since I think that is what most admins would want most of the time.
>
> Hope the above helps
> Todd
>
>
> Sent from my iPhone
>
>> On Jan 18, 2017, at 6:03 AM, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
>>
>> To clarify "we would rather prefer jobs to restart":
>> I would expect shadows to restart and reconnect to jobs. Jobs on the startd should keep on running for some time without the shadow.
>>
>> Cheerio,
>> Max
>>
>>> On 18.01.2017, at 11:45, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
>>>
>>> Hi all,
>>>
>>> We've seen a massive number of jobs vacating after a schedd restart (via Puppet). This seems to have affected many but *not all* jobs.
>>> While we know the trigger, the behaviour doesn't make sense to us. On another cluster used for testing, schedds restart frequently without evicting jobs.
>>> Since such change-config-and-restart cycles are not uncommon, we would rather prefer jobs to restart...
>>>
>>> Doing manual restart via systemctl/service leads to the same results. The main difference between clusters is the OS.
>>> I'm not aware of any configuration switches for this, so I cannot rule out that one exists and is set differently.
>>>
>>> Failing Schedd:
>>>   OS SL7
>>>   Condor 8.4.10
>>>   systemctl
>>> Testing Schedd:
>>>   OS SL6
>>>   Condor 8.4.10
>>>   service
>>>
>>> Short story of what happened, logs below:
>>> - Puppet was told to implement config changes related to History rotation [1]
>>> - Puppet ran, changed configs, and notified/restarted condor service [2]
>>> - the local Master is told to shut down, and shuts down the schedd [3]
>>>   + we should probably fix the permissions here (note the PERMISSION DENIED entry)
>>> - the schedd is told to shut down, and shuts down the shadows [4]
>>>   + The line "Called preempt( 1 ) for a graceful shutdown" is not present on our test cluster when restarting
>>> - the shadows stop [5]
>>>   + Our test cluster has no equivalent of the eviction messages
>>>   + The message indicates that eviction is unexpected
>>>   + The signal sent to the shadows is the same on both clusters
>>> - the remote startd disables the claim and evicts the job [6]
>>>   + Obviously, we do NOT see this in the test cluster
>>>   + This happens at the same time as the Schedd/Shadow die, with no delays or timeouts
>>>   + The starter just dies
>>> - the schedd restarts, and Shadows try to reconnect [7]
>>> - the schedd writes its restart log for reconnection [8]
>>>   + Extracted from another host, but same behaviour
>>>   + Lots of jobs seem to have survived
>>>
>>> I'm rather confused that the Schedd/Shadow seem not to expect jobs to fail, and even try to reconnect to them. As far as I can tell, it is the Schedd/Shadow that trigger jobs to vacate.
>>>
>>> Cheers,
>>> Max
>>>
>>> [1]
>>> SCHEDD.ROTATE_HISTORY_DAILY = True
>>> SCHEDD.MAX_HISTORY_LOG = 100 * $(MEGABYTE)
>>> SCHEDD.MAX_HISTORY_ROTATIONS = 4
>>>
>>> [2]
>>> file { '/etc/condor/config.d/060.condor_schedd.cfg':
>>>   content => template('middleware_arc_ce/condor/condor_schedd.erb'),
>>>   mode    => '0644',
>>>   notify  => Service['condor'],
>>> }
>>> # ....
>>> service { 'condor':
>>>   ensure => 'running',
>>>   enable => true,
>>> }
>>>
>>> [3]
>>> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) PERMISSION DENIED to condor@xxxxxxxxx from host 10.97.13.133 for command 60005 (DC_OFF_GRACEFUL), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.97.13.133,arc-1-kit.gridka.de, hostname size = 1, original ip address = 10.97.13.133
>>> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) DC_AUTHENTICATE: Command not authorized, done!
>>> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
>>> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Sent SIGTERM to SCHEDD (pid 1712004)
>>>
>>> [4]
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Cron: Killing all jobs
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Called preempt( 1 ) for a graceful shutdown
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.176:17048?addrs=10.1.106.176-17048> [pid 3694145] for job 592252.0
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.117:20778?addrs=10.1.106.117-20778> [pid 3694147] for job 592234.0
>>>
>>> [5] Failed Job
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.103.101:20740?addrs=10.1.103.101-20740> [pid 2312872] for job 598067.0
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS|D_FAILURE) Shadow pid 2312872 died with signal 9 (Killed)
>>> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Match record (slot1@xxxxxxxxxxxxxxxxxxxxx <10.1.103.101:20740?addrs=10.1.103.101-20740> for SGM.default.SC.arc, 598067.0) deleted
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Got SIGTERM. Performing graceful shutdown.
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Job 598067.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxxxxx
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): logEvictEvent with unknown reason (-1), not logging.
>>>
>>> [6]
>>> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Called deactivate_claim()
>>> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>>> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
>>> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got RELEASE_CLAIM while in Preempting state, ignoring.
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: starter exited
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: No preempting claim, returning to owner
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: IS_OWNER is false
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Owner -> Unclaimed
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Unclaimed -> Delete
>>> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Resource no longer needed, deleting
>>>
>>> [7] Failed Job Reconnect
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) Initializing a VANILLA shadow for job 598067.0
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Trying to reconnect to disconnected job
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): LastJobLeaseRenewal: 1484571266 Mon Jan 16 13:54:26 2017
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration: 2400 seconds
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2346
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Found starter: <10.1.103.101:20845?addrs=10.1.103.101-20845>
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to reconnect to starter <10.1.103.101:20845?addrs=10.1.103.101-20845>
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect SUCCESS: connection re-established
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): File transfer completed successfully.
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Can no longer talk to condor_starter <10.1.103.101:20845>
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2400
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): locateStarter(): ClaimId (<10.1.103.101:20740>#1480163328#11511#[Integrity="YES";Encryption="NO";CryptoMethods="3DES";]d48495f13fb528b07349833922f07601c37379d9) and GlobalJobId ( arc-1-kit.gridka.de#598067.0#1484571243 ) not found
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect FAILED: Job not found at execution machine
>>> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Exiting with JOB_SHOULD_REQUEUE
>>>
>>> [8] /var/log/condor/ScheddRestartReport
>>> The schedd arc-2-kit.gridka.de restarted at 01/16/17 14:01:02.
>>> It attempted to reconnect to machines where its jobs may still be running.
>>> All reconnect attempts have finished.
>>> Here is a summary of the reconnect attempts:
>>>
>>> 1425 total jobs where reconnecting is possible
>>> 0 reconnects are still being attempted
>>> 0 reconnects weren't attempted because the lease expired before the schedd restarted
>>> 377 reconnects failed
>>> 0 reconnects were interrupted by job removal or other event
>>> 1048 reconnects succeeded


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/