
Re: [HTCondor-users] Jobs vacating on schedd restart



To clarify "we would prefer jobs to restart":
I would expect the shadows to restart and reconnect to their jobs; the jobs on the startds should keep running for some time without a shadow.
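
As a side note, the mechanism that should bridge such a gap is the job lease. A minimal sketch of the relevant knob (the 2400 s that shows up as JobLeaseDuration in [7] below is the documented default; treat the exact value as an example):

	# condor_config: default for the job_lease_duration submit command.
	# The startd keeps a disconnected job alive this long, so a restarted
	# shadow can reconnect to it.
	JOB_DEFAULT_LEASE_DURATION = 2400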

Cheerio,
Max

> On 18.01.2017, at 11:45, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
> 
> Hi all,
> 
> We've seen a massive number of jobs vacating after a schedd restart (via Puppet). This seems to have affected many, but *not all*, jobs.
> While we know the trigger, the behaviour doesn't make sense to us. On another cluster used for testing, schedds restart frequently without evicting jobs.
> Since such change-config-and-restart cycles are not uncommon, we would prefer jobs to restart...
> 
> Doing a manual restart via systemctl/service leads to the same result. The main difference between the clusters is the OS.
> I'm not aware of any configuration switch governing this, so I cannot rule out that one exists and is set differently.
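> 
> To rule that out, one could dump and diff the effective configuration of both schedds; a rough sketch (run locally on each submit host; the file names are just examples):
> 
> 	condor_config_val -dump | sort > /tmp/config-$(hostname).txt
> 	# copy both dumps to one host, then:
> 	diff /tmp/config-sl7-host.txt /tmp/config-sl6-host.txt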
> 
> Failing Schedd:
> 	OS SL7
> 	Condor 8.4.10
> 	systemctl
> Testing Schedd:
> 	OS SL6
> 	Condor 8.4.10
> 	service
> 
> Short story of what happened, logs below:
> - Puppet was told to implement config changes related to History rotation [1]
> - Puppet ran, changed configs, and notified/restarted condor service [2]
> - the local Master is told to shut down, and shuts down the schedd [3]
> 	+ note the PERMISSION DENIED entry; we should probably fix these permissions (a config sketch follows the log in [3] below)
> - the schedd is told to shut down, and shuts down the shadows [4]
> 	+ The line "Called preempt( 1 ) for a graceful shutdown" is not present on our test cluster when restarting
> - the shadows stop [5]
> 	+ Our test cluster has no equivalent of the eviction messages
> 	+ The message indicates that eviction is unexpected
> 	+ The signal sent to the shadows is the same on both clusters
> - the remote startd disables the claim and evicts the job [6]
> 	+ Obviously, we do NOT see this in the test cluster
> 	+ This happens at the same time the Schedd/Shadow dies, not after any delay or timeout
> 	+ The starter just dies
> - the schedd restarts, and Shadows try to reconnect [7]
> - the schedd writes its restart log for reconnection [8]
> 	+ Extracted from another host, but same behaviour
> 	+ Lots of jobs seem to have survived
> 
> I'm rather confused that the Schedd/Shadow seem not to expect the jobs to fail, and even try to reconnect to them. As far as I can tell, it is the Schedd/Shadow that triggers the jobs to vacate.
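> 
> (For reference, one way to spot jobs that were restarted from scratch after the failed reconnects, using the standard NumJobStarts/NumShadowStarts job attributes; a sketch, not verified on 8.4:
> 
> 	condor_q -constraint 'NumJobStarts > 1' -af:j NumJobStarts NumShadowStarts
> )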
> 
> Cheers,
> Max
> 
> [1]
> SCHEDD.ROTATE_HISTORY_DAILY = True
> SCHEDD.MAX_HISTORY_LOG = 100 * $(MEGABYTE)
> SCHEDD.MAX_HISTORY_ROTATIONS = 4
> 
> [2]
>  file {'/etc/condor/config.d/060.condor_schedd.cfg':
>    content => template('middleware_arc_ce/condor/condor_schedd.erb'),
>    mode    => '0644',
>    notify  => Service['condor'],
>  }
>  # ....
>  service{'condor':
>    ensure => 'running',
>    enable => true,
>  }
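> 
> A possible mitigation (untested; assumes the history-rotation knobs are picked up by a reconfig, and the usual RPM path for the binary): let Puppet trigger a reconfig instead of a full restart for such changes, e.g.
> 
>  service{'condor':
>    ensure  => 'running',
>    enable  => true,
>    restart => '/usr/sbin/condor_reconfig',
>  }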
> 
> [3]
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) PERMISSION DENIED to condor@xxxxxxxxx from host 10.97.13.133 for command 60005 (DC_OFF_GRACEFUL), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.97.13.133,arc-1-kit.gridka.de, hostname size = 1, original ip address = 10.97.13.133
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) DC_AUTHENTICATE: Command not authorized, done!
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
> 01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Sent SIGTERM to SCHEDD (pid 1712004)
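> 
> (A sketch of the kind of entry that seems to be missing for [3]; the exact identity and security setup are site-specific, so treat this as an assumption, not a tested fix:
> 
> ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR) condor@*/arc-1-kit.gridka.de
> )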
> 
> [4]
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Cron: Killing all jobs
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Called preempt( 1 ) for a graceful shutdown
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.176:17048?addrs=10.1.106.176-17048> [pid 3694145] for job 592252.0
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.117:20778?addrs=10.1.106.117-20778> [pid 3694147] for job 592234.0
> 
> [5] Failed Job
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.103.101:20740?addrs=10.1.103.101-20740> [pid 2312872] for job 598067.0
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS|D_FAILURE) Shadow pid 2312872 died with signal 9 (Killed)
> /var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Match record (slot1@xxxxxxxxxxxxxxxxxxxxx <10.1.103.101:20740?addrs=10.1.103.101-20740> for SGM.default.SC.arc, 598067.0) deleted
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Got SIGTERM. Performing graceful shutdown.
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Job 598067.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxxxxx
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): logEvictEvent with unknown reason (-1), not logging.
> 
> [6]
> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Called deactivate_claim()
> 01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
> 01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got RELEASE_CLAIM while in Preempting state, ignoring.
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: starter exited
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: No preempting claim, returning to owner
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: IS_OWNER is false
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Owner -> Unclaimed
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Unclaimed -> Delete
> 01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Resource no longer needed, deleting
> 
> [7] Failed Job Reconnect
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) Initializing a VANILLA shadow for job 598067.0
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Trying to reconnect to disconnected job
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): LastJobLeaseRenewal: 1484571266 Mon Jan 16 13:54:26 2017
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration: 2400 seconds
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2346
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Found starter: <10.1.103.101:20845?addrs=10.1.103.101-20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to reconnect to starter <10.1.103.101:20845?addrs=10.1.103.101-20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect SUCCESS: connection re-established
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): File transfer completed successfully.
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Can no longer talk to condor_starter <10.1.103.101:20845>
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2400
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): locateStarter(): ClaimId (<10.1.103.101:20740>#1480163328#11511#[Integrity="YES";Encryption="NO";CryptoMethods="3DES";]d48495f13fb528b07349833922f07601c37379d9) and GlobalJobId ( arc-1-kit.gridka.de#598067.0#1484571243 ) not found
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect FAILED: Job not found at execution machine
> /var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Exiting with JOB_SHOULD_REQUEUE
> 
> [8] /var/log/condor/ScheddRestartReport
> The schedd arc-2-kit.gridka.de restarted at 01/16/17 14:01:02.
> It attempted to reconnect to machines where its jobs may still be running.
> All reconnect attempts have finished.
> Here is a summary of the reconnect attempts:
> 
> 1425 total jobs where reconnecting is possible
> 0 reconnects are still being attempted
> 0 reconnects weren't attempted because the lease expired before the schedd restarted
> 377 reconnects failed
> 0 reconnects were interrupted by job removal or other event
> 1048 reconnects succeeded
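> 
> (These counts should be re-derivable from the shadow logs; a rough sketch using the Reconnect SUCCESS/FAILED lines shown in [7]:
> 
> 	grep -ho 'Reconnect \(SUCCESS\|FAILED\)' /var/log/condor/ShadowLog* | sort | uniq -c
> )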
