
[HTCondor-users] Jobs vacating on schedd restart



Hi all,

We've seen a massive number of jobs vacating after a schedd restart (via Puppet). This seems to have affected many, but *not all*, jobs.
While we know the trigger, the behaviour doesn't make sense to us: on another cluster used for testing, schedds restart frequently without evicting jobs.
Since such change-config-and-restart cycles are not uncommon, we would much prefer jobs to survive them...

Doing a manual restart via systemctl/service leads to the same result. The main difference between the clusters is the OS (and hence the init system).
I'm not aware of any configuration switch covering this behaviour, so I cannot rule out that one exists and is set differently on the two clusters.
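
If such a switch does exist, diffing the effective configuration of the two schedds should surface it. A sketch of what I mean (on the real hosts each input file would come from `condor_config_val -dump`; the two files below just stand in for those dumps, with made-up contents):

```shell
# Diff two (simulated) config dumps; on each real host, generate the input with:
#   condor_config_val -dump | sort > dump.$(hostname)
a=$(mktemp); b=$(mktemp)
printf 'MAX_HISTORY_LOG = 104857600\nROTATE_HISTORY_DAILY = True\n'  | sort > "$a"
printf 'MAX_HISTORY_LOG = 104857600\nROTATE_HISTORY_DAILY = False\n' | sort > "$b"
# diff exits non-zero when the files differ, which is the interesting case here
delta=$(diff "$a" "$b" || true)
echo "$delta"
rm -f "$a" "$b"
```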

Failing Schedd:
	OS SL7
	Condor 8.4.10
	systemctl
Testing Schedd:
	OS SL6
	Condor 8.4.10
	service

The short story of what happened (numbered logs below):
- Puppet was told to implement config changes related to History rotation [1]
- Puppet ran, changed configs, and notified/restarted condor service [2]
- the local Master is told to shut down, and shuts down the schedd [3]
	+ we should probably fix the ADMINISTRATOR authorization (note the PERMISSION DENIED line)
- the schedd is told to shut down, and shuts down the shadows [4]
	+ The line "Called preempt( 1 ) for a graceful shutdown" is not present on our test cluster when restarting
- the shadows stop [5]
	+ Our test cluster has no equivalent of the eviction messages
	+ The message indicates that eviction is unexpected
	+ The signal sent to the shadows is the same on both clusters
- the remote startd disables the claim and evicts the job [6]
	+ Obviously, we do NOT see this in the test cluster
	+ This happens at the same moment the Schedd/Shadow die, not after any delay or timeout
	+ The starter just dies
- the schedd restarts, and Shadows try to reconnect [7]
- the schedd writes its restart log for reconnection [8]
	+ Extracted from another host, but same behaviour
	+ Lots of jobs seem to have survived
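
The quickest way I found to tell the two behaviours apart is grepping the SchedLog for the preempt line. A minimal sketch (the sample line is copied from [4]; a temporary file stands in for /var/log/condor/SchedLog*):

```shell
# Check a SchedLog for the tell-tale shutdown-preempt line from [4].
LOG=$(mktemp)
echo '01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Called preempt( 1 ) for a graceful shutdown' > "$LOG"
if grep -q 'Called preempt( 1 )' "$LOG"; then
    verdict='schedd preempted claims on shutdown'
else
    verdict='no shutdown preempt seen'
fi
echo "$verdict"
rm -f "$LOG"
```

On the test cluster the same grep against its SchedLog comes back empty.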

I'm rather confused: the Schedd/Shadow seem not to expect the jobs to fail, and even try to reconnect to them afterwards. Yet as far as I can tell, it is the Schedd/Shadow themselves that trigger the jobs to vacate in the first place.

Cheers,
Max

[1]
SCHEDD.ROTATE_HISTORY_DAILY = True
SCHEDD.MAX_HISTORY_LOG = 100 * $(MEGABYTE)
SCHEDD.MAX_HISTORY_ROTATIONS = 4

[2]
  file {'/etc/condor/config.d/060.condor_schedd.cfg':
    content => template('middleware_arc_ce/condor/condor_schedd.erb'),
    mode    => '0644',
    notify  => Service['condor'],
  }
  # ....
  service{'condor':
    ensure => 'running',
    enable => true,
  }

[3]
01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) PERMISSION DENIED to condor@xxxxxxxxx from host 10.97.13.133 for command 60005 (DC_OFF_GRACEFUL), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.97.13.133,arc-1-kit.gridka.de, hostname size = 1, original ip address = 10.97.13.133
01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) DC_AUTHENTICATE: Command not authorized, done!
01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
01/16/17 13:54:29 (pid:1172021) (D_ALWAYS) Sent SIGTERM to SCHEDD (pid 1712004)

[4]
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Cron: Killing all jobs
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Called preempt( 1 ) for a graceful shutdown
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.176:17048?addrs=10.1.106.176-17048> [pid 3694145] for job 592252.0
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.106.117:20778?addrs=10.1.106.117-20778> [pid 3694147] for job 592234.0

[5] Failed Job
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Sent signal 9 to <10.1.103.101:20740?addrs=10.1.103.101-20740> [pid 2312872] for job 598067.0
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS|D_FAILURE) Shadow pid 2312872 died with signal 9 (Killed)
/var/log/condor/SchedLog.20170116T205919:01/16/17 13:54:29 (pid:1712004) (D_ALWAYS) Match record (slot1@xxxxxxxxxxxxxxxxxxxxx <10.1.103.101:20740?addrs=10.1.103.101-20740> for SGM.default.SC.arc, 598067.0) deleted
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Got SIGTERM. Performing graceful shutdown.
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): Job 598067.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxxxxx
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:54:29 (pid:2312872) (D_ALWAYS) (598067.0) (2312872): logEvictEvent with unknown reason (-1), not logging.

[6]
01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Called deactivate_claim()
01/16/17 13:54:29 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got DEACTIVATE_CLAIM while in Preempting state, ignoring.
01/16/17 13:54:59 (pid:23886) (D_ALWAYS) slot1_11: Got RELEASE_CLAIM while in Preempting state, ignoring.
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: starter exited
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: No preempting claim, returning to owner
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: State change: IS_OWNER is false
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Owner -> Unclaimed
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Changing state: Unclaimed -> Delete
01/16/17 13:55:01 (pid:23886) (D_ALWAYS) slot1_11: Resource no longer needed, deleting

[7] Failed Job Reconnect
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) Initializing a VANILLA shadow for job 598067.0
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Trying to reconnect to disconnected job
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): LastJobLeaseRenewal: 1484571266 Mon Jan 16 13:54:26 2017
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration: 2400 seconds
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2346
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Found starter: <10.1.103.101:20845?addrs=10.1.103.101-20845>
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to reconnect to starter <10.1.103.101:20845?addrs=10.1.103.101-20845>
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect SUCCESS: connection re-established
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): File transfer completed successfully.
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Can no longer talk to condor_starter <10.1.103.101:20845>
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): JobLeaseDuration remaining: 2400
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Attempting to locate disconnected starter
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): locateStarter(): ClaimId (<10.1.103.101:20740>#1480163328#11511#[Integrity="YES";Encryption="NO";CryptoMethods="3DES";]d48495f13fb528b07349833922f07601c37379d9) and GlobalJobId ( arc-1-kit.gridka.de#598067.0#1484571243 ) not found
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Reconnect FAILED: Job not found at execution machine
/var/log/condor/ShadowLog.20170117T031919:01/16/17 13:55:20 (pid:2316830) (D_ALWAYS) (598067.0) (2316830): Exiting with JOB_SHOULD_REQUEUE

[8] /var/log/condor/ScheddRestartReport
The schedd arc-2-kit.gridka.de restarted at 01/16/17 14:01:02.
It attempted to reconnect to machines where its jobs may still be running.
All reconnect attempts have finished.
Here is a summary of the reconnect attempts:

1425 total jobs where reconnecting is possible
0 reconnects are still being attempted
0 reconnects weren't attempted because the lease expired before the schedd restarted
377 reconnects failed
0 reconnects were interrupted by job removal or other event
1048 reconnects succeeded
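
As a sanity check, the report's counters are self-consistent, and work out to roughly a quarter of reconnectable jobs lost (throwaway arithmetic, nothing condor-specific):

```shell
# The five counters from the restart report should add up to the total,
# and give the share of reconnectable jobs that were actually lost.
failed=377; succeeded=1048; pending=0; expired=0; interrupted=0
total=1425
sum=$(( failed + succeeded + pending + expired + interrupted ))
echo "sum=$sum (expected $total)"
lost_pct=$(( 100 * failed / total ))
echo "lost=${lost_pct}%"   # integer percentage, rounded down
```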
