
[HTCondor-users] schedd restart does not respawn the shadows



Hi,

our production schedd is behaving in a non-standard way.

After a schedd restart, or even a reboot of the server, the shadows for the
jobs that were already running are not respawned, and the condor_q command no longer reports any running jobs.
The jobs keep running on the execution machines until the lease expires.

I have not been able to reproduce this behaviour on a test schedd instance with the same configuration.

Thanks in advance for any hints you can share.

Ale

The following messages come from the production schedd, which shows the non-standard behaviour:

****
06/28/17 03:52:09 (pid:217464) Shadow pid 823969 for job 50113.0 exited with status 112
06/28/17 03:52:09 (pid:217464) Putting job 50113.0 on hold
06/28/17 05:11:02 (pid:1093541) Setting maximum file descriptors to 4096.
06/28/17 05:11:02 (pid:1093541) ******************************************************
06/28/17 05:11:02 (pid:1093541) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
06/28/17 05:11:02 (pid:1093541) ** /usr/sbin/condor_schedd
06/28/17 05:11:02 (pid:1093541) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
06/28/17 05:11:02 (pid:1093541) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
06/28/17 05:11:02 (pid:1093541) ** $CondorVersion: 8.4.6 Apr 20 2016 BuildID: 364106 $
06/28/17 05:11:02 (pid:1093541) ** $CondorPlatform: x86_64_RedHat6 $
06/28/17 05:11:02 (pid:1093541) ** PID = 1093541
06/28/17 05:11:02 (pid:1093541) ** Log last touched 6/28 03:52:09
06/28/17 05:11:02 (pid:1093541) ******************************************************
06/28/17 05:11:02 (pid:1093541) Using config source: /etc/condor/condor_config
06/28/17 05:11:02 (pid:1093541) Using local config sources:
06/28/17 05:11:02 (pid:1093541)    /etc/condor/config.d/condor_config_base
06/28/17 05:11:02 (pid:1093541)    /etc/condor/config.d/condor_config_history
06/28/17 05:11:02 (pid:1093541)    /etc/condor/config.d/condor_config_jobs
06/28/17 05:11:02 (pid:1093541)    /etc/condor/config.d/condor_config_scheduler
06/28/17 05:11:02 (pid:1093541)    /etc/condor/config.d/condor_config_security
06/28/17 05:11:02 (pid:1093541) config Macros = 88, Sorted = 88, StringBytes = 3048, TablesBytes = 3248
06/28/17 05:11:02 (pid:1093541) CLASSAD_CACHING is ENABLED
06/28/17 05:11:02 (pid:1093541) Daemon Log is logging: D_ALWAYS D_ERROR
06/28/17 05:11:02 (pid:1093541) SharedPortEndpoint: waiting for connections to named socket 217453_9047_12
06/28/17 05:11:02 (pid:1093541) DaemonCore: command socket at <90.147.169.224:9618?addrs=90.147.169.224-9618&noUDP&sock=217453_9047_12>
06/28/17 05:11:02 (pid:1093541) DaemonCore: private command socket at <90.147.169.224:9618?addrs=90.147.169.224-9618&noUDP&sock=217453_9047_12>
06/28/17 05:11:02 (pid:1093541) History file rotation is enabled.
06/28/17 05:11:02 (pid:1093541)   Maximum history file size is: 1073741824 bytes
06/28/17 05:11:02 (pid:1093541)   Number of rotated history files is: 365
06/28/17 05:11:02 (pid:1093541) Failed to execute /usr/sbin/condor_shadow.std, ignoring
06/28/17 05:11:37 (pid:1093541) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/28/17 05:11:39 (pid:1093541) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
06/28/17 05:11:39 (pid:1093541) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/28/17 05:11:39 (pid:1093541) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
****


The following messages come from the test schedd, which behaves as expected [the shadows are respawned]
[in this example the schedd received a "kill -9"]:

****
06/28/17 11:38:10 (pid:2206) Number of Active Workers 0
06/28/17 11:38:21 (pid:11089) Setting maximum file descriptors to 4096.
06/28/17 11:38:21 (pid:11089) ******************************************************
06/28/17 11:38:21 (pid:11089) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
06/28/17 11:38:21 (pid:11089) ** /usr/sbin/condor_schedd
06/28/17 11:38:21 (pid:11089) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
06/28/17 11:38:21 (pid:11089) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
06/28/17 11:38:21 (pid:11089) ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $
06/28/17 11:38:21 (pid:11089) ** $CondorPlatform: x86_64_RedHat6 $
06/28/17 11:38:21 (pid:11089) ** PID = 11089
06/28/17 11:38:21 (pid:11089) ** Log last touched 6/28 11:38:10
06/28/17 11:38:21 (pid:11089) ******************************************************
06/28/17 11:38:21 (pid:11089) Using config source: /etc/condor/condor_config
06/28/17 11:38:21 (pid:11089) Using local config sources:
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_base
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_history
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_jobs
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_scheduler
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_security
06/28/17 11:38:21 (pid:11089)    /etc/condor/config.d/condor_config_sub_expr
06/28/17 11:38:21 (pid:11089) config Macros = 80, Sorted = 80, StringBytes = 2537, TablesBytes = 2968
06/28/17 11:38:21 (pid:11089) CLASSAD_CACHING is ENABLED
06/28/17 11:38:21 (pid:11089) Daemon Log is logging: D_ALWAYS D_ERROR
06/28/17 11:38:22 (pid:11089) SharedPortEndpoint: waiting for connections to named socket 2156_baae_5
06/28/17 11:38:22 (pid:11089) DaemonCore: command socket at <90.147.168.55:9618?addrs=90.147.168.55-9618&noUDP&sock=2156_baae_5>
06/28/17 11:38:22 (pid:11089) DaemonCore: private command socket at <90.147.168.55:9618?addrs=90.147.168.55-9618&noUDP&sock=2156_baae_5>
06/28/17 11:38:22 (pid:11089) History file rotation is enabled.
06/28/17 11:38:22 (pid:11089)   Maximum history file size is: 1073741824 bytes
06/28/17 11:38:22 (pid:11089)   Number of rotated history files is: 365
06/28/17 11:38:22 (pid:11089) Failed to execute /usr/sbin/condor_shadow.std, ignoring
06/28/17 11:38:22 (pid:11089) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(289.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 289.0 on <90.147.168.249:60611> for group_cms.local.italiano, (shadow pid = 11092)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(291.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 291.0 on <90.147.169.78:44253> for group_cms.local.italiano, (shadow pid = 11095)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(290.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 290.0 on <90.147.169.168:49712> for group_cms.local.italiano, (shadow pid = 11098)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(292.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 292.0 on <90.147.168.147:41189> for group_cms.local.italiano, (shadow pid = 11101)
06/28/17 11:38:27 (pid:11089) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
06/28/17 11:38:27 (pid:11089) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/28/17 11:38:27 (pid:11089) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
****
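
The difference between the two excerpts can be checked mechanically by counting the "Started shadow for job" lines that follow each startup banner. The following is a minimal Python sketch, assuming the log line format shown in the excerpts above. The function name shadows_per_startup and the sample list are hypothetical and only serve as illustration; they are not part of HTCondor.

```python
import re

# Patterns taken from the log excerpts above: the schedd startup banner
# and the line the schedd writes when it starts a shadow for a job.
STARTUP = re.compile(r"\(pid:(\d+)\) \*\* condor_schedd \(CONDOR_SCHEDD\) STARTING UP")
SHADOW = re.compile(r"\(pid:(\d+)\) Started shadow for job (\d+\.\d+)")


def shadows_per_startup(lines):
    """Map each schedd pid seen in a startup banner to the job ids
    for which that same schedd process later logged a started shadow."""
    result = {}
    for line in lines:
        m = STARTUP.search(line)
        if m:
            result.setdefault(m.group(1), [])
            continue
        m = SHADOW.search(line)
        if m and m.group(1) in result:
            result[m.group(1)].append(m.group(2))
    return result


# Hypothetical sample built from lines in the two excerpts.
sample = [
    "06/28/17 11:38:21 (pid:11089) ** condor_schedd (CONDOR_SCHEDD) STARTING UP",
    "06/28/17 11:38:22 (pid:11089) Started shadow for job 289.0 on "
    "<90.147.168.249:60611> for group_cms.local.italiano, (shadow pid = 11092)",
    "06/28/17 05:11:02 (pid:1093541) ** condor_schedd (CONDOR_SCHEDD) STARTING UP",
]

for pid, jobs in shadows_per_startup(sample).items():
    print(pid, len(jobs), jobs)
```

An empty job list for a schedd pid, as for the production instance here, means that no shadow was started after that startup within the lines examined. To run it on a real log, pass an open file object instead of the sample list.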
