[HTCondor-users] Daemons getting killed after boot startup?



Hi all,

I have set up a fresh scheduler [1] whose daemons always seem to get
killed shortly after they are started during a reboot. As far as I can
see, the daemons [2,3] get a SIGQUIT from DaemonCore [4]; however, I do
not understand what triggered the actual shutdown [5].

After manually (re)starting the condor service, the daemons run
stably, so I wonder: why do they get killed reproducibly after their
first start following each reboot?

Cheers, and thanks for any ideas,
  Thomas

[1]
condor-external-libs-8.6.8-2.el7.x86_64
condor-python-8.6.8-2.el7.x86_64
condor-8.6.8-2.el7.x86_64
condor-classads-8.6.8-2.el7.x86_64
condor-procd-8.6.8-2.el7.x86_64

[2]
> Master aka PID:2358
>> MasterLog
...
03/26/18 11:08:06 Started DaemonCore process
"/usr/libexec/condor/condor_defrag", pid and pgroup = 2406
03/26/18 11:08:47 Got SIGQUIT.  Performing fast shutdown.
03/26/18 11:08:47 Sent SIGQUIT to DEFRAG (pid 2406)
03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405, status 0.
03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2406, status 0.
03/26/18 11:08:47 The DEFRAG (pid 2406) exited with status 0
03/26/18 11:08:47 Sent SIGTERM to SHARED_PORT (pid 2398)
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2398, status 0.
03/26/18 11:08:47 The SHARED_PORT (pid 2398) exited with status 0
03/26/18 11:08:47 All daemons are gone.  Exiting.
03/26/18 11:08:47 **** condor_master (condor_MASTER) pid 2358 EXITING
WITH STATUS 0

[3]
> Sched aka PID:2405
>> SchedLog
...
03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m I/O load:
0 bytes/s  0.000 disk load  0.000 net load
03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast shutdown.
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) All shadows have been killed, exiting.
03/26/18 11:08:47 (pid:2405) **** condor_schedd (condor_SCHEDD) pid 2405
EXITING WITH STATUS 0
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs

[4]
> ProcLog
...
03/26/18 11:08:05 : no methods have determined process 2131 to be in a
monitored family
03/26/18 11:08:05 : ...snapshot complete
03/26/18 11:08:05 : PROC_FAMILY_REGISTER_SUBFAMILY
03/26/18 11:08:05 : taking a snapshot...
03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
(already determined)
03/26/18 11:08:05 : ...snapshot complete
03/26/18 11:08:05 : moving process 2398 into new subfamily 2398
03/26/18 11:08:05 : new subfamily registered: root = 2398, watcher = 2358
03/26/18 11:08:05 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
03/26/18 11:08:06 : PROC_FAMILY_REGISTER_SUBFAMILY
03/26/18 11:08:06 : taking a snapshot...
03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
(already determined)
...
03/26/18 11:08:06 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : process 2406 (of family 2406) has exited
03/26/18 11:08:47 : process 2405 (of family 2405) has exited
03/26/18 11:08:47 : process 1982 (not in monitored family) has exited
03/26/18 11:08:47 : process 1763 (not in monitored family) has exited
03/26/18 11:08:47 : process 1738 (not in monitored family) has exited
03/26/18 11:08:47 : process 1310 (not in monitored family) has exited
03/26/18 11:08:47 : process 542 (not in monitored family) has exited
03/26/18 11:08:47 : no methods have determined process 2413 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 2416 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 2417 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 2697 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 2745 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 2964 to be in a
monitored family
03/26/18 11:08:47 : no methods have determined process 3004 to be in a
monitored family
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2405
03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
03/26/18 11:08:47 : unregistering family with root pid 2405
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2406
03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
03/26/18 11:08:47 : unregistering family with root pid 2406
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : process 2398 (of family 2398) has exited
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2398
03/26/18 11:08:47 : PROC_FAMILY_QUIT

[5]
> Sched aka PID:2405
>> grep "2405" ./*
./MasterLog:03/26/18 11:08:06 Started DaemonCore process
"/usr/sbin/condor_schedd", pid and pgroup = 2405
./MasterLog:03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
./MasterLog:03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405,
status 0.
./MasterLog:03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2405
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2405 (already determined)
./ProcLog:03/26/18 11:08:06 : moving process 2405 into new subfamily 2405
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
watcher = 2358
./ProcLog:03/26/18 11:08:47 : process 2405 (of family 2405) has exited
./ProcLog:03/26/18 11:08:47 : sending signal 9 to family with root 2405
./ProcLog:03/26/18 11:08:47 : unregistering family with root pid 2405
./SchedLog:03/26/18 11:08:06 (pid:2405) Setting maximum file descriptors
to 4096.
./SchedLog:03/26/18 11:08:06 (pid:2405)
******************************************************
./SchedLog:03/26/18 11:08:06 (pid:2405) ** condor_schedd (CONDOR_SCHEDD)
STARTING UP
./SchedLog:03/26/18 11:08:06 (pid:2405) ** /usr/sbin/condor_schedd
./SchedLog:03/26/18 11:08:06 (pid:2405) ** SubsystemInfo: name=SCHEDD
type=SCHEDD(5) class=DAEMON(1)
./SchedLog:03/26/18 11:08:06 (pid:2405) ** Configuration:
subsystem:SCHEDD local:<NONE> class:DAEMON
./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorVersion: 8.6.8 Nov 13
2017 BuildID: 424045 $
./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorPlatform: x86_64_RedHat7 $
./SchedLog:03/26/18 11:08:06 (pid:2405) ** PID = 2405
./SchedLog:03/26/18 11:08:06 (pid:2405) ** Log last touched 3/26 11:06:38
./SchedLog:03/26/18 11:08:06 (pid:2405)
******************************************************
./SchedLog:03/26/18 11:08:06 (pid:2405) Using config source:
/etc/condor/condor_config
./SchedLog:03/26/18 11:08:06 (pid:2405) Using local config sources:
./SchedLog:03/26/18 11:08:06 (pid:2405)
/etc/condor/config.d/00arc_ce.conf
./SchedLog:03/26/18 11:08:06 (pid:2405)
/etc/condor/config.d/02submitd.conf
./SchedLog:03/26/18 11:08:06 (pid:2405)
/etc/condor/config.d/04defragd.conf
./SchedLog:03/26/18 11:08:06 (pid:2405)    /etc/condor/condor_config.local
./SchedLog:03/26/18 11:08:06 (pid:2405) config Macros = 98, Sorted = 98,
StringBytes = 4110, TablesBytes = 3600
./SchedLog:03/26/18 11:08:06 (pid:2405) CLASSAD_CACHING is ENABLED
./SchedLog:03/26/18 11:08:06 (pid:2405) Daemon Log is logging: D_ALWAYS
D_ERROR
./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
connections to named socket 2358_f868_3
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
socket at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) History file rotation is enabled.
./SchedLog:03/26/18 11:08:06 (pid:2405)   Maximum history file size is:
50000000 bytes
./SchedLog:03/26/18 11:08:06 (pid:2405)   Number of rotated history
files is: 5
./SchedLog:03/26/18 11:08:06 (pid:2405) my_popenv: Failed to exec in
child, errno=2 (No such file or directory)
./SchedLog:03/26/18 11:08:06 (pid:2405) Failed to execute
/usr/sbin/condor_shadow.std, ignoring
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager stats:
active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager upload 1m
I/O load: 0 bytes/s  0.000 disk load  0.000 net load
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m
I/O load: 0 bytes/s  0.000 disk load  0.000 net load
./SchedLog:03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast
shutdown.
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) All shadows have been killed,
exiting.
./SchedLog:03/26/18 11:08:47 (pid:2405) **** condor_schedd
(condor_SCHEDD) pid 2405 EXITING WITH STATUS 0
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs

[6]
> Master aka PID:2358
>> grep 2358 ./*
./MasterLog:03/26/18 11:08:04 ** PID = 2358
./MasterLog:03/26/18 11:08:05 SharedPortEndpoint: waiting for
connections to named socket 2358_f868
./MasterLog:03/26/18 11:08:05 DaemonCore: private command socket at
<131.169.223.234:0?sock=2358_f868>
./MasterLog:03/26/18 11:08:47 **** condor_master (condor_MASTER) pid
2358 EXITING WITH STATUS 0
./ProcLog:03/26/18 11:08:05 : Procd has a watcher pid and will die if
pid 2358 dies.
./ProcLog:03/26/18 11:08:05 : method PID: found family 2358 for process 2358
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
process 2397
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
process 2397 (already determined)
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
process 2398
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
process 2398 (already determined)
./ProcLog:03/26/18 11:08:05 : new subfamily registered: root = 2398,
watcher = 2358
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2405
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2405 (already determined)
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
watcher = 2358
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2406
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
process 2406 (already determined)
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2406,
watcher = 2358
./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
connections to named socket 2358_f868_3
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
socket at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
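
To line up the excerpts above around the shutdown, something like the
following sketch could merge the three logs into one timeline. It only
assumes the default "MM/DD/YY HH:MM:SS" line prefix visible in the
excerpts; the log directory /var/log/condor and the helper names
(parse_lines, merged_timeline, window) are my own assumptions and are
not part of HTCondor.

```python
import re
from datetime import datetime
from pathlib import Path

# Timestamp prefix as seen in the excerpts, e.g. "03/26/18 11:08:47 ..."
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s?(.*)$")


def parse_lines(name, lines):
    """Return (timestamp, log name, message) tuples for one log.

    Lines without a timestamp are treated as continuations of the
    previous entry (this also rejoins lines wrapped by a mail client).
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = STAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            entries.append([ts, name, match.group(2)])
        elif entries and line.strip():
            entries[-1][2] += " " + line.strip()
    return [tuple(e) for e in entries]


def merged_timeline(logs):
    """Merge several logs (dict: name -> lines) into one list sorted by time."""
    merged = []
    for name, lines in logs.items():
        merged.extend(parse_lines(name, lines))
    merged.sort(key=lambda entry: entry[0])  # stable: keeps per-log order
    return merged


def window(entries, needle, seconds=60):
    """Entries from up to `seconds` before the first message containing
    `needle`, through the end of that same second."""
    hits = [e for e in entries if needle in e[2]]
    if not hits:
        return []
    t0 = hits[0][0]
    return [e for e in entries if 0 <= (t0 - e[0]).total_seconds() <= seconds]


if __name__ == "__main__":
    log_dir = Path("/var/log/condor")  # assumed location, adjust as needed
    logs = {}
    for name in ("MasterLog", "SchedLog", "ProcLog"):
        path = log_dir / name
        if path.exists():
            logs[name] = path.read_text(errors="replace").splitlines()
    for ts, name, msg in window(merged_timeline(logs), "Got SIGQUIT"):
        print(ts, name, msg)
```

Run on the scheduler after a reboot, this would print everything the
three daemons logged in the minute before the first "Got SIGQUIT"
line. It cannot show who sent the signal to the master; it only makes
the ordering across the logs easier to read.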
