
Re: [HTCondor-users] Daemons getting killed after boot startup?



Hi Todd,

thanks for the info! Judging from the periodicity in the journal, it
looks very much like it could be that bug [1] ;)

In contrast to the recently spawned sched with condor-8.6.8-2, I do
not see the issue on a sibling with condor-8.6.8-1.
However, the working sched has also recently had its systemd packages
updated to systemd-*-219-42.el7_4.10 - but I have not rebooted it since to
pick them up... So there might also be some convoluted dependency on the
systemd version??

Anyway, I updated the machine to 8.6.10-1 and will keep an eye on it ;)

Cheers and thanks,
  Thomas

[1]
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Started Condor Distributed
High-Throughput-Computing.
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Starting Condor
Distributed High-Throughput-Computing...
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopping Condor
Distributed High-Throughput-Computing...
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopped Condor Distributed
High-Throughput-Computing.
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Started Condor Distributed
High-Throughput-Computing.
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Starting Condor
Distributed High-Throughput-Computing...
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopping Condor
Distributed High-Throughput-Computing...
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopped Condor Distributed
High-Throughput-Computing.
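
A minimal Python sketch for checking that periodicity, assuming the journal
lines above are first unwrapped to one entry per line. The function name
`uptimes` and the hard-coded year are illustrative choices (the journal's
short timestamps carry no year), and `%b` parsing assumes an English/C
locale. It only measures how long the service stayed up between each
"Started" and the following "Stopping" entry; it does not prove which
component sent the signal.

```python
from datetime import datetime

# Journal excerpt from [1], with the mail client's line wrapping undone.
JOURNAL = """\
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
"""


def uptimes(journal_text, year=2018):
    """Return the seconds between each 'Started' line and the next 'Stopping' line."""
    started = None
    result = []
    for line in journal_text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or truncated lines
        # Fields: month, day, time, host, unit tag, then the first word of the message.
        stamp = datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
        event = parts[5]
        if event == "Started":
            started = stamp
        elif event == "Stopping" and started is not None:
            result.append((stamp - started).total_seconds())
            started = None
    return result


if __name__ == "__main__":
    for seconds in uptimes(JOURNAL):
        print(f"service stayed up for {seconds / 60:.1f} minutes")
```

Near-identical uptimes across restarts point to a timer rather than a random
crash; comparing them with the timestamps in the MasterLog would be the next
step.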


On 2018-03-26 15:26, Todd Tannenbaum wrote:
> 
> 
> On Mar 26, 2018, at 6:57 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx
> <mailto:thomas.hartmann@xxxxxxx>> wrote:
> 
>> Hi all,
>>
>> I have spawned a fresh scheduler [1] whose daemons always seem to
>> get killed shortly after they are created during a reboot. AFAIS,
>> the daemons [2,3] get a SIGQUIT from the daemon core [4] - however, I do
>> not get why it triggered the actual shutdown [5].
>>
>> After manually (re)starting condor's service, the daemons run
>> stably, so I wonder why they get killed reproducibly after their first
>> start following reboots?
>>
>> Cheers and thanks for ideas,
>>   Thomas
> 
> Hi Thomas,
> 
> Something is sending the condor_master a SIGQUIT signal, which results
> in the master shutting down everything.
> 
> I wonder if you are being hit by this bug which was fixed in HTCondor
> v8.6.9:
> 
>    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6476
> 
> In v8.6.8 and earlier, systemd would send a SIGQUIT to the master 20
> minutes (by default) after either a condor_restart or after the
> condor_master binary was touched/changed. To confirm, it would be
> useful to see more of your MasterLog, esp. for the 25 minutes before it
> receives the SIGQUIT. And/or check your systemd logs. Or just upgrade
> and see if it goes away :)
> 
> Best regards,
> Todd
> 
> 
>>
>> [1]
>> condor-external-libs-8.6.8-2.el7.x86_64
>> condor-python-8.6.8-2.el7.x86_64
>> condor-8.6.8-2.el7.x86_64
>> condor-classads-8.6.8-2.el7.x86_64
>> condor-procd-8.6.8-2.el7.x86_64
>>
>> [2]
>>> Master aka PID:2358
>>>> MasterLog
>> ...
>> 03/26/18 11:08:06 Started DaemonCore process
>> "/usr/libexec/condor/condor_defrag", pid and pgroup = 2406
>> 03/26/18 11:08:47 Got SIGQUIT.  Performing fast shutdown.
>> 03/26/18 11:08:47 Sent SIGQUIT to DEFRAG (pid 2406)
>> 03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405, status 0.
>> 03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2406, status 0.
>> 03/26/18 11:08:47 The DEFRAG (pid 2406) exited with status 0
>> 03/26/18 11:08:47 Sent SIGTERM to SHARED_PORT (pid 2398)
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2398, status 0.
>> 03/26/18 11:08:47 The SHARED_PORT (pid 2398) exited with status 0
>> 03/26/18 11:08:47 All daemons are gone.  Exiting.
>> 03/26/18 11:08:47 **** condor_master (condor_MASTER) pid 2358 EXITING
>> WITH STATUS 0
>>
>> [3]
>> Sched aka PID:2405
>>>> SchedLog
>> ...
>> 03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m I/O load:
>> 0 bytes/s  0.000 disk load  0.000 net load
>> 03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast shutdown.
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) All shadows have been killed, exiting.
>> 03/26/18 11:08:47 (pid:2405) **** condor_schedd (condor_SCHEDD) pid 2405
>> EXITING WITH STATUS 0
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>>
>> [4]
>>> ProcLog
>> ...
>> 03/26/18 11:08:05 : no methods have determined process 2131 to be in a
>> monitored family
>> 03/26/18 11:08:05 : ...snapshot complete
>> 03/26/18 11:08:05 : PROC_FAMILY_REGISTER_SUBFAMILY
>> 03/26/18 11:08:05 : taking a snapshot...
>> 03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
>> 03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
>> (already determined)
>> 03/26/18 11:08:05 : ...snapshot complete
>> 03/26/18 11:08:05 : moving process 2398 into new subfamily 2398
>> 03/26/18 11:08:05 : new subfamily registered: root = 2398, watcher = 2358
>> 03/26/18 11:08:05 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
>> 03/26/18 11:08:06 : PROC_FAMILY_REGISTER_SUBFAMILY
>> 03/26/18 11:08:06 : taking a snapshot...
>> 03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
>> 03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
>> (already determined)
>> ...
>> 03/26/18 11:08:06 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : process 2406 (of family 2406) has exited
>> 03/26/18 11:08:47 : process 2405 (of family 2405) has exited
>> 03/26/18 11:08:47 : process 1982 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1763 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1738 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1310 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 542 (not in monitored family) has exited
>> 03/26/18 11:08:47 : no methods have determined process 2413 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2416 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2417 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2697 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2745 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2964 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 3004 to be in a
>> monitored family
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2405
>> 03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
>> 03/26/18 11:08:47 : unregistering family with root pid 2405
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2406
>> 03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
>> 03/26/18 11:08:47 : unregistering family with root pid 2406
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : process 2398 (of family 2398) has exited
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2398
>> 03/26/18 11:08:47 : PROC_FAMILY_QUIT
>>
>> [5]
>>> Sched aka PID:2405
>>>> grep "2405" ./*
>> ./MasterLog:03/26/18 11:08:06 Started DaemonCore process
>> "/usr/sbin/condor_schedd", pid and pgroup = 2405
>> ./MasterLog:03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
>> ./MasterLog:03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405,
>> status 0.
>> ./MasterLog:03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : moving process 2405 into new subfamily 2405
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:47 : process 2405 (of family 2405) has exited
>> ./ProcLog:03/26/18 11:08:47 : sending signal 9 to family with root 2405
>> ./ProcLog:03/26/18 11:08:47 : unregistering family with root pid 2405
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Setting maximum file descriptors
>> to 4096.
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> ******************************************************
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** condor_schedd (CONDOR_SCHEDD)
>> STARTING UP
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** /usr/sbin/condor_schedd
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** SubsystemInfo: name=SCHEDD
>> type=SCHEDD(5) class=DAEMON(1)
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** Configuration:
>> subsystem:SCHEDD local:<NONE> class:DAEMON
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorVersion: 8.6.8 Nov 13
>> 2017 BuildID: 424045 $
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorPlatform:
>> x86_64_RedHat7 $
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** PID = 2405
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** Log last touched 3/26 11:06:38
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> ******************************************************
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Using config source:
>> /etc/condor/condor_config
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Using local config sources:
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/00arc_ce.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/02submitd.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/04defragd.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)    /etc/condor/condor_config.local
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) config Macros = 98, Sorted = 98,
>> StringBytes = 4110, TablesBytes = 3600
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) CLASSAD_CACHING is ENABLED
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Daemon Log is logging: D_ALWAYS
>> D_ERROR
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868_3
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
>> socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) History file rotation is enabled.
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)   Maximum history file size is:
>> 50000000 bytes
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)   Number of rotated history
>> files is: 5
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) my_popenv: Failed to exec in
>> child, errno=2 (No such file or directory)
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Failed to execute
>> /usr/sbin/condor_shadow.std, ignoring
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager stats:
>> active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager upload 1m
>> I/O load: 0 bytes/s  0.000 disk load  0.000 net load
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m
>> I/O load: 0 bytes/s  0.000 disk load  0.000 net load
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast
>> shutdown.
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) All shadows have been killed,
>> exiting.
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) **** condor_schedd
>> (condor_SCHEDD) pid 2405 EXITING WITH STATUS 0
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>>
>> [6]
>>> Master aka PID:2358
>>>> grep 2358 ./*
>> ./MasterLog:03/26/18 11:08:04 ** PID = 2358
>> ./MasterLog:03/26/18 11:08:05 SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868
>> ./MasterLog:03/26/18 11:08:05 DaemonCore: private command socket at
>> <131.169.223.234:0?sock=2358_f868>
>> ./MasterLog:03/26/18 11:08:47 **** condor_master (condor_MASTER) pid
>> 2358 EXITING WITH STATUS 0
>> ./ProcLog:03/26/18 11:08:05 : Procd has a watcher pid and will die if
>> pid 2358 dies.
>> ./ProcLog:03/26/18 11:08:05 : method PID: found family 2358 for
>> process 2358
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2397
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2397 (already determined)
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2398
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2398 (already determined)
>> ./ProcLog:03/26/18 11:08:05 : new subfamily registered: root = 2398,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2406
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2406 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2406,
>> watcher = 2358
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868_3
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
>> socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>>
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> <mailto:htcondor-users-request@xxxxxxxxxxx> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature