
Re: [HTCondor-users] Job disconnected



On Aug 24, 2017, at 2:40 AM, Hervé Lemaitre <herve.lemaitre@xxxxxxxxx> wrote:

After a deeper look, it seems related to the condor.service on ubuntu:

 ######################################################################
[Unit]
Description=Condor Distributed High-Throughput-Computing
After=syslog.target network-online.target nslcd.service ypbind.service autofs.service
Wants=network-online.target
# Disabled until HTCondor security fixed.
# Requires=condor.socket

[Service]
Environment=CONDOR_CONFIG=/condor/install_centos/etc/condor_config
ExecStart=/condor/install_ubuntu/sbin/condor_master -f
ExecStop=/condor/install_ubuntu/sbin/condor_off -master
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=90seconds
StandardOutput=syslog
NotifyAccess=main
KillSignal=SIGKILL
LimitNOFILE=16384

[Install]
WantedBy=multi-user.target
######################################################################

WatchdogSec is set to 20 minutes, and it seems no keep-alive ping is sent during this interval, which forces the service to restart. I'm not familiar with systemd, but my first idea is to remove WatchdogSec. I took the condor.service file from the examples directory of the repository, but maybe it is not the right one.
My central manager is on CentOS with the same condor.service and does not have this problem of restarting every 20 minutes.

Ah, that appears to be the culprit. On Ubuntu 14, Condor doesn't perform the systemd watchdog alive messages, as systemd is not the default init system there. We should change the code to always check for systemd at startup (on all distros) and send alive messages if present. For now, you can modify the WatchdogSec setting for Condor in your systemd config.
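
One way to apply that workaround without editing the shipped unit file is a systemd drop-in. This is only a sketch: it assumes the unit is installed under the name condor.service, and the drop-in path and file name below are illustrative, not taken from the HTCondor packaging.

```ini
# /etc/systemd/system/condor.service.d/watchdog.conf
# (hypothetical drop-in location and file name; adjust to your install)
[Service]
# A value of 0 disables the systemd watchdog for this unit, so a
# condor_master that never sends keep-alive notifications is no
# longer restarted when the 20-minute interval expires.
WatchdogSec=0
```

After creating the file, run `systemctl daemon-reload` and restart the service so the new setting takes effect. Removing the WatchdogSec line from the unit file itself should have the same effect, since the watchdog is off by default.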

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project