[Condor-users] STARTD_CRON stops running



Hello Condor users & experts,

I am using STARTD_CRON to run periodic health checks on worker nodes.
However, on some of the nodes the script has stopped logging any output,
and those nodes no longer detect unhealthy states.  According to
condor_config_val -dump, the CRON settings are in place:

CRON_JOBLIST = nodecheck
CRON_NODECHECK_EXECUTABLE = /usr/local/sbin/condor_node_check.sh
CRON_NODECHECK_KILL = true
CRON_NODECHECK_MODE = periodic
CRON_NODECHECK_PERIOD = 15m
CRON_NODECHECK_RECONFIG = false
STARTD_CRON_NAME = CRON

The script is world-executable, and the log file is world-writable. I am
running Condor 7.6.0-1.

I wonder whether I am being affected by the following bug:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2437
Is there any way to expose the current value of CRON_*_SENT?  I see many
instances of this message in StartLog:
StartLog.old:11/01/11 05:08:27 CronJob: Job 'nodecheck' not idle!
Is there a way to reset CRON_*_SENT without killing running jobs?
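
To gauge how often the "not idle" message appears and when it first showed up, something like the following could tally the StartLog lines per job. This is only a rough sketch: the regular expression assumes every such line has the same format as the one quoted above, and the function name tally_not_idle is made up for illustration, not part of Condor.

```python
import re
from collections import Counter

# Assumed line format, taken from the single StartLog line quoted above:
#   MM/DD/YY HH:MM:SS CronJob: Job '<name>' not idle!
# A leading "StartLog.old:" prefix (as added by grep) is tolerated because
# the pattern is searched for anywhere in the line.
NOT_IDLE = re.compile(
    r"(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) CronJob: Job '([^']+)' not idle!"
)


def tally_not_idle(lines):
    """Return (counts, first_seen, last_seen), each keyed by cron job name."""
    counts = Counter()
    first_seen = {}
    last_seen = {}
    for line in lines:
        match = NOT_IDLE.search(line)
        if match is None:
            continue  # not a "not idle" message, skip it
        date, time, job = match.groups()
        stamp = f"{date} {time}"
        counts[job] += 1
        first_seen.setdefault(job, stamp)  # keep only the earliest occurrence
        last_seen[job] = stamp             # overwritten until the final one
    return counts, first_seen, last_seen


# Example using the line from my StartLog.old:
sample = ["StartLog.old:11/01/11 05:08:27 CronJob: Job 'nodecheck' not idle!"]
counts, first, last = tally_not_idle(sample)
print(counts["nodecheck"], first["nodecheck"], last["nodecheck"])
```

On a real node the input would be the lines of StartLog.old followed by StartLog, in that order, so that the first timestamp reported for 'nodecheck' roughly marks when the job stopped returning to idle.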

--Sarah