
[Condor-users] Independent condor configuration files.



I have a couple of condor pools configured
with all of their condor configuration files on the local
disk of each machine.  There are a lot of compute nodes running
only master and startd.  This morning, after a power outage,
the compute nodes came up before the node that runs the
collector/negotiator.  They started the master and startd all
right; the StartLog looks like this:

6/7 07:29:10 ******************************************************
6/7 07:29:10 ** condor_startd (CONDOR_STARTD) STARTING UP
6/7 07:29:10 ** /opt/condor-6.7.18/sbin/condor_startd
6/7 07:29:10 ** $CondorVersion: 6.7.18 Mar 22 2006 $
6/7 07:29:10 ** $CondorPlatform: I386-LINUX_RH9 $
6/7 07:29:10 ** PID = 3106
6/7 07:29:10 ******************************************************
6/7 07:29:10 Using config file: /etc/condor/condor_config
6/7 07:29:10 Using local config files: /opt/condor/etc/group_params.config /opt/condor/local/condor_config.local
6/7 07:29:10 DaemonCore: Command Socket at <131.225.167.91:32771>
6/7 07:32:39 vm1: New machine resource allocated
6/7 07:32:39 vm2: New machine resource allocated
6/7 07:32:39 About to run initial benchmarks.
6/7 07:32:44 Completed initial benchmarks.
6/7 07:32:44 vm1: State change: IS_OWNER is false
6/7 07:32:44 vm1: Changing state: Owner -> Unclaimed
6/7 07:32:44 vm2: State change: IS_OWNER is false
6/7 07:32:44 vm2: Changing state: Owner -> Unclaimed
6/7 07:32:48 vm1: Error sending update to collector(s)
6/7 07:32:49 vm2: Error sending update to collector(s)
6/7 07:37:48 vm1: Error sending update to collector(s)
6/7 07:37:49 vm2: Error sending update to collector(s)
6/7 07:42:48 vm1: Error sending update to collector(s)
6/7 07:42:49 vm2: Error sending update to collector(s)
6/7 07:47:48 vm1: Error sending update to collector(s)
6/7 07:47:49 vm2: Error sending update to collector(s)
6/7 07:52:48 vm1: Error sending update to collector(s)


and so forth.  These errors sending updates to the collector
continued well after the collector was back up and working.
I had to stop and restart condor on all of these nodes
to get them to show up in condor_status on the collector.
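For reference, the restart step can be scripted from the central manager rather than logging into each node, assuming administrator access is allowed from that host (HOSTALLOW_ADMINISTRATOR).  This is only a sketch; the node names below are placeholders, not the actual hosts in this pool:

```shell
#!/bin/sh
# Sketch: restart just the startd on each compute node remotely,
# assuming administrator commands are permitted from this host.
# NODES is a hypothetical placeholder list.
NODES="node01 node02 node03"

restart_cmds() {
    for n in $NODES; do
        # condor_restart -startd restarts the startd on the named host
        echo "condor_restart -startd -name $n"
    done
}

# Print the per-node commands (pipe to sh to actually run them):
restart_cmds
```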

Is this expected behavior?  Has anyone configured a pool like
this so that the compute nodes recover on their own, without
having to restart condor on every node after such an outage?

Steve Timm


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team