I may have spoken too soon about the multi-startd
setup working. I thought my troubles were due to
collisions on the starter log files, but after
implementing the fix Todd suggested I'm still seeing
some bad behaviour (although the fix for the log files
itself worked brilliantly).
It appears that I can only start jobs under one
startd or the other. Not both. The first startd to run
jobs after a Condor restart is the *only* startd that
will run jobs until Condor is restarted again.
For example: I submitted two clusters of jobs. One
targeted the slots on the first startd. The other
targeted the slots on the second startd. If I let the
first cluster start on the S1 startd then the second
cluster would attempt to run on the S2 startd and fail.
And vice versa.
The log output on failure is always the same:
09/07/11 17:07:43 slot1: Got activate_claim request from shadow (<192.168.1.85:3382>)
09/07/11 17:07:43 slot1: Remote job ID is 9.0
09/07/11 17:07:43 Result of "register_subfamily" operation from ProcD: ERROR: The given PID is not part of the family tree
09/07/11 17:07:43 Create_Process: error registering family for pid 1256
09/07/11 17:07:43 ERROR "error registering process family with procd" at line 7917 in file c:\condor\execute\dir_4228\userdir\src\condor_daemon_core.v6\daemon_core.cpp
09/07/11 17:07:43 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/07/11 17:07:43 slot1: State change: No preempting claim, returning to owner
09/07/11 17:07:43 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/07/11 17:07:43 slot1: State change: IS_OWNER is false
09/07/11 17:07:43 slot1: Changing state: Owner -> Unclaimed
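
For anyone who wants to check their own StartLog for the
same signature, here is a minimal Python sketch. It is
not part of Condor; the function names are my own, and it
assumes every log entry starts with a timestamp in the
format shown above. It rejoins entries that a mail client
has wrapped, then lists each failed family registration.

```python
import re

# Assumption: every log entry begins with "MM/DD/YY HH:MM:SS ".
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

# Matches the Create_Process failure line quoted in the log above.
FAILURE = re.compile(
    r"^(\S+ \S+) Create_Process: error registering family for pid (\d+)"
)


def join_entries(lines):
    """Merge wrapped continuation lines back into whole log entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)          # a new entry starts here
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def procd_failures(entries):
    """Return a (timestamp, pid) tuple for each failed registration."""
    return [m.groups() for m in map(FAILURE.match, entries) if m]


sample = [
    "09/07/11 17:07:43 Create_Process: error registering",
    "family for pid 1256",
    "09/07/11 17:07:43 slot1: Changing state and",
    "activity: Claimed/Idle -> Preempting/Killing",
]
print(procd_failures(join_entries(sample)))
```

Running the same two functions over the StartLog of each
startd should show whether the failures really are
confined to whichever startd did not run jobs first.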
It looks like the procd doesn't like the idea of two
startds on the machine. It apparently can't tell them
apart, and it objects to the fact that the jobs being
started on the second startd in this case don't have a
PPID equal to the PID of the first startd.
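
To make that guess concrete, here is a toy Python model
of the behaviour I suspect. It is emphatically not the
procd's actual implementation: the FamilyTree class, its
method, and the startd PIDs 100 and 200 are all
hypothetical, and only PID 1256 and the error text come
from the log. The assumption is that the tracked tree is
rooted at the first startd only.

```python
class FamilyTree:
    """Toy model: tracks the PIDs that descend from a single root PID."""

    def __init__(self, root_pid):
        # Maps each known PID to its parent PID (the root has no parent).
        self.parent = {root_pid: None}

    def register_subfamily(self, pid, ppid):
        """Accept pid only if its parent is already part of the tree."""
        if ppid not in self.parent:
            return "ERROR: The given PID is not part of the family tree"
        self.parent[pid] = ppid
        return "SUCCESS"


# Hypothetical PIDs: the first startd is 100, the second startd is 200.
tree = FamilyTree(root_pid=100)

# A starter spawned by the first startd registers without trouble.
print(tree.register_subfamily(pid=300, ppid=100))

# A starter spawned by the second startd is rejected, because PID 200
# was never made part of the tree.
print(tree.register_subfamily(pid=1256, ppid=200))
```

If something like this is what is going on, the second
startd would need to be known to the procd in its own
right before any of its children could be registered.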
I'm either missing something procd-specific in my
startd config, or the procd isn't going to work here.
I'll try disabling the procd, but having it there has
helped with the scalability issues I'm trying to
overcome, so if I can make this work with the procd in
place I'd be a whole lot happier.