I may have spoken too soon about the
multi-startd setup working. I thought my
troubles were due to collisions on the
starter log files. The fix Todd suggested
worked brilliantly for the log files, but
I'm still seeing some bad behaviour.
It appears that I can only start jobs
under one startd or the other. Not both.
The first startd to run jobs after a
Condor restart is the *only* startd that
will run jobs until Condor is restarted
again.
For example: I submitted two clusters
of jobs. One targeted the slots on the
first startd. The other targeted the slots
on the second startd. If I let the first
cluster start on the S1 startd then the
second cluster would attempt to run on the
S2 startd and fail. And vice versa.
The log output on failure is always the
same:
09/07/11 17:07:43 slot1: Got activate_claim request from shadow (<192.168.1.85:3382>)
09/07/11 17:07:43 slot1: Remote job ID is 9.0
09/07/11 17:07:43 Result of "register_subfamily" operation from ProcD: ERROR: The given PID is not part of the family tree
09/07/11 17:07:43 Create_Process: error registering family for pid 1256
09/07/11 17:07:43 ERROR "error registering process family with procd" at line 7917 in file c:\condor\execute\dir_4228\userdir\src\condor_daemon_core.v6\daemon_core.cpp
09/07/11 17:07:43 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/07/11 17:07:43 slot1: State change: No preempting claim, returning to owner
09/07/11 17:07:43 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/07/11 17:07:43 slot1: State change: IS_OWNER is false
09/07/11 17:07:43 slot1: Changing state: Owner -> Unclaimed
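Log excerpts pasted into mail are often hard-wrapped, which makes the one failing entry harder to spot. A minimal Python sketch for rejoining wrapped entries and picking out the register_subfamily error follows. It assumes every entry begins with a MM/DD/YY HH:MM:SS timestamp, as in the excerpt above; the function names are hypothetical helpers, not part of Condor.

```python
import re

# Assumption: each log entry starts with a timestamp such as "09/07/11 17:07:43 ".
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def join_wrapped_entries(text):
    """Rejoin hard-wrapped log text into one string per log entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            # A timestamp marks the start of a new entry.
            entries.append(line)
        else:
            # Anything else is a continuation of the previous entry.
            entries[-1] += " " + line
    return entries


def find_procd_failures(entries):
    """Return the entries that report a failed register_subfamily call."""
    return [e for e in entries if "register_subfamily" in e and "ERROR" in e]
```

Running find_procd_failures(join_wrapped_entries(log_text)) over a saved copy of the startd log would list only the entries where the procd rejected the registration.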
It looks like the procd doesn't like
the idea of two startds on the machine. It
appears it can't tell them apart, and it
doesn't like the fact that the jobs being
started on the second startd in this case
don't have a PPID equal to the PID of the
first startd.
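To make that hypothesis concrete, here is a toy Python model of a "family tree" membership check. It is only a sketch of the reasoning above, not the procd's actual algorithm. PID 1256 comes from the log; every other PID and the in_family helper are made up for illustration.

```python
def in_family(pid, root_pid, parent_of):
    """Walk up the parent chain from pid; True if root_pid is reached."""
    seen = set()  # guards against cycles in a malformed table
    while pid in parent_of and pid not in seen:
        if pid == root_pid:
            return True
        seen.add(pid)
        pid = parent_of[pid]
    return pid == root_pid


# Hypothetical process table (child PID -> parent PID):
# master = 4, first startd (S1) = 800, second startd (S2) = 900,
# and 1256 is the process being registered, spawned under S2.
parent_of = {800: 4, 900: 4, 1256: 900}

print(in_family(1256, 900, parent_of))  # 1256 checked against S2's family
print(in_family(1256, 800, parent_of))  # 1256 checked against S1's family
```

In this model 1256 belongs to the S2 family but not to the S1 family. If the procd were only tracking the family rooted at the first startd, a registration request for a child of the second startd would be rejected in just this way.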
I'm either missing something that's
procd-specific in my startd config, or the
procd isn't going to work here. I'll try
disabling the procd, but having it there
has helped with the scalability issues I'm
trying to overcome, so I'd be a whole lot
happier if I could make this work with the
procd in place.