
Re: [Condor-users] Procd behaving badly in a multi-startd setup



On Wednesday, 7 September, 2011 at 5:15 PM, Ian Chesal wrote:
I may have spoken too quickly on the multi-startd setup working. I thought my troubles were due to collisions on the starter log files, but after implementing the fix Todd suggested I'm still seeing some bad behaviour (but the fix for the log files worked brilliantly).

It appears that I can only start jobs under one startd or the other. Not both. The first startd to run jobs after a Condor restart is the *only* startd that will run jobs until Condor is restarted again.

For example: I submitted two clusters of jobs. One targeted the slots on the first startd (S1); the other targeted the slots on the second startd (S2). If I let the first cluster start on S1, the second cluster would then attempt to run on S2 and fail. And vice versa.

The log output on failure is always the same:

09/07/11 17:07:43 slot1: Got activate_claim request from shadow (<192.168.1.85:3382>)
09/07/11 17:07:43 slot1: Remote job ID is 9.0
09/07/11 17:07:43 Result of "register_subfamily" operation from ProcD: ERROR: The given PID is not part of the family tree
09/07/11 17:07:43 Create_Process: error registering family for pid 1256
09/07/11 17:07:43 ERROR "error registering process family with procd" at line 7917 in file c:\condor\execute\dir_4228\userdir\src\condor_daemon_core.v6\daemon_core.cpp
09/07/11 17:07:43 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/07/11 17:07:43 slot1: State change: No preempting claim, returning to owner
09/07/11 17:07:43 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/07/11 17:07:43 slot1: State change: IS_OWNER is false
09/07/11 17:07:43 slot1: Changing state: Owner -> Unclaimed
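
Since the failure signature is identical each time, a small script can pull the affected PIDs out of a log. The following is a minimal sketch in Python, assuming the log lines keep the exact layout shown above; the function name and the sample list are illustrative and not part of Condor.

```python
import re

# Matches the "Create_Process" failure line from the excerpt above.
# The timestamp layout (MM/DD/YY HH:MM:SS) is assumed from that excerpt.
PROCD_ERROR = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Create_Process: error registering family for pid (?P<pid>\d+)"
)


def find_procd_failures(lines):
    """Return (timestamp, pid) pairs for each family-registration failure."""
    hits = []
    for line in lines:
        match = PROCD_ERROR.match(line)
        if match:
            hits.append((match.group("ts"), int(match.group("pid"))))
    return hits


# Two lines copied from the log excerpt above.
sample = [
    "09/07/11 17:07:43 slot1: Remote job ID is 9.0",
    "09/07/11 17:07:43 Create_Process: error registering family for pid 1256",
]
print(find_procd_failures(sample))
```

Run against the log of each startd, this should show which startd's jobs are being rejected by the procd.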

It looks like the procd doesn't like the idea of two startds on the machine. It apparently can't tell them apart, and it objects to the fact that the jobs being started on the second startd don't have a PPID equal to the PID of the first startd.
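
To make that guess concrete, here is a toy model in Python of a tracker that only knows one process tree. It is not the procd's actual code: the class, its methods, and the PIDs are all hypothetical, and it only illustrates why a process started under a second, unknown parent would be rejected with the error seen in the log.

```python
# Toy model only. This is NOT how Condor's procd is implemented;
# every name and PID here is made up for illustration.

class ToyProcD:
    def __init__(self, root_pid):
        # Track a single family tree rooted at one PID.
        self.parent = {root_pid: None}

    def note_child(self, pid, ppid):
        """Record a process only if its parent is already in the tree."""
        if ppid in self.parent:
            self.parent[pid] = ppid

    def register_subfamily(self, pid):
        """Succeed only for PIDs that descend from the tracked root."""
        if pid not in self.parent:
            return "ERROR: The given PID is not part of the family tree"
        return "SUCCESS"


S1_PID, S2_PID = 100, 200            # hypothetical PIDs of the two startds
procd = ToyProcD(root_pid=S1_PID)    # the tree is rooted at the first startd
procd.note_child(101, ppid=S1_PID)   # process started under S1: tracked
procd.note_child(201, ppid=S2_PID)   # process started under S2: parent unknown

print(procd.register_subfamily(101))
print(procd.register_subfamily(201))
```

In this model, whichever startd the tree is rooted at is the only one whose jobs can register, which matches the "first startd wins" behaviour described above. Whether the real procd works this way is exactly the open question.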

I'm either missing something that's procd-specific in my startd config, or the procd isn't going to work here. I'll try disabling the procd, but having it there has helped with the scalability issues I'm trying to overcome, so if I can make this work with the procd in place I'd be a whole lot happier.

Going with:

USE_PROCD = False

gets both startds working, but sets me back: without the procd, scalability seems to be limited to ~10-12 slots per startd on a Win2k8 box.

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing