
Re: [Condor-users] Procd behaving badly in a multi-startd setup




On Thursday, 8 September, 2011 at 11:09 AM, Dan Bradley wrote:



On 9/7/11 4:41 PM, Ian Chesal wrote:
On Wednesday, 7 September, 2011 at 5:15 PM, Ian Chesal wrote:
I may have spoken too quickly on the multi-startd setup working. I thought my troubles were due to collisions on the starter log files, but after implementing the fix Todd suggested I'm still seeing some bad behaviour (but the fix for the log files worked brilliantly).

It appears that I can only start jobs under one startd or the other. Not both. The first startd to run jobs after a Condor restart is the *only* startd that will run jobs until Condor is restarted again.

For example: I submitted two clusters of jobs. One targeted the slots on the first startd. The other targeted the slots on the second startd. If I let the first cluster start on the S1 startd, then the second cluster would attempt to run on the S2 startd and fail. And vice versa.

The log output on failure is always the same:

09/07/11 17:07:43 slot1: Got activate_claim request from shadow (<192.168.1.85:3382>)
09/07/11 17:07:43 slot1: Remote job ID is 9.0
09/07/11 17:07:43 Result of "register_subfamily" operation from ProcD: ERROR: The given PID is not part of the family tree
09/07/11 17:07:43 Create_Process: error registering family for pid 1256
09/07/11 17:07:43 ERROR "error registering process family with procd" at line 7917 in file c:\condor\execute\dir_4228\userdir\src\condor_daemon_core.v6\daemon_core.cpp
09/07/11 17:07:43 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/07/11 17:07:43 slot1: State change: No preempting claim, returning to owner
09/07/11 17:07:43 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/07/11 17:07:43 slot1: State change: IS_OWNER is false
09/07/11 17:07:43 slot1: Changing state: Owner -> Unclaimed

It looks like the procd doesn't like the idea of two startds on the machine. It apparently can't tell them apart, and it objects to the fact that the jobs being started on the second startd don't have a PPID equal to the PID of the first startd.

I'm either missing something that's procd-specific in my startd config, or the procd isn't going to work here. I'll try disabling the procd, but having it there has helped with the scalability issues I'm trying to overcome, so I'd be a whole lot happier if I can make this work with the procd in place.

Going with:

USE_PROCD = False

gets both startds working, but sets me back, as scalability seems to be limited to ~10-12 slots per startd without the procd on a Win2k8 box.


Hi Ian,

The problem is likely that both startds are creating their own procd, but these two procds are using the same named pipe for communication, so wires are getting crossed.  You could configure PROCD_PIPE differently for the two startds.  Or you could just configure the startds to share a single procd.  One way to achieve that is this:

MASTER.USE_PROCD = TRUE

That causes the master to create a procd, which is then shared by all of its children.  Depending on the answer to your puzzling performance problems, having a single procd may be better than two.  Then again, it could be worse.  It would be interesting to find out!
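
For comparison, the other option mentioned above (a separate pipe per startd) might look roughly like the sketch below. It rests on several assumptions that are not confirmed in this thread: that the two startds were launched with "-local-name S1" and "-local-name S2", that per-startd settings can be written with a SUBSYS.LOCALNAME prefix, and that the knob controlling the procd pipe is PROCD_ADDRESS (the message above calls it PROCD_PIPE). The pipe names are made up for illustration.

```
# Sketch only, not a tested configuration.
# Assumes local names S1 and S2 and that PROCD_ADDRESS is the
# setting that names the procd's communication pipe.
# Both pipe names below are hypothetical.
STARTD.S1.PROCD_ADDRESS = \\.\pipe\condor_procd_pipe_S1
STARTD.S2.PROCD_ADDRESS = \\.\pipe\condor_procd_pipe_S2
```

With distinct addresses each startd would talk only to its own procd, at the cost of running two procds instead of the single shared one that MASTER.USE_PROCD gives.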

I'll try that, thanks Dan.

Looking at the process tree, the condor_procd appears as a child of the condor_schedd (this is a standalone pool on a single machine I'm using for testing). I didn't notice the startds spawning their own procds at launch time, but I may have just missed that.

I'll report back on things.

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing