
[Condor-users] Dealing with a flaky domain controller and startd's that cannot authenticate on startup



Over the weekend one of my Windows pools kept having all its startds go
into the dreaded Claimed+Idle state. After a whole lot of resetting and
observing I was able to track the problem down. It starts with jobs not
being able to start up as one of the assigned domain accounts:

6/3 13:59:52 ******************************************************
6/3 13:59:52 ** condor_starter (CONDOR_STARTER) STARTING UP
6/3 13:59:52 ** d:\abc\condor\bin\condor_starter.exe
6/3 13:59:52 ** $CondorVersion: 6.8.0 Jul 19 2006 $
6/3 13:59:52 ** $CondorPlatform: INTEL-WINNT50 $
6/3 13:59:52 ** PID = 352
6/3 13:59:52 ** Log last touched 6/3 13:59:19
6/3 13:59:52 ******************************************************
6/3 13:59:52 Using config source: \\sj-negotiator\condor\configs\condor_config.WINNT51
6/3 13:59:52 Using local config sources: 
6/3 13:59:52    d:/abc/condor/local.sj-bs2330-316/condor_config.local
6/3 13:59:52    d:\abc\condor/condor_config.local
6/3 13:59:52 DaemonCore: Command Socket at <137.57.203.67:2707>
6/3 13:59:52 Setting resource limits not implemented!
6/3 13:59:52 Communicating with shadow <137.57.202.107:43858>
6/3 13:59:52 Submitting machine is "sj-schedd1.altera.com"
6/3 13:59:52 ERROR: Could not locate valid credential for user 'swbatch2@ALTERA'
6/3 13:59:52 Could not initialize user_priv as "ALTERA\swbatch2".
	Make sure this account's password is securely stored with condor_store_cred.
6/3 13:59:52 ERROR: Failed to determine what user to run this job as, aborting
6/3 13:59:52 Failed to initialize JobInfoCommunicator, aborting
6/3 13:59:52 Unable to start job.

The problem is intermittent and it does eventually go away. After about
15 minutes a machine that was unable to start jobs is able to connect
and start the job. This leads me to believe it's the domain controller
(or the network to the domain controller) and not something wrong with
the credentials stored on the machine. But it's not inconceivable that
Condor is somehow getting a bad password for the account -- although I
don't see how it could get a bad passwd from its cache on one attempt
and then not on the next attempt. Can someone from the Condor Windows
team comment on that?
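
To put a number on "about 15 minutes", the starter log can be scanned
for the credential error and the failures timestamped. What follows is
only a minimal sketch, not anything Condor ships: the function names are
hypothetical, the timestamp pattern is inferred from the log excerpt
above, and the year is an arbitrary placeholder because the log lines
carry none.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS message" layout seen in the starter log above.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")
CRED_ERROR = "Could not locate valid credential"


def credential_failures(lines, year=2000):
    """Return the datetimes at which a credential lookup failed.

    `year` is an arbitrary placeholder: the log itself records no year.
    """
    failures = []
    for line in lines:
        m = STAMP.match(line)
        if m and CRED_ERROR in m.group(6):
            month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
            failures.append(datetime(year, month, day, hh, mm, ss))
    return failures


def failure_span_minutes(failures):
    """Minutes between the first and last failure (0.0 if fewer than two)."""
    if len(failures) < 2:
        return 0.0
    return (max(failures) - min(failures)).total_seconds() / 60.0
```

Feeding it the lines of each machine's starter log would show whether
the failures on different machines cluster in the same window, which
would point at the domain controller rather than at any one machine's
stored credentials.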

The bigger problem is that after a while the recovery rate for failed
startd's starts to fall way behind the startup rate for new jobs, and
the scheduler starts assigning jobs to startd's that are still claimed
but have had their claim expire. Eventually I end up with all the
Windows machines in my pool in the Claimed+Idle state and the queue
processing ends up essentially halted. I also start seeing my Windows
startd's asserting in NTsenders.C and NTrecievers.C:

6/4 08:40:20 ******************************************************
6/4 08:40:20 ** condor_starter (CONDOR_STARTER) STARTING UP
6/4 08:40:20 ** d:\abc\condor\bin\condor_starter.exe
6/4 08:40:20 ** $CondorVersion: 6.8.0 Jul 19 2006 $
6/4 08:40:20 ** $CondorPlatform: INTEL-WINNT50 $
6/4 08:40:20 ** PID = 2552
6/4 08:40:20 ** Log last touched 6/4 07:45:13
6/4 08:40:20 ******************************************************
6/4 08:40:20 Using config source: \\sj-negotiator\condor\configs\condor_config.WINNT51
6/4 08:40:20 Using local config sources:
6/4 08:40:20    d:/abc/condor/local.sj-bs3400-308/condor_config.local
6/4 08:40:20    d:\abc\condor/condor_config.local
6/4 08:40:20 DaemonCore: Command Socket at <137.57.203.1:4043>
6/4 08:40:20 Setting resource limits not implemented!
6/4 08:40:20 Communicating with shadow <137.57.202.107:34912>
6/4 08:40:20 Submitting machine is "sj-schedd1.altera.com"
6/4 08:40:20 File transfer completed successfully.
6/4 08:40:21 Starting a VANILLA universe job with ID: 1813.1481
6/4 08:40:21 IWD: d:\abc\condor/execute\dir_2552
6/4 08:40:21 Output file: d:\abc\condor/execute\dir_2552\wrapper.log
6/4 08:40:21 Error file: d:\abc\condor/execute\dir_2552\wrapper.err
6/4 08:45:21 condor_read(): timeout reading buffer.
6/4 08:45:21 ERROR "Assertion ERROR on (result)" at line 322 in file ..\src\condor_starter.V6.1\NTsenders.C
6/4 08:45:21 ShutdownFast all jobs.

Once a machine gets here it's lost. I have to hold all the jobs in my
queue, wait for the Claimed+Idles to clear, release the jobs, and then
watch the pool slowly degenerate into this situation again.
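
The "wait for the Claimed+Idles to clear" step could at least be
watched automatically. This is a rough sketch under stated assumptions:
it relies on condor_status's -format option to print each machine's
State and Activity on one line, and the function names and the 0.9
threshold are hypothetical values picked for illustration, not anything
from my pool's configuration.

```python
import subprocess

# Ask condor_status for one "State Activity" line per machine ad,
# e.g. "Claimed Idle".
STATUS_CMD = ["condor_status",
              "-format", "%s ", "State",
              "-format", "%s\n", "Activity"]


def count_claimed_idle(status_text):
    """Count lines that are exactly 'Claimed Idle'."""
    return sum(1 for line in status_text.splitlines()
               if line.split() == ["Claimed", "Idle"])


def pool_is_wedged(status_text, threshold=0.9):
    """True if at least `threshold` of the machines are Claimed+Idle.

    The 0.9 default is an arbitrary illustration, not a tuned value.
    """
    machines = [line for line in status_text.splitlines() if line.strip()]
    if not machines:
        return False
    return count_claimed_idle(status_text) / len(machines) >= threshold


def query_pool():
    """Run condor_status and return its text output (needs a live pool)."""
    return subprocess.run(STATUS_CMD, capture_output=True,
                          text=True, check=True).stdout
```

Calling pool_is_wedged(query_pool()) from a periodic script would flag
the slide early; the hold and release steps themselves would still be
done by hand.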

I'm trying to figure out how to tweak my Condor pool so it can deal with
this scenario. I'm okay if job processing falls to a crawl when the
domain controller gets flaky, but I don't want to see my pool end up all
Claimed+Idle like it is now and doing zero work. This goes beyond my
Condor configuration knowledge, so I'm asking anyone out there if they
know how to configure a system to tolerate a flaky domain controller
like this. I'm looking for the startd equivalent to JobLeaseDuration, I
guess -- a way to tell the startd to slow down the job dump rate if it
cannot start a job up properly.

- Ian


--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300