
[HTCondor-users] Condor Cred issues on windows pool



Fellow condor users,

 

I have an all-Windows Condor pool consisting of one central manager, 2-3 schedulers, a dedicated credd server, and about 30 execute machines. All machines except the credd server run a startd and can therefore accept jobs in some capacity. Versions range from 7.6.8 up to 8.0.3. This configuration worked seamlessly for about three years until last week, when my credd server died and I had to migrate to a new machine. To do so, I copied the config.local of the old credd server (fortunately I had a backup) to a nearly identical machine (same OS, hardware, etc.), and I have been unable to bring the pool up since. I have pasted what I think are the relevant configuration settings, along with the telling log messages, below. In a nutshell, jobs start and then crash because they cannot find a password for my account. However, when I run

 

condor_store_cred add

 

it completes successfully, and indeed

 

condor_store_cred query

 

reports that credentials have been stored and are valid.  I cannot find the disconnect between the credd server/scheduler and starters.  I have tried changing credd servers (the configuration below actually has the CM, 10.1.216.182, as the credd server), different users, different schedulers, and always end up with the same result.  Furthermore, the log messages are not leading me to an answer as they had in the past.  Has anyone managed to work through this issue?  If so, I would greatly appreciate some guidance.

 

Eric

 

Condor_config.local (CM):

   CREDD_HOST = $(FULL_HOSTNAME)

   STARTER_ALLOW_RUNAS_OWNER = True

   CREDD_CACHE_LOCALLY = True

   SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD

   ALLOW_CONFIG = Administrator@*,$(CONDOR_HOST)

   SEC_CONFIG_NEGOTIATION = REQUIRED

   SEC_CONFIG_AUTHENTICATION = REQUIRED

   SEC_CONFIG_ENCRYPTION = REQUIRED

   SEC_CONFIG_INTEGRITY = REQUIRED

 

CREDD_LOG = $(LOG)/CreddLog

CREDD_DEBUG = D_COMMAND

MAX_CREDD_LOG = 50000000

 

ALLOW_CONFIG = $(IP_ADDRESS),$(CONDOR_HOST),Administrator@*

ALLOW_WRITE = 10.*,*.$(UID_DoMAIN)

ALLOW_READ = *

---------------------------------------------------------------------------------------------

Condor_config.local (schedd):

CREDD_HOST = $(CONDOR_HOST).$(UID_DOMAIN)

STARTER_ALLOW_RUNAS_OWNER = True

CREDD_CACHE_LOCALLY = True

SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD

SEC_CONFIG_NEGOTIATION = REQUIRED

SEC_CONFIG_AUTHENTICATION = REQUIRED

SEC_CONFIG_ENCRYPTION = REQUIRED

SEC_CONFIG_INTEGRITY = REQUIRED

 

ALLOW_CONFIG = $(IP_ADDRESS),$(CONDOR_HOST),Administrator@*

ALLOW_WRITE = $(FULL_HOSTNAME),$(IP_ADDRESS),*.vms.ad.varian.com,10.1.*

 

 

WorkHours = ( (ClockMin >= 450 && ClockMin < 1080) && \
              (ClockDay > 0 && ClockDay < 6) )

AfterHours = ( (ClockMin < 450 || ClockMin >= 1080) || \
               (ClockDay == 0 || ClockDay == 6) )

 

#START = $(AfterHours) && $(UWCS_START)

 

#SUSPEND = $(WorkHours) || $(UWCS_SUSPEND)

 

#PREEMPT = $(WorkHours)

 

#START = TRUE

#SUSPEND = FALSE

#KILL = FALSE

#PREEMPT = FALSE

#STARTD_DEBUG=D_ALL

#MAX_NUM_CPUS = 3

DAEMON_LIST = MASTER, KBDD, SCHEDD
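
For readers unfamiliar with the WorkHours and AfterHours macros in the schedd config above, here is a minimal Python sketch of the time windows they describe. It assumes HTCondor's usual meaning of ClockMin (minutes since midnight) and ClockDay (0 = Sunday through 6 = Saturday); the function names are hypothetical and exist only for illustration. Note that the START, SUSPEND, and PREEMPT lines that would use these macros are commented out in the posted config, so the windows are not actually in effect.

```python
from datetime import datetime


def clock_min(t: datetime) -> int:
    """Minutes since local midnight, mirroring the ClockMin attribute."""
    return t.hour * 60 + t.minute


def clock_day(t: datetime) -> int:
    """Day of week with 0 = Sunday ... 6 = Saturday, mirroring ClockDay.

    Python's weekday() uses 0 = Monday, so shift by one and wrap.
    """
    return (t.weekday() + 1) % 7


def work_hours(t: datetime) -> bool:
    """07:30 up to (not including) 18:00, Monday through Friday."""
    return (450 <= clock_min(t) < 1080) and (0 < clock_day(t) < 6)


def after_hours(t: datetime) -> bool:
    """Before 07:30, from 18:00 on, or any time on Saturday or Sunday."""
    return (clock_min(t) < 450 or clock_min(t) >= 1080) or clock_day(t) in (0, 6)


if __name__ == "__main__":
    # Timestamp taken from the first Starter.slot1 log line in this post.
    stamp = datetime(2014, 8, 5, 7, 11)
    print(work_hours(stamp), after_hours(stamp))
```

The two expressions are exact complements, so every timestamp falls into exactly one of the two windows.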

 

-------------------------------------------------------------------------------------------------

Log excerpt from Starter.slot1:

 

08/05/14 07:11:35 Using config source: C:\condor\condor_config

08/05/14 07:11:35 Using local config sources:

08/05/14 07:11:35    C:\condor/condor_config.local

08/05/14 07:11:35 DaemonCore: command socket at <10.1.216.198:58156>

08/05/14 07:11:35 DaemonCore: private command socket at <10.1.216.198:58156>

08/05/14 07:11:35 GLEXEC_JOB not supported on this platform; ignoring

08/05/14 07:11:35 Communicating with shadow <10.1.216.182:3690>

08/05/14 07:11:35 Submitting machine is "mv6d8xfmnb1.vms.ad.varian.com"

08/05/14 07:11:35 setting the orig job name in starter

08/05/14 07:11:35 setting the orig job iwd in starter

08/05/14 07:11:39 ERROR: Could not locate valid credential for user 'eabel@VMS'

08/05/14 07:11:39 Could not initialize user_priv as "VMS\eabel".
                Make sure this account's password is securely stored with condor_store_cred.

08/05/14 07:11:39 ERROR: Failed to determine what user to run this job as, aborting

08/05/14 07:11:39 Failed to initialize JobInfoCommunicator, aborting

08/05/14 07:11:39 Unable to start job.

 

 

Log excerpt from CreddLog:

 

NewSession = "YES"

ParentUniqueID = "MV6D8XFMNB1:6996:1407247268"

AuthMethods = "NTSSPI, PASSWORD"

Enact = "NO"

CryptoMethods = "3DES,BLOWFISH"

OutgoingNegotiation = "PREFERRED"

CurrentTime = time()

RemoteVersion = "$CondorVersion: 7.8.7 Dec 12 2012 BuildID: 86173 $"

ServerCommandSock = "<10.1.216.182:1450>"

Integrity = "OPTIONAL"

ServerPid = 7984

Encryption = "OPTIONAL"

Authentication = "OPTIONAL"

SessionLease = 3600

SessionDuration = "86400"

Subsystem = "SHADOW"

Command = 81099

08/05/14 10:02:37 condor_write(): Socket closed when trying to write 291 bytes to <10.1.216.182:1750>, fd is 464

08/05/14 10:02:37 Buf::write(): condor_write() failed

08/05/14 10:02:37 SECMAN: Error sending response classad to <10.1.216.182:1750>!