
[HTCondor-users] Credd timeouts lead to failed logins lead to account lockouts



HTCondor for Windows 8.4.8:

  • We’re running as owner
  • It’s worked pretty well for a dozen years.
  • But lately, we’re getting intermittent avalanches of failed logins on the executing machines…
  • …which lead to our accounts getting temporarily locked out (and wow, does that make the failed login problem worse!)

 

I traced backwards from one failed login that wasn’t caused by the account lockout.

StarterLog says it timed out trying to read from the credd.

 

06/21/19 11:38:42 (pid:4668) init_user_ids: want user 'user@DOMAIN', current is '(null)@(null)'

06/21/19 11:38:42 (pid:4668) Locally stored credential for user@DOMAIN is stale

06/21/19 11:38:42 (pid:4668) trying to fetch password from credd: machine.where.credd.lives:9620

06/21/19 11:38:52 (pid:4668) condor_read(): timeout reading 5 bytes from credd machine.where.credd.lives:9620.

06/21/19 11:38:52 (pid:4668) IO: Failed to read packet header

06/21/19 11:38:52 (pid:4668) SECMAN: no classad from server, failing

06/21/19 11:38:52 (pid:4668) ERROR: SECMAN:2007:Failed to end classad message.

06/21/19 11:38:52 (pid:4668) Failed to contact credd machine.where.credd.lives:9620:

06/21/19 11:38:52 (pid:4668) ERROR: Could not locate valid credential for user 'user@DOMAIN'

06/21/19 11:38:52 (pid:4668) Could not initialize user_priv as "user@DOMAIN".

       Make sure this account's password is securely stored with condor_store_cred.

06/21/19 11:38:52 (pid:4668) ERROR: Failed to determine what user to run this job as, aborting

06/21/19 11:38:52 (pid:4668) Failed to initialize JobInfoCommunicator, aborting

06/21/19 11:38:52 (pid:4668) Unable to start job.

06/21/19 11:38:52 (pid:4668) **** condor_starter (condor_STARTER) pid 4668 EXITING WITH STATUS 1

06/21/19 11:38:52 (pid:4668) Deleting the StarterHookMgr
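To see how long each starter waits on the credd before giving up, the excerpt above can be reduced to one number per starter. This is a minimal sketch, assuming every StarterLog line follows the `MM/DD/YY HH:MM:SS (pid:N) message` layout shown here; the function name `credd_fetch_delays` and the choice of start and end messages are my own, not anything HTCondor provides.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt: "06/21/19 11:38:42 (pid:4668) message"
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")

def credd_fetch_delays(lines):
    """Return (pid, seconds) for each credd fetch that ended in a contact failure."""
    started = {}   # pid -> time the fetch attempt began
    delays = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines and blanks carry no timestamp
        when = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        pid, msg = int(m.group(2)), m.group(3)
        if msg.startswith("trying to fetch password from credd"):
            started[pid] = when
        elif msg.startswith("Failed to contact credd") and pid in started:
            delays.append((pid, (when - started.pop(pid)).total_seconds()))
    return delays

sample = [
    "06/21/19 11:38:42 (pid:4668) trying to fetch password from credd: machine.where.credd.lives:9620",
    "06/21/19 11:38:52 (pid:4668) condor_read(): timeout reading 5 bytes from credd machine.where.credd.lives:9620.",
    "06/21/19 11:38:52 (pid:4668) Failed to contact credd machine.where.credd.lives:9620:",
]
print(credd_fetch_delays(sample))  # [(4668, 10.0)]
```

For this excerpt the starter gave up after 10 seconds. Running the same function over a whole StarterLog would show whether every failure hits that same limit.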

 

What’s going on at credd?

 

The 11:38:52 entries have already scrolled out of the credd logs.  The CreddLog.old is 10MB (covering only 3 seconds!) of nothing but:

06/21/19 12:20:46 condor_read() failed: recv() 5 bytes from <aa.bb.cc.dd:54858> returned -1, timeout=20, errno=10054 .

06/21/19 12:20:46 IO: Failed to read packet header

The port number changes, but the IP address seems constant as I quickly scan through the file.
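Rather than eyeballing 10MB of log, the peer addresses can be tallied. This is a rough sketch, assuming every failure line has the same shape as the one quoted above; `tally_peers` is a made-up helper name. It groups by IP address and errno and ignores the changing port. On Windows, errno 10054 is WSAECONNRESET (connection reset by the peer).

```python
import re
from collections import Counter

# Assumed shape, copied from the CreddLog line above:
# condor_read() failed: recv() 5 bytes from <ip:port> returned -1, timeout=20, errno=10054 .
PEER_RE = re.compile(
    r"condor_read\(\) failed: recv\(\) \d+ bytes from <([^:>]+):(\d+)> "
    r"returned -?\d+, timeout=\d+, errno=(\d+)"
)

def tally_peers(lines):
    """Count recv() failures per (peer address, errno), ignoring the source port."""
    counts = Counter()
    for line in lines:
        m = PEER_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(3)))] += 1
    return counts

sample = [
    "06/21/19 12:20:46 condor_read() failed: recv() 5 bytes from <aa.bb.cc.dd:54858> returned -1, timeout=20, errno=10054 .",
    "06/21/19 12:20:46 IO: Failed to read packet header",
    "06/21/19 12:20:46 condor_read() failed: recv() 5 bytes from <aa.bb.cc.dd:54859> returned -1, timeout=20, errno=10054 .",
]
for (peer, err), n in tally_peers(sample).most_common():
    print(peer, err, n)
```

With the real file, something like `tally_peers(open("CreddLog.old"))` would confirm whether a single address really accounts for all of the failures.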

 

Turns out that aa.bb.cc.dd is the workstation I was using to launch the jobs that ended up getting my account locked out.

I can’t see anything obviously useful in its schedd or shadow logs – or its Windows Event logs.

 

Questions:

  • Why is credd trying to read from my launching workstation at all?
  • Where might I look to figure out why those reads time out?  (I’ve been through the Condor & Windows Event logs, nothing obvious in either place)
  • Are those timeouts the reason credd can’t respond to the machine running the job?
  • Why does the machine running the job insist on trying to log on as user@DOMAIN even when the credd lookup fails?

 

Thanks!!