[HTCondor-users] Credd timeouts lead to failed logins lead to account lockouts


HTCondor for Windows 8.4.8:

We’re running as owner.
It’s worked pretty well for a dozen years.
But lately, we’re getting intermittent avalanches of failed logins on the executing machines…
…which lead to our accounts getting temporarily locked out (and wow, does that make the failed login problem worse!)

I picked one failed login that wasn’t caused by the account lockout and traced it backwards.

StarterLog says it timed out trying to read from the credd.

06/21/19 11:38:42 (pid:4668) init_user_ids: want user 'user@DOMAIN', current is '(null)@(null)'

06/21/19 11:38:42 (pid:4668) Locally stored credential for user@DOMAIN is stale

06/21/19 11:38:42 (pid:4668) trying to fetch password from credd: machine.where.credd.lives:9620

06/21/19 11:38:52 (pid:4668) condor_read(): timeout reading 5 bytes from credd machine.where.credd.lives:9620.

06/21/19 11:38:52 (pid:4668) IO: Failed to read packet header

06/21/19 11:38:52 (pid:4668) SECMAN: no classad from server, failing

06/21/19 11:38:52 (pid:4668) ERROR: SECMAN:2007:Failed to end classad message.

06/21/19 11:38:52 (pid:4668) Failed to contact credd machine.where.credd.lives:9620:

06/21/19 11:38:52 (pid:4668) ERROR: Could not locate valid credential for user 'user@DOMAIN'

06/21/19 11:38:52 (pid:4668) Could not initialize user_priv as "user@DOMAIN".

Make sure this account's password is securely stored with condor_store_cred.

06/21/19 11:38:52 (pid:4668) ERROR: Failed to determine what user to run this job as, aborting

06/21/19 11:38:52 (pid:4668) Failed to initialize JobInfoCommunicator, aborting

06/21/19 11:38:52 (pid:4668) Unable to start job.

06/21/19 11:38:52 (pid:4668) **** condor_starter (condor_STARTER) pid 4668 EXITING WITH STATUS 1

06/21/19 11:38:52 (pid:4668) Deleting the StarterHookMgr
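
The excerpt above shows the starter waiting from 11:38:42 to 11:38:52 before giving up on the credd. To check whether other starters hit the same wait, a minimal sketch along these lines could scan a StarterLog. It assumes the timestamp and message formats shown in the excerpt; the function name `credd_fetch_waits` is made up for illustration and is not part of HTCondor.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the StarterLog excerpt above:
#   06/21/19 11:38:42 (pid:4668) trying to fetch password from credd: ...
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")


def credd_fetch_waits(lines):
    """Return [(pid, seconds)] between a credd fetch attempt and its read timeout."""
    started = {}  # pid -> timestamp of the most recent fetch attempt
    waits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # blank lines and continuation lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        pid, msg = int(m.group(2)), m.group(3)
        if msg.startswith("trying to fetch password from credd"):
            started[pid] = ts
        elif msg.startswith("condor_read(): timeout") and pid in started:
            waits.append((pid, (ts - started.pop(pid)).total_seconds()))
    return waits
```

Called as `credd_fetch_waits(open(path))` on a StarterLog, it returns one entry per timed-out fetch, which makes it easy to see whether every failure waits the same length of time.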

What’s going on at credd?

The entries for 11:38:52 had already scrolled out of the logs. CreddLog.old is 10 MB (covering just 3 seconds!) of nothing but:

06/21/19 12:20:46 condor_read() failed: recv() 5 bytes from <aa.bb.cc.dd:54858> returned -1, timeout=20, errno=10054 .

06/21/19 12:20:46 IO: Failed to read packet header

The port number changes, but the IP address seems constant as I quickly scan through the file.

Turns out that aa.bb.cc.dd is the workstation I was using to launch the jobs that ended up getting my account locked out.
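
Rather than scanning a 10 MB file by eye, the failed reads could be tallied per peer address. The sketch below assumes the CreddLog line format shown in the excerpt; `tally_failed_reads` is a made-up name, and the log path has to be supplied on the command line. For what it's worth, errno 10054 is the Winsock code WSAECONNRESET (the connection was reset by the remote side), so grouping by errno as well as by host may help separate resets from genuine timeouts.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the CreddLog excerpt:
#   condor_read() failed: recv() 5 bytes from <aa.bb.cc.dd:54858> returned -1, timeout=20, errno=10054 .
PEER_RE = re.compile(
    r"condor_read\(\) failed: recv\(\) \d+ bytes from <([^:>]+):(\d+)>.*errno=(\d+)"
)


def tally_failed_reads(lines):
    """Count failed condor_read() entries per (peer host, errno), ignoring the port."""
    counts = Counter()
    for line in lines:
        m = PEER_RE.search(line)
        if m:
            host, _port, errno = m.groups()
            counts[(host, int(errno))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py <path to CreddLog.old>
    with open(sys.argv[1], errors="replace") as f:
        for (host, errno), n in tally_failed_reads(f).most_common():
            print(f"{host}\terrno={errno}\t{n}")
```

If a single host accounts for nearly every entry, that backs up the impression from the quick scan; a second host showing up would be worth chasing separately.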

I can’t see anything obviously useful in its schedd or shadow logs – or its Windows Event logs.

Questions:

Why is credd trying to read from my launching workstation at all?
Where might I look to figure out why those reads time out? (I’ve been through the Condor and Windows Event logs; nothing obvious in either place.)
Are those timeouts the reason credd can’t respond to the job running machine?
Why does the job running machine insist on trying to log on as user@DOMAIN even when the credd lookup fails?

Thanks!!
