I've fixed the issue.
For some reason the owner and group of the /var/lock/condor directory had been changed to root.
Changing it back to the condor user and group and restarting was enough for the nodes to register in the pool.
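For anyone hitting the same thing, here is a rough sketch of how the ownership could be checked across nodes. It is not part of HTCondor; the function names find_wrong_owner and report are made up for illustration, and it assumes the daemons run as a local user and group both named condor, as on my nodes.

```python
import grp
import os
import pwd
from pathlib import Path


def find_wrong_owner(root, uid, gid):
    """Return every path under root (root included) not owned by uid:gid."""
    root = Path(root)
    mismatched = []
    for path in [root, *root.rglob("*")]:
        # lstat so that symlinks are checked themselves, not their targets
        st = path.lstat()
        if st.st_uid != uid or st.st_gid != gid:
            mismatched.append(str(path))
    return mismatched


def report(root, user, group):
    """Print offending paths; user and group names are assumptions."""
    uid = pwd.getpwnam(user).pw_uid
    gid = grp.getgrnam(group).gr_gid
    for path in find_wrong_owner(root, uid, gid):
        print(path)
```

Calling report("/var/lock/condor", "condor", "condor") on an affected node should list the directory itself. The same check could be pointed at /var/log/condor, since the StartLog could not be opened either.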
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Iain Bradford Steers [iain.steers@xxxxxxx]
Sent: 10 March 2015 08:45
Subject: [HTCondor-users] SharedPortEndpoint: failed to bind to /var/lock/condor/daemon_sock/25689_90ae: Permission denied
I noticed that some of my worker nodes never showed up in condor_status after I created them.
Running pstree on the nodes showed that the startd wasn't running. I attempted to start it and got the following error.
03/10/15 08:38:05 Can't open "/var/log/condor/StartLog"
ERROR "Cannot open log file '/var/log/condor/StartLog'" at line 208 in file /slots/01/dir_21000/userdir/src/condor_utils/dprintf_setup.cpp
So I temporarily renamed the file, and I'm now getting the following in the StartLog.
03/10/15 08:24:38 ERROR: SharedPortEndpoint: failed to bind to /var/lock/condor/daemon_sock/25689_90ae: Permission denied
03/10/15 08:24:38 ERROR "Failed to start local listener (USE_SHARED_PORT=true)" at line 2897 in file /slots/01/dir_21000/userdir/src/condor_daemon_core.V6/daemon_core.cpp
I'm using Puppet to configure HTCondor, so the configuration shouldn't differ between the working worker nodes and this one.