
Re: [HTCondor-users] HTCondor high availability



Hi again,

while searching through the log files once more, something caught my eye:
when running condor_q on master2 while master1 is active, the following
lines appear in the SchedLog (along with the segmentation fault message):

10/08/20 11:50:30 (pid:47347) Number of Active Workers 0  
10/08/20 11:50:41 (pid:47347) AUTHENTICATE: handshake failed!    
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: authentication of
<192.168.1.22:10977> did not result in a valid mapped user name, which is
required for this command (519 QUERY_JOB_ADS_WITH_AUTH), so aborting. 
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: reason for authentication
failure: AUTHENTICATE:1002:Failure performing
handshake|AUTHENTICATE:1004:Failed to authenticate using
KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to
lstat(/tmp/FS_XXXGNYmKn)  

Do I need to configure any other authentication methods, in addition to all
servers using LDAP via PAM?
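
In case it helps, this is roughly what I am looking at; the parameter
names are taken from the HTCondor manual, not from my actual config:

    # Which methods are client and daemons actually negotiating?
    condor_config_val SEC_DEFAULT_AUTHENTICATION_METHODS
    condor_config_val SEC_CLIENT_AUTHENTICATION_METHODS

    # FS authentication has condor_q create a file under /tmp which the
    # schedd then lstat()s, so both processes must see the same,
    # world-writable /tmp:
    ls -ld /tmp                           # expect drwxrwxrwt
    systemctl show -p PrivateTmp condor   # PrivateTmp=yes would hide it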

Kind regards

Christian



-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
Hennen, Christian
Sent: Thursday, 1 October 2020 12:58
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

Hello Thomas,

the spool directory (/clients/condor/spool) is located on an NFS v3 share
that every server has access to (/clients). All machines have a local user
(r-admin) with uid and gid 1000, and the spool directory is owned by that
user, since it is configured as the Condor user (see condor_config.local in
the Serverfault thread). Every other user is mapped via LDAP on every
server, including the storage cluster. On both master servers the user
"condor" has the same uid and gid.
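
For reference, these are roughly the checks I ran (a sketch, not a
transcript):

    # on master1 and on master2:
    id r-admin    # expect uid=1000 gid=1000 on every machine
    id condor     # must be identical on both masters

    # write test on the shared spool, as the account Condor runs under:
    sudo -u r-admin touch /clients/condor/spool/write_test_$(hostname)
    ls -ln /clients/condor/spool   # NFSv3 compares numeric ids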

Kind regards

Christian


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
Thomas Hartmann
Sent: Thursday, 1 October 2020 11:46
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] HTCondor high availability

Hi Christian,

the spool dir resides on a shared file system between both nodes, right?
Maybe you can check whether it is writable from both clients and whether
the users/permissions work on both? (Sometimes NFS is a bit fiddly with
the ID mapping...)
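
Something along these lines (just a sketch; the storage hostname is a
placeholder):

    mount | grep /clients          # options as seen by each node
    showmount -e storage-host      # exports as offered by the server
    # root_squash (the usual default) maps root to nobody on the share,
    # which bites if any daemon writes there as root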

Cheers,
  Thomas


On 01/10/2020 09.58, Hennen, Christian wrote:
> Hi,
> 
>  
> 
> I am currently trying to make the job queue and submission mechanism
> of a local, isolated HTCondor cluster highly available. The cluster
> consists of two master servers (previously one), several compute
> nodes, and a central storage system. DNS, LDAP, and other services
> are provided by the master servers.
> 
>  
> 
> I followed the directions under
> https://htcondor.readthedocs.io/en/latest/admin-manual/high-availability.html
> but it doesn't seem to work the way it should. Further information
> about the setup and the problems has been posted to Serverfault:
> https://serverfault.com/questions/1035879/htcondor-high-availability
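> 
> For reference, the knobs that page describes for a highly available
> schedd look roughly like this (copied from the manual, not my exact
> condor_config.local):
> 
>     # spool shared between both masters
>     SPOOL = /clients/condor/spool
>     # let condor_master fail the schedd over via a lock on the share
>     MASTER_HA_LIST = SCHEDD
>     HA_LOCK_URL = file:/clients/condor/spool
>     VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
>     SCHEDD_NAME = had-schedd@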
> 
>  
> 
> Maybe some of you have insights on this? Any help would be
> appreciated!
> 
>  
> 
> Kind regards
> 
> Christian Hennen, M.Sc.
> 
> Project Manager Infrastructural Services
> 
> Zentrum für Informations-, Medien- und Kommunikationstechnologie (ZIMK)
> 
> Universität Trier | Universitätsring 15 | 54296 Trier | Germany 
> www.uni-trier.de <http://www.uni-trier.de/>
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
