
Re: [HTCondor-users] Configuring a CE/Schedd



Hi Brian,

The messages are from /var/log/condor/SchedLog.

I figured it might have something to do with local permissions on the certificate files.

It seems to be unique to the scheduler daemon, as I'm not experiencing this on the collector or the worker nodes. I've included a dump of all the schedd values in the condor config.(*)

The machine is definitely configured with a host certificate: it uses it when communicating with other infrastructure, and the ARC CE also uses it.

The permissions of the certificate and key also match up with the other machines.
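
For anyone wanting to repeat that comparison, a minimal Python sketch for printing the mode and ownership of the certificate and key on each host follows. The two paths are the conventional grid-security locations and are an assumption, not values taken from this configuration; the helper name `describe` is purely illustrative.

```python
import os
import stat

# Assumed conventional locations; adjust to wherever the host
# certificate and key actually live on the machine being checked.
CANDIDATE_PATHS = (
    "/etc/grid-security/hostcert.pem",
    "/etc/grid-security/hostkey.pem",
)

def describe(path):
    """Return the permission bits and numeric owner/group of a file."""
    st = os.stat(path)
    return {
        "path": path,
        "mode": oct(stat.S_IMODE(st.st_mode)),  # e.g. '0o600'
        "uid": st.st_uid,
        "gid": st.st_gid,
    }

if __name__ == "__main__":
    for path in CANDIDATE_PATHS:
        try:
            print(describe(path))
        except OSError as err:
            # Missing or unreadable files are reported, not fatal.
            print(path, "->", err)
```

Running this on the schedd host and on a host that works, then diffing the output, shows whether mode or ownership differs. It says nothing about which user the daemon runs as when it reads the files.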

Thanks, Iain

(*)
~]# condor_config_val -expand -dump | grep SCHEDD
ALLOW_NEGOTIATOR_SCHEDD = central-manager@xxxxxxx/*.cern.ch
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch
DAEMON_LIST = MASTER, SHARED_PORT, SCHEDD
GRIDMANAGER_CONTACT_SCHEDD_DELAY = 5
MAX_NUM_SCHEDD_LOG = 10
MAX_SCHEDD_LOG = 104857600
SCHEDD = /usr/sbin/condor_schedd
SCHEDD.ALLOW_READ =  *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS
SCHEDD_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address
SCHEDD_BACKUP_SPOOL = 
SCHEDD_CRON_NAME = 
SCHEDD_DAEMON_AD_FILE = /var/lib/condor/spool/.schedd_classad
SCHEDD_DEBUG = D_PID
SCHEDD_INTERVAL = 
SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY = 5
SCHEDD_LOG = /var/log/condor/SchedLog
SCHEDD_MAX_FILE_DESCRIPTORS = 4096
SCHEDD_MIN_INTERVAL = 5
SCHEDD_NAME = 
SCHEDD_PREEMPTION_RANK = 
SCHEDD_PREEMPTION_REQUIREMENTS = 
SCHEDD_QUERY_WORKERS = 6
SCHEDD_ROUND_ATTR_ProportionalSetSizeKb = 25%
SCHEDD_ROUND_ATTR_ResidentSetSize = 25%
SCHEDD_SEND_VACATE_VIA_TCP = false
SCHEDD_SUPER_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address.super
SCHEDDS = schedd@xxxxxxx/*.cern.ch
SETTABLE_ATTRS_ADVERTISE_SCHEDD = 
STATISTICS_WINDOW_QUANTUM_SCHEDD = 

________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
Sent: 25 March 2015 18:37
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Configuring a CE/Schedd

Hi Iain,

From the message:

> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.

The important part of the message is:

"the remote (client) side was not able to acquire its credentials"

This indicates that the schedd isn’t using its certificate (or isn’t configured with one).
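
The failure text in that log line reads as a `|`-separated chain of `SUBSYSTEM:CODE:message` entries, with the most specific reason last. A minimal Python sketch for splitting it apart is below. The format is inferred only from the line quoted above, and `parse_error_stack` is a hypothetical helper, not part of HTCondor.

```python
def parse_error_stack(stack):
    """Split 'SUBSYS:CODE:message|SUBSYS:CODE:message|...' into tuples."""
    entries = []
    for part in stack.split("|"):
        # Split on the first two colons only, so any colon inside
        # the message text is preserved.
        subsys, code, message = part.split(":", 2)
        entries.append((subsys, int(code), message))
    return entries

# The error chain from the SchedLog line, without the leading
# "DC_AUTHENTICATE: required authentication of ... failed: " prefix.
stack = (
    "AUTHENTICATE:1002:Failure performing handshake|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5002:Failed to authenticate because the remote (client) side "
    "was not able to acquire its credentials."
)

entries = parse_error_stack(stack)
for subsys, code, message in entries:
    print(f"{subsys:<12} {code}  {message}")

# The last entry is the GSI:5002 reason discussed above.
last = entries[-1]
```

Read this way, FS, KERBEROS and GSI each appear as an attempted method, and the final GSI entry carries the credential problem.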

Is this from a shadow log?  If so, you don't want to be using any of these methods - you should be using match auth for your setup.  Perhaps that's something which got lost in the merge?

Brian

> On Mar 24, 2015, at 1:48 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>
> Hi,
>
> I'm in the process of finalizing the CE/Schedd setup for our pool; we're using Puppet.
>
> I had the CE working and acting as a scheduler with a manual config and decided to move it to the HEP-Puppet/htcondor module.
>
> This is the output I get in SchedLog(*). I've removed the IP, but it's the machine's own IP in all instances.
>
> After this it just proceeds to spam condor_write errors until it fills the log file and starts a new one.
>
> The CE is in the certificate mapfile along with all the other hosts, and apart from the ordering of hostnames, a vimdiff shows no difference between the security config file for this machine and the one that the central manager uses.
>
> Has anyone else experienced this issue?
>
> Thanks, Iain
>
> (*)
> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
> 03/24/15 19:12:33 -------- Begin starting jobs --------
> 03/24/15 19:12:33 -------- Done starting jobs --------
> 03/24/15 19:13:14 Received a superuser command
> 03/24/15 19:13:14 This process has a valid certificate & key
> 03/24/15 19:13:14 Failed to read end of message from <MACHINE_IP:34711>; 1280 untouched bytes.
> 03/24/15 19:13:14 condor_write(): Socket closed when trying to write 13 bytes to <MACHINE_IP:34711>, fd is 15, errno=104 Connection reset by peer
> 03/24/15 19:13:14 Buf::write(): condor_write() failed
> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711> in non-blocking mode
> 03/24/15 19:13:14 IO: EOF reading packet header
> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711>
> 03/24/15 19:13:14 IO: EOF reading packet header
> 03/24/15 19:13:14 AUTHENTICATE: handshake failed!
> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

