
Re: [HTCondor-users] Configuring a CE/Schedd



Hi Iain,

So, if this is in the schedd log, it's likely Arc trying to contact the Schedd?

That means we're looking at the ALLOW_READ statement and the relevant authentication method list (SEC_DEFAULT_AUTHENTICATION_METHODS or SEC_READ_AUTHENTICATION_METHODS).

However, since the authentication itself failed, it's probably not ALLOW_READ.
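
As a rough illustration of which method list applies to a given access level, here is a small Python sketch of the fallback from the level-specific knob to the DEFAULT one. The function name and the example dictionary are made up for illustration, the DEFAULT value is invented, and HTCondor's built-in defaults and subsystem prefixes (such as "SCHEDD.") are deliberately not modelled.

```python
def auth_methods_for(config, level):
    """Return the method list for `level`, falling back to DEFAULT.

    `config` is a plain dict standing in for the HTCondor configuration;
    this is an illustration, not HTCondor's actual lookup code.
    """
    for knob in (f"SEC_{level}_AUTHENTICATION_METHODS",
                 "SEC_DEFAULT_AUTHENTICATION_METHODS"):
        value = config.get(knob, "").strip()
        if value:
            # Split the comma-separated list and normalise each name.
            return [m.strip().upper() for m in value.split(",") if m.strip()]
    # Neither knob is set; HTCondor's built-in defaults are not modelled here.
    return []


example = {
    # Invented value, for illustration only.
    "SEC_DEFAULT_AUTHENTICATION_METHODS": "FS, GSI",
    # Same method list as the DAEMON knob in the config dump quoted below.
    "SEC_DAEMON_AUTHENTICATION_METHODS": "GSI,KERBEROS,FS",
}

print("READ   ->", auth_methods_for(example, "READ"))
print("DAEMON ->", auth_methods_for(example, "DAEMON"))
```

With this example dictionary, READ has no knob of its own and so picks up the DEFAULT list, while DAEMON uses its own.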

Picking apart the error message:

>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed:

I assume this is localhost, right?

>> AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|

This is a curious one.  If Arc and the schedd are on the same filesystem, they should be able to communicate via /tmp.  Are you using any "filesystem magic" that might give the schedd and Arc separate /tmp mounts?
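
To show why separate /tmp mounts would produce exactly this lstat failure, here is a minimal Python sketch of the idea behind FS authentication as I understand it: the server names a path, the client creates it, and the server lstat()s it to see who owns it. This is a simplification, not HTCondor's implementation; the function name is hypothetical, and two ordinary temporary directories stand in for the two different views of /tmp.

```python
import os
import tempfile
import uuid


def fs_handshake(server_tmp, client_tmp):
    """Simulate an FS-style handshake.

    Returns the uid of the owner the server sees, or None when the
    server cannot lstat() the path the client created.
    """
    # Server side: pick a fresh name under its view of /tmp.
    name = "FS_" + uuid.uuid4().hex[:9]

    # Client side: create that name under *its* view of /tmp.
    client_path = os.path.join(client_tmp, name)
    os.mkdir(client_path, 0o700)
    try:
        # Server side: check that the entry exists and who owns it.
        try:
            st = os.lstat(os.path.join(server_tmp, name))
        except OSError:
            return None  # the "Unable to lstat(...)" case
        return st.st_uid
    finally:
        os.rmdir(client_path)


with tempfile.TemporaryDirectory() as shared:
    print("shared /tmp:  ", fs_handshake(shared, shared))

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    print("separate /tmp:", fs_handshake(a, b))
```

When both sides see the same directory, the server gets an owner uid back; when each side has its own directory, the lookup fails even though the client did everything right.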

>> AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|

This is probably expected.

>> AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.


Possibly Arc doesn't have X509_USER_PROXY set either?
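
The error string picked apart above is a "|"-separated list of SUBSYSTEM:CODE:message entries, so it can also be split mechanically. A small Python sketch, assuming that layout holds for the whole line (the function name is made up; the sample string is copied from the log):

```python
def parse_error_stack(text):
    """Split a '|'-separated error stack into (subsystem, code, message)."""
    entries = []
    for part in text.split("|"):
        part = part.strip()
        if not part:
            continue  # skip empty pieces, e.g. after a trailing '|'
        subsystem, code, message = part.split(":", 2)
        entries.append((subsystem, int(code), message))
    return entries


sample = (
    "AUTHENTICATE:1002:Failure performing handshake|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5002:Failed to authenticate because the remote (client) side "
    "was not able to acquire its credentials."
)

stack = parse_error_stack(sample)

# Entries not tagged AUTHENTICATE carry the method-specific detail.
for subsystem, code, message in stack:
    if subsystem != "AUTHENTICATE":
        print(f"{subsystem} ({code}): {message}")
```

Filtering out the generic AUTHENTICATE entries leaves just the FS and GSI details, which are the two lines worth chasing.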

Brian

> On Mar 26, 2015, at 5:14 AM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> 
> Hi Brian,
> 
> The messages are from /var/log/condor/SchedLog.
> 
> I figured it might have something to do with local permissions to certificate files etc.
> 
> It seems to be something unique to the scheduler daemon as I'm not experiencing this on the collector or the worker nodes. I've included a dump of all the schedd values in the condor config.(*)
> 
> The machine is definitely configured with a host certificate as it uses it when communicating with other infrastructure and the ARC CE also uses it.
> 
> The permissions of the certificate and key also match up with the other machines.
> 
> Thanks, Iain
> 
> (*)
> ~]# condor_config_val -expand -dump | grep SCHEDD
> ALLOW_NEGOTIATOR_SCHEDD = central-manager@xxxxxxx/*.cern.ch
> COLLECTOR.ALLOW_ADVERTISE_SCHEDD = computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch
> DAEMON_LIST = MASTER, SHARED_PORT, SCHEDD
> GRIDMANAGER_CONTACT_SCHEDD_DELAY = 5
> MAX_NUM_SCHEDD_LOG = 10
> MAX_SCHEDD_LOG = 104857600
> SCHEDD = /usr/sbin/condor_schedd
> SCHEDD.ALLOW_READ =  *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
> SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch
> SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS
> SCHEDD_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address
> SCHEDD_BACKUP_SPOOL = 
> SCHEDD_CRON_NAME = 
> SCHEDD_DAEMON_AD_FILE = /var/lib/condor/spool/.schedd_classad
> SCHEDD_DEBUG = D_PID
> SCHEDD_INTERVAL = 
> SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY = 5
> SCHEDD_LOG = /var/log/condor/SchedLog
> SCHEDD_MAX_FILE_DESCRIPTORS = 4096
> SCHEDD_MIN_INTERVAL = 5
> SCHEDD_NAME = 
> SCHEDD_PREEMPTION_RANK = 
> SCHEDD_PREEMPTION_REQUIREMENTS = 
> SCHEDD_QUERY_WORKERS = 6
> SCHEDD_ROUND_ATTR_ProportionalSetSizeKb = 25%
> SCHEDD_ROUND_ATTR_ResidentSetSize = 25%
> SCHEDD_SEND_VACATE_VIA_TCP = false
> SCHEDD_SUPER_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address.super
> SCHEDDS = schedd@xxxxxxx/*.cern.ch
> SETTABLE_ATTRS_ADVERTISE_SCHEDD = 
> STATISTICS_WINDOW_QUANTUM_SCHEDD = 
> 
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
> Sent: 25 March 2015 18:37
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Configuring a CE/Schedd
> 
> Hi Iain,
> 
> From the message:
> 
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
> 
> The important part of the message is:
> 
> "the remote (client) side was not able to acquire its credentials"
> 
> This indicates that the schedd isn’t using its certificate (or isn’t configured with one).
> 
> Is this from a shadow log?  If so, you don't want to be using any of these methods - you should be using match auth for your setup.  Perhaps that's something which got lost in the merge?
> 
> Brian
> 
>> On Mar 24, 2015, at 1:48 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>> 
>> Hi,
>> 
>> I'm in the process of finalizing the CE/Schedd setup for our pool; we're using Puppet.
>> 
>> I had the CE working and acting as a scheduler with a manual config and decided to move it to the HEP-Puppet/htcondor module.
>> 
>> This is the output I get in SchedLog(*), I've removed the ip but it's the machine's own ip in all instances.
>> 
>> After this it just proceeds to spam condor_write errors until it fills the log file and starts a new one.
>> 
>> The CE is in the certificate mapfile along with all the other hosts and, apart from the ordering of hostnames, a vimdiff shows no difference between the security config file for this machine and the one that the central manager uses.
>> 
>> Has anyone else experienced this issue?
>> 
>> Thanks, Iain
>> 
>> (*)
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:33 -------- Begin starting jobs --------
>> 03/24/15 19:12:33 -------- Done starting jobs --------
>> 03/24/15 19:13:14 Received a superuser command
>> 03/24/15 19:13:14 This process has a valid certificate & key
>> 03/24/15 19:13:14 Failed to read end of message from <MACHINE_IP:34711>; 1280 untouched bytes.
>> 03/24/15 19:13:14 condor_write(): Socket closed when trying to write 13 bytes to <MACHINE_IP:34711>, fd is 15, errno=104 Connection reset by peer
>> 03/24/15 19:13:14 Buf::write(): condor_write() failed
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711> in non-blocking mode
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711>
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 AUTHENTICATE: handshake failed!
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>> 
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> 
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 