
Re: [HTCondor-users] Configuring a CE/Schedd



Hi Luke,

Yes, that's the case with ARC: it uses the condor binaries.

The CERTIFICATE_MAPFILE maps the hostnames to a canonical representation of their function.
The schedd ALLOW_READ and ALLOW_WRITE lists are built from those canonical names.
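
For anyone following along, the mapping step can be sketched roughly as below. This is only an illustration in Python: the map file entries, the certificate DNs and the helper name canonical_name are made up for the example (the site's real map file is not shown in this thread). It assumes each entry is an authentication method, a regular expression and a canonical name, with the first matching entry winning.

```python
import re

# Hypothetical entries: (auth method, regex on the authenticated name, canonical name).
# None of these patterns come from the real site configuration.
MAPFILE = [
    ("GSI", r"^/DC=ch/DC=cern/OU=computers/CN=ce\d+\.cern\.ch$", "computing-element"),
    ("GSI", r"^/DC=ch/DC=cern/OU=computers/CN=wn\d+\.cern\.ch$", "worker-node"),
    ("FS", r"^(.*)$", r"\1"),  # pass local user names through unchanged
]


def canonical_name(method, authenticated_name):
    """Return the canonical name for an authenticated identity, or None."""
    for entry_method, pattern, canonical in MAPFILE:
        if entry_method != method:
            continue
        match = re.search(pattern, authenticated_name)
        if match:
            # expand() substitutes back-references such as \1 in the canonical name
            return match.expand(canonical)
    return None
```

The name returned here is the kind of name the ALLOW_READ and ALLOW_WRITE entries are written in terms of, which is why a host missing from the map file can authenticate and still be refused.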

It appears to have been an approach implemented temporarily by my predecessor whilst they came up with a better approach.

It's actually something we're planning to spend time on, to make it less of a pain point.

As it happens, it appears to have been down to a misconfiguration in the ALLOW_READ and ALLOW_WRITE settings. It now works and I've been able to submit jobs again, so thank you to both you and Brian for your advice and suggestions.

Thanks,

Iain

From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of L Kreczko [L.Kreczko@xxxxxxxxxxxxx]
Sent: 27 March 2015 12:14
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Configuring a CE/Schedd

Hi Iain,

If the ARC is the only one that is using that scheduler, then the messages will disappear. But it does not mean that the ARC is causing it.
AFAIK the ARC will call condor_submit to submit the jobs to the cluster, so you should see similar messages with any condor job you submit.
Unfortunately I do not have the logs from the time I was experiencing that error, but they looked very similar.
You should test the permissions. I am a bit confused about the users you mention in the allow statement:
SCHEDD.ALLOW_READ =  *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch
The equivalent on my site contains only 'condor' and 'condor_pool' (the current Puppet module would configure it as such). I run condor as the condor user, not root; back when I was running as root, these lines contained 'root'. But then again, I don't think I fully understand that part of the config.

What does
condor_ping -addr "<127.0.0.1:9618>" -table READ WRITE DAEMON
give you on the CE?
For me it looks like:
         Instruction Authentication Encryption Integrity Decision Identity
                READ             FS   BLOWFISH       MD5    ALLOW condor@xxxxxxxxxxxxxx
               WRITE             FS   BLOWFISH       MD5    ALLOW condor@xxxxxxxxxxxxxx
              DAEMON             FS   BLOWFISH       MD5    ALLOW condor@xxxxxxxxxxxxxx
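
If you want to check that table from a script rather than by eye, something like the following Python sketch may help. It assumes the whitespace-separated six-column layout shown above and that every row has all six columns; the function names are my own, and the sample text uses a placeholder identity and a made-up DENY row rather than real output.

```python
def parse_ping_table(text):
    """Turn the condor_ping -table output into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def not_allowed(rows):
    """Return the authorization levels whose Decision column is anything but ALLOW."""
    return [row.get("Instruction") for row in rows if row.get("Decision") != "ALLOW"]


# Placeholder sample: the identity and the DENY row are invented for illustration.
SAMPLE = """\
         Instruction Authentication Encryption Integrity Decision Identity
                READ             FS   BLOWFISH       MD5    ALLOW condor@example.org
               WRITE             FS   BLOWFISH       MD5     DENY condor@example.org
"""

if __name__ == "__main__":
    print(not_allowed(parse_ping_table(SAMPLE)))
```

In practice you would feed it the captured output of the condor_ping command above instead of SAMPLE.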

Cheers,
Luke


On 27 March 2015 at 09:31, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
Hi Brian,

I presume it must be ARC as when I turned ARC off the messages didn't appear.

Just for testing purposes I attempted to open it up so that everything was available on SCHEDD.ALLOW_WRITE/READ

SCHEDD.ALLOW_READ = *@cern.ch/ce-iain.cern.ch, *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,*@cern.ch/ce-iain.cern.ch, *@cern.ch/ce501.cern.ch
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS

Yes, that is indeed the CE's own IP.

I took a look in /tmp and there are no FS* files listed. Is it condor that creates them? Could it be a permissions issue? (See [1] for an example)

I have CONDOR_IDS set to 0.0 as well.

I'm not aware of any /tmp filesystem trickery, it's a virtual node.[2]

In the [common] section of arc.conf it defines the following:

x509_user_key="/etc/grid-security/hostkey.pem"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_cert_dir="/etc/grid-security/certificates"
gridmap="/etc/grid-security/voms-grid-mapfile"

Those files all have the correct permissions on them.

Thanks Iain.

[1]
03/27/15 10:14:32 (pid:25852) Received a superuser command
03/27/15 10:14:32 (pid:25852) condor_write(): Socket closed when trying to write 13 bytes to <128.142.132.67:32931>, fd is 15, errno=104 Connection reset by peer
03/27/15 10:14:32 (pid:25852) Buf::write(): condor_write() failed
03/27/15 10:14:32 (pid:25852) AUTHENTICATE: handshake failed!
03/27/15 10:14:32 (pid:25852) DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXX8epCBM)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.

[2]
~]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=8154428k,nr_inodes=2038607,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0
/dev/mapper/VolGroup00-LogVol00 / ext4 rw,seclabel,relatime,barrier=1,data="" 0 0
none /selinux selinuxfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=8154428k,nr_inodes=2038607,mode=755 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/vda1 /boot ext4 rw,seclabel,relatime,barrier=1,data="" 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
AFS /afs afs rw,relatime 0 0

________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
Sent: 26 March 2015 15:30
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Configuring a CE/Schedd

Hi Iain,

So, if this is in the schedd log, it's likely Arc trying to contact the Schedd?

That means we're looking at the ALLOW_READ statement and the SEC_*_AUTHENTICATION_METHODS (DEFAULT or READ).

However, since the authentication itself failed, it's probably not ALLOW_READ.

Picking apart the error message:

>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed:

I assume this is localhost, right?

>> AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|

This is a curious one.  If Arc and the schedd are on the same filesystem, they should be able to communicate via /tmp.  Are you using any "filesystem magic" that might make the schedd and arc have unique /tmp mounts?
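
To illustrate why a shared /tmp matters, here is a rough Python sketch of the idea behind FS authentication as I understand it: the daemon names a path, the client creates it, and the daemon lstat()s it and trusts whoever owns it. The function name, the way the path is chosen and the client callables are assumptions made for the example, not HTCondor's actual code.

```python
import os
import stat
import tempfile


def fs_style_check(shared_dir, client, expected_uid):
    """Sketch of an FS-style handshake; returns True if the client proved its uid."""
    # Daemon side: pick a fresh path name (sketch only, not race-free).
    fd, path = tempfile.mkstemp(prefix="FS_", dir=shared_dir)
    os.close(fd)
    os.remove(path)

    # Client side: asked to create that path in what it believes is the same directory.
    client(path)

    # Daemon side: lstat the path; if the client wrote somewhere else, this fails,
    # which corresponds to the "Unable to lstat(/tmp/FS_XXX...)" message.
    try:
        info = os.lstat(path)
    except FileNotFoundError:
        return False

    # Clean up whatever the client created, then compare the owner.
    if stat.S_ISDIR(info.st_mode):
        os.rmdir(path)
    else:
        os.remove(path)
    return info.st_uid == expected_uid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        print(fs_style_check(shared, os.mkdir, os.getuid()))        # client sees the same dir
        print(fs_style_check(shared, lambda p: None, os.getuid()))  # client never creates it
```

The second call mimics a client whose /tmp is not the one the schedd is looking at, or a client that gave up before creating the path.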

>> AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|

This is probably expected.

>> AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.


Possibly Arc doesn't have X509_USER_PROXY set either?
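
The same picking apart can be done mechanically, since the error is a '|'-separated stack of SUBSYSTEM:CODE:text entries. A minimal Python sketch, assuming that layout holds for every entry; the function names are my own and the message is the one from the SchedLog above.

```python
MESSAGE = (
    "AUTHENTICATE:1002:Failure performing handshake|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5002:Failed to authenticate because the remote (client) side "
    "was not able to acquire its credentials."
)


def split_auth_errors(message):
    """Split the error stack into (subsystem, code, text) tuples."""
    entries = []
    for part in message.split("|"):
        part = part.strip()
        if not part:
            continue
        # maxsplit=2 keeps any further ':' characters inside the text field
        subsystem, code, text = part.split(":", 2)
        entries.append((subsystem, int(code), text))
    return entries


def method_specific(entries):
    """Keep only entries reported by an individual method (FS, GSI, ...)."""
    return [entry for entry in entries if entry[0] != "AUTHENTICATE"]


if __name__ == "__main__":
    for subsystem, code, text in method_specific(split_auth_errors(MESSAGE)):
        print(f"{subsystem} [{code}]: {text}")
```

Filtering out the generic AUTHENTICATE entries leaves just the FS and GSI lines, which are the two that say why each method failed.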

Brian

> On Mar 26, 2015, at 5:14 AM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>
> Hi Brian,
>
> The messages are from /var/log/condor/SchedLog.
>
> I figured it might have something to do with local permissions to certificate files etc.
>
> It seems to be something unique to the scheduler daemon as I'm not experiencing this on the collector or the worker nodes. I've included a dump of all the schedd values in the condor config.(*)
>
> The machine is definitely configured with a host certificate as it uses it when communicating with other infrastructure and the ARC CE also uses it.
>
> The permissions of the certificate and key also match up with the other machines.
>
> Thanks, Iain
>
> (*)
> ~]# condor_config_val -expand -dump | grep SCHEDD
> ALLOW_NEGOTIATOR_SCHEDD = central-manager@xxxxxxx/*.cern.ch
> COLLECTOR.ALLOW_ADVERTISE_SCHEDD = computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch
> DAEMON_LIST = MASTER, SHARED_PORT, SCHEDD
> GRIDMANAGER_CONTACT_SCHEDD_DELAY = 5
> MAX_NUM_SCHEDD_LOG = 10
> MAX_SCHEDD_LOG = 104857600
> SCHEDD = /usr/sbin/condor_schedd
> SCHEDD.ALLOW_READ =  *@cern.ch/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch,schedd@xxxxxxx/*.cern.ch,worker-node@xxxxxxx/*.cern.ch
> SCHEDD.ALLOW_WRITE = *@fsauth/ce501.cern.ch,central-manager@xxxxxxx/*.cern.ch,computing-element@xxxxxxx/*.cern.ch
> SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = GSI,KERBEROS,FS
> SCHEDD_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address
> SCHEDD_BACKUP_SPOOL =
> SCHEDD_CRON_NAME =
> SCHEDD_DAEMON_AD_FILE = /var/lib/condor/spool/.schedd_classad
> SCHEDD_DEBUG = D_PID
> SCHEDD_INTERVAL =
> SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY = 5
> SCHEDD_LOG = /var/log/condor/SchedLog
> SCHEDD_MAX_FILE_DESCRIPTORS = 4096
> SCHEDD_MIN_INTERVAL = 5
> SCHEDD_NAME =
> SCHEDD_PREEMPTION_RANK =
> SCHEDD_PREEMPTION_REQUIREMENTS =
> SCHEDD_QUERY_WORKERS = 6
> SCHEDD_ROUND_ATTR_ProportionalSetSizeKb = 25%
> SCHEDD_ROUND_ATTR_ResidentSetSize = 25%
> SCHEDD_SEND_VACATE_VIA_TCP = false
> SCHEDD_SUPER_ADDRESS_FILE = /var/lib/condor/spool/.schedd_address.super
> SCHEDDS = schedd@xxxxxxx/*.cern.ch
> SETTABLE_ATTRS_ADVERTISE_SCHEDD =
> STATISTICS_WINDOW_QUANTUM_SCHEDD =
>
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
> Sent: 25 March 2015 18:37
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Configuring a CE/Schedd
>
> Hi Iain,
>
> From the message:
>
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>
> The important part of the message is:
>
> "the remote (client) side was not able to acquire its credentials"
>
> This indicates that the schedd isn’t using its certificate (or isn’t configured with one).
>
> Is this from a shadow log?  If so, you don't want to be using any of these methods - you should be using match auth for your setup.  Perhaps that's something which got lost in the merge?
>
> Brian
>
>> On Mar 24, 2015, at 1:48 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>>
>> Hi,
>>
>> I'm in the process of finalizing the CE/Schedd setup for our pool; we're using Puppet.
>>
>> I had the CE working and acting as a scheduler with a manual config and decided to move it to the HEP-Puppet/htcondor module.
>>
>> This is the output I get in SchedLog(*), I've removed the ip but it's the machine's own ip in all instances.
>>
>> After this it just proceeds to spam condor_write errors until it fills the log file and starts a new one.
>>
>> The CE is in the certificate mapfile along with all the other hosts and, apart from the ordering of hostnames, a vimdiff shows no difference between the security config file for this machine and the one that the central manager uses.
>>
>> Has anyone else experienced this issue?
>>
>> Thanks, Iain
>>
>> (*)
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'ScheddIpAddr' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' <MACHINE_IP:9618?noUDP&sock=17305_aee5_3> == <MACHINE_IP:9618?noUDP&sock=17305_aee5_3>, but old logic couldn't find the command port for outbound interface MACHINE_IP.
>> 03/24/15 19:12:28 Address rewriting: Warning: attribute 'MyAddress' address in ad (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>) == command socket (<MACHINE_IP:9618?noUDP&sock=17305_aee5_3>), but old logic couldn't find that command socket in its list.
>> 03/24/15 19:12:33 -------- Begin starting jobs --------
>> 03/24/15 19:12:33 -------- Done starting jobs --------
>> 03/24/15 19:13:14 Received a superuser command
>> 03/24/15 19:13:14 This process has a valid certificate & key
>> 03/24/15 19:13:14 Failed to read end of message from <MACHINE_IP:34711>; 1280 untouched bytes.
>> 03/24/15 19:13:14 condor_write(): Socket closed when trying to write 13 bytes to <MACHINE_IP:34711>, fd is 15, errno=104 Connection reset by peer
>> 03/24/15 19:13:14 Buf::write(): condor_write() failed
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711> in non-blocking mode
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 condor_read(): Socket closed when trying to read 5 bytes from <MACHINE_IP:34711>
>> 03/24/15 19:13:14 IO: EOF reading packet header
>> 03/24/15 19:13:14 AUTHENTICATE: handshake failed!
>> 03/24/15 19:13:14 DC_AUTHENTICATE: required authentication of 128.142.132.67 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXWRRJqi)|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/



--
*********************************************************
  Dr Lukasz Kreczko            +44 (0)117 928 8724  
  CMS Group
  School of Physics
  University of Bristol
*********************************************************