Re: [HTCondor-users] submitting jobs with API
- Date: Tue, 19 Dec 2017 21:14:00 -0600
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] submitting jobs with API
This is definitely an issue with the security subsystem, not the python API. I suspect that you can reproduce it via the command line tools with something like:
condor_submit -remote <schedd name> -pool <collector name> submit_file
Sometimes it's a bit simpler to increase the logging via the CLI (the error messages don't always come back in a usable manner for the python API).
If you can reproduce it with condor_submit, try:
_condor_TOOL_DEBUG=D_SECURITY,D_FULLDEBUG condor_submit -debug -remote <schedd name> -pool <collector name> submit_file
That should provide a full readout of the security handshake.
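If it helps to script that, here is a minimal Python sketch that assembles the same invocation. The helper names build_debug_submit and run_debug_submit are made up for illustration; only the environment variable, the flags, and their order come from the command above, and the schedd, collector, and submit file names are whatever you pass in.

```python
import os
import subprocess


def build_debug_submit(schedd_name, collector_name, submit_file):
    """Return (argv, env) for the debug condor_submit command shown above.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    # Copy the current environment and turn on security debugging for the tool.
    env = dict(os.environ)
    env["_condor_TOOL_DEBUG"] = "D_SECURITY,D_FULLDEBUG"
    argv = [
        "condor_submit", "-debug",
        "-remote", schedd_name,
        "-pool", collector_name,
        submit_file,
    ]
    return argv, env


def run_debug_submit(schedd_name, collector_name, submit_file):
    """Run the command and return the completed process (needs condor_submit on PATH)."""
    argv, env = build_debug_submit(schedd_name, collector_name, submit_file)
    # Merge stderr into stdout so the debug output is captured in one stream.
    return subprocess.run(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
```

Calling run_debug_submit(...) and printing the result's stdout attribute should give the same text you would see in the terminal.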
The puzzling thing is that this line:
use SECURITY : HOST_BASED
in your server config (oh - did you do a condor_reconfig after the change?) should theoretically disable the attempts to do GSI-based security negotiation. However, the logfiles clearly show it is being attempted.
So -- this suggests something slightly wrong with the schedd configuration, but it's not clear what is wrong yet.
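To see at a glance which methods the schedd actually tried, the "reason for authentication failure" string from the SchedLog quoted below can be split apart. This is only a rough sketch: it assumes the entries are separated by "|" and each has the form SUBSYSTEM:CODE:message, which is what the quoted log looks like, and the function names are made up for illustration.

```python
def parse_auth_failure(reason):
    """Split a failure-reason string into (subsystem, code, message) tuples.

    Assumes '|'-separated entries of the form SUBSYSTEM:CODE:message,
    as seen in the quoted SchedLog; this is not an official format spec.
    """
    entries = []
    for part in reason.split("|"):
        fields = part.strip().split(":", 2)  # the message may itself contain ':'
        if len(fields) == 3 and fields[1].isdigit():
            entries.append((fields[0], int(fields[1]), fields[2].strip()))
        elif entries:
            # Not a well-formed entry: treat it as a continuation of the previous message.
            subsystem, code, message = entries[-1]
            entries[-1] = (subsystem, code, message + "|" + part)
    return entries


def methods_tried(entries):
    """Return the authentication methods named in 'Failed to authenticate using X' entries."""
    prefix = "Failed to authenticate using "
    return [msg[len(prefix):] for sub, _, msg in entries
            if sub == "AUTHENTICATE" and msg.startswith(prefix)]


# The reason string from the quoted SchedLog, with the line wrapping removed.
reason = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5003:Failed to authenticate. Globus is reporting error (851968:152). "
    "There is probably a problem with your credentials. "
    "(Did you run grid-proxy-init?)|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXX4oulm8)"
)
entries = parse_auth_failure(reason)
print(methods_tried(entries))  # ['GSI', 'KERBEROS', 'FS']
```

GSI showing up in that list is what makes me think the host-based setting is not taking effect on the schedd.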
> On Dec 19, 2017, at 11:25 AM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> This is what was logged in SchedLog in the submit attempt. Note I have
> these security related settings in my config file. Do I need other
> settings to allow this to work?
> use SECURITY : HOST_BASED
> ALLOW_WRITE = 192.168.*
> ALLOW_READ = 192.168.*
> 12/19/17 11:13:13 (pid:32123) authenticate_self_gss: acquiring self
> credentials failed. Please check your Condor configuration file if
> this is a server process. Or the user environment variable if this is
> a user process.
> GSS Major Status: General failure
> GSS Minor Status Error Chain:
> globus_gsi_gssapi: Error with GSI credential
> globus_gsi_gssapi: Error with gss credential handle
> globus_credential: Valid credentials could not be found in any of the
> possible locations specified by the credential search order.
> Valid credentials could not be found in any of the possible locations
> specified by the credential search order.
> Attempt 1
> globus_credential: Error reading host credential
> globus_sysconfig: Could not find a valid certificate file: The host
> cert could not be found in:
> 1) env. var. X509_USER_CERT
> 2) /etc/grid-security/hostcert.pem
> 3) $GLOBUS_LOCATION/etc/hostcert.pem
> 4) $HOME/.globus/hostcert.pem
> The host key could not be found in:
> 1) env. var. X509_USER_KEY
> 2) /etc/grid-security/hostkey.pem
> 3) $GLOBUS_LOCATION/etc/hostkey.pem
> 4) $HOME/.globus/hostkey.pem
> Attempt 2
> globus_credential: Error reading proxy credential
> globus_sysconfig: Could not find a valid proxy certificate file location
> globus_sysconfig: Error with key filename
> globus_sysconfig: File does not exist: /tmp/x509up_u0 is not a valid file
> Attempt 3
> globus_credential: Error reading user credential
> globus_sysconfig: Error with certificate filename: The user cert could
> not be found in:
> 1) env. var. X509_USER_CERT
> 2) $HOME/.globus/usercert.pem
> 3) $HOME/.globus/usercred.p12
> 12/19/17 11:13:13 (pid:32123) DC_AUTHENTICATE: authentication of
> <192.168.10.15:45684> did not result in a valid mapped user name,
> which is required for this command (1112 QMGMT_WRITE_CMD), so
> 12/19/17 11:13:13 (pid:32123) DC_AUTHENTICATE: reason for
> authentication failure: AUTHENTICATE:1003:Failed to authenticate with
> any method|AUTHENTICATE:1004:Failed to authenticate using
> GSI|GSI:5003:Failed to authenticate. Globus is reporting error
> (851968:152). There is probably a problem with your credentials.
> (Did you run grid-proxy-init?)|AUTHENTICATE:1004:Failed to
> authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate
> using FS|FS:1004:Unable to lstat(/tmp/FS_XXX4oulm8)
> On Tue, Dec 19, 2017 at 10:33 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
>> I don't have a solution, but hopefully I can help get the ball rolling.
>> Without modifying my schedd config, I tried doing a remote submit following
>> the same steps, which failed with the same error. The error is a little
>> misleading/light on details; it's likely an authentication problem from not
>> being on the same system as the schedd. Doing essentially the same thing
>> using the client tools gives more info:
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> RuntimeError: Failed to connect to schedd.
>> $ condor_submit test.submit -remote condor-el7.test
>> Submitting job(s)
>> ERROR: Failed to connect to queue manager condor-el7.test
>> AUTHENTICATE:1003:Failed to authenticate with any method
>> AUTHENTICATE:1004:Failed to authenticate using GSI
>> GSI:5003:Failed to authenticate. Globus is reporting error (851968:50).
>> There is probably a problem with your credentials. (Did you run
>> AUTHENTICATE:1004:Failed to authenticate using KERBEROS
>> AUTHENTICATE:1004:Failed to authenticate using FS
>> You should see more details in SchedLog on your submit host.
>> Hopefully someone more knowledgeable about setting up the schedd to accept
>> remote job submissions can chime in. (ENABLE_SOAP and ENABLE_WEB_SERVER are
>> probably not needed.)
>> On Tue, Dec 19, 2017 at 9:02 AM, Larry Martell <larry.martell@xxxxxxxxx>
>>> On Tue, Dec 19, 2017 at 9:29 AM, Larry Martell <larry.martell@xxxxxxxxx>
>>>> I am doing this:
>>>> import htcondor
>>>> import classad
>>>> condor_host = '192.168.10.2'
>>>> coll = htcondor.Collector(condor_host)
>>>> schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd)
>>>> schedd = htcondor.Schedd(schedd_ad)
>>>> ad = classad.ClassAd()
>>>> # set up ad
>>>> id = schedd.submit(ad)
>>>> RuntimeError: 'Failed to connect to schedd.'
>>>> On 192.168.10.2:
>>>> 4 S condor 32054 1 0 80 0 - 18610 poll_s Dec12 ?
>>>> 00:00:15 /usr/sbin/condor_master -f
>>>> 4 S root 32112 32054 0 80 0 - 6652 poll_s Dec12 ?
>>>> 00:07:51 condor_procd -A /var/run/condor/procd_pipe -L
>>>> /var/log/condor/ProcLog -R 1000000 -S 60 -C 986
>>>> 4 S condor 32113 32054 0 80 0 - 13531 poll_s Dec12 ?
>>>> 00:00:44 condor_shared_port -f
>>>> 4 S condor 32117 32054 0 80 0 - 20511 poll_s Dec12 ?
>>>> 00:07:46 condor_collector -f
>>>> 4 S condor 32122 32054 0 80 0 - 15856 poll_s Dec12 ?
>>>> 00:31:40 condor_negotiator -f
>>>> 4 S condor 32123 32054 0 80 0 - 18808 poll_s Dec12 ?
>>>> 00:00:31 condor_schedd -f
>>>> From the machine running the python code:
>>>> $ nmap -p 9618 192.168.10.2
>>>> Starting Nmap 6.40 ( http://nmap.org ) at 2017-12-19 09:28 EST
>>>> Nmap scan report for 192.168.10.2
>>>> Host is up (0.00018s latency).
>>>> PORT STATE SERVICE
>>>> 9618/tcp open condor
>>>> Am I doing something wrong or missing something?
>>> Also let me add I have these settings in the config file:
>>> ENABLE_SOAP = True
>>> ENABLE_WEB_SERVER = True