Re: [Condor-users] Unable to start EC2 instance



You'll need a 7.7 condor_submit, condor_gridmanager and ec2_gahp.
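
For reference, a 7.7 submit file for the ec2 grid type might look roughly
like the sketch below, adapted from the amazon submit file quoted further
down. The ec2_gahp talks to the Query interface, which authenticates with
an access key ID and a secret access key rather than the X.509 certificate
and private key used for SOAP. The two key file paths are hypothetical
placeholders, and the command names are worth checking against the 7.7
manual before relying on them.

```
universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/
executable = RunEC2VM

# Hypothetical paths: files holding the access key ID and secret access key
ec2_access_key_id = /home/phil/.ec2/access_key_id
ec2_secret_access_key = /home/phil/.ec2/secret_access_key

ec2_ami_id = ami-4ed12d27
ec2_instance_type = m1.large
ec2_keypair_file = keypair.$(Process)
ec2_user_data = condor:landphil.rocksclusters.org:40000:50000

queue 1
```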

Best,


matt

On 06/23/2011 10:26 AM, Philip Papadopoulos wrote:
Do I need all of condor 7.7 or can I just extract the ec2_gahp
executable from it?
Thanks,
Phil


On Thu, Jun 23, 2011 at 4:56 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:

    On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:

        Trying out Condor 7.6.1 -- installed via the rhap.stripped.tar.gz

        I get the following in my GAHP log.
        06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got
        error(code:Client,
        msg:End of file or no input: Operation interrupted or timed out
        06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault:
        SOAP-ENV:Client [no subcode]
        "End of file or no input: Operation interrupted or timed out"
        Detail: [no detail]

        06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got
        error(code:Client,
        msg:End of file or no input: Operation interrupted or timed out
        06/22/11 09:42:08 EOF reached on pipe 0
        06/22/11 09:42:08 stdin buffer closed, exiting
        06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault:
        SOAP-ENV:Client [no subcode]
        "End of file or no input: Operation interrupted or timed out"
        Detail: [no detail]

        06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got
        error(code:Client,
        msg:End of file or no input: Operation interrupted or timed out
        06/22/11 09:48:33 EOF reached on pipe 0
        06/22/11 09:48:33 stdin buffer closed, exiting
        06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault:
        SOAP-ENV:Client [no subcode]
        "End of file or no input: Operation interrupted or timed out"
        Detail: [no detail]

        06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got
        error(code:Client,
        msg:End of file or no input: Operation interrupted or timed out


        The submission file is simple:
        universe = grid
        grid_resource = amazon https://ec2.amazonaws.com/
        periodic_release = NumHolds < 3
        +NumHolds = 0
        periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
        executable = RunEC2VM
        amazon_keypair_file = keypair.$(Process)

        amazon_ami_id = ami-4ed12d27
        amazon_instance_type = m1.large
        amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
        amazon_private_key = /home/phil/.ec2/pk.pem
        amazon_public_key = /home/phil/.ec2/cert.pem

        queue 1


        And the condor_config_val output (the salient ones, I think):
        $ condor_config_val -dump | grep -i amazon
        AMAZON_GAHP = $(SBIN)/amazon_gahp
        AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
        GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20

        and
        $ condor_config_val -dump | grep -i ssl
        SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
        SOAP_SSL_SKIP_HOST_CHECK = True

        I've tried both with and without SOAP_SSL_SKIP_HOST_CHECK.
        The file named by SOAP_SSL_CA_FILE exists.
        If I try WITHOUT
        SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
        then I get
          Call to DescribeInstances failed: SOAP 1.1 fault:
        SOAP-ENV:Client [no subcode]
        "SSL_ERROR_SSL error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
        Detail: SSL connect failed in tcp_connect()


        Right now I'm flummoxed.

        Thanks,
        Phil

        --
        Philip Papadopoulos, PhD
        University of California, San Diego
        858-822-3628 (Ofc)
        619-331-2990 (Fax)


    Phil,

    This assumes you aren't getting those errors 100% of the time, and
    that you're actually talking to AWS's EC2 service.

    I've seen similar intermittent issues in the past. They came and
    went by the day. After much investigation, I eventually chalked them
    up to transient issues with AWS's EC2 SOAP interface. The amazon_gahp
    was Condor's first means of interacting with EC2 and was written
    against the (then popular) SOAP interface. Over the years the EC2
    Query interface has apparently taken hold as the interface of choice,
    with many EC2 clones not supporting SOAP. In response, the ec2_gahp,
    available in 7.7, has been written against the Query interface. You
    should try it out, especially on a day when the SOAP interface is
    failing, so that we might get a better handle on whether the issue
    is truly SOAP vs. Query.

    Best,


    matt




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)