
Re: [Condor-users] Unable to start EC2 instance



I believe with the new ec2_gahp you need "grid_resource = ec2 https://ec2.amazonaws.com/";
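As a rough illustration of that change (a sketch, not something shipped with Condor): the shell function below rewrites only the grid type on the grid_resource line of a submit description and leaves every other line alone. The name switch_grid_type and the file names in the usage line are made up for this example. Whether the remaining amazon_* submit commands also need ec2_* counterparts depends on the release, so check the manual for the version you built.

```shell
# Hypothetical helper: print the submit description in "$1" with the
# grid type on the grid_resource line changed from "amazon" to "ec2".
# The service URL and all other lines are passed through unchanged.
switch_grid_type() {
    sed -E 's/^([[:space:]]*grid_resource[[:space:]]*=[[:space:]]*)amazon([[:space:]])/\1ec2\2/' "$1"
}
```

Usage would look like "switch_grid_type job.submit > job-ec2.submit", followed by a manual review of the result before submitting it.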

Best,


matt

On 06/23/2011 07:30 PM, Philip Papadopoulos wrote:
Still no love....
I git cloned the head of the condor tree, rebuilt, and
copied condor_submit, condor_gridmanager, and ec2_gahp into bin, sbin, and sbin respectively.

I changed the condor config to use the new gahp.
$ condor_config_val -dump | grep AMAZON
AMAZON_GAHP = $(SBIN)/ec2_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
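
When binaries have been copied into place by hand, one quick sanity check is to confirm that the path AMAZON_GAHP expands to is a regular, executable file. The helper below is a minimal sketch: check_gahp is a hypothetical name, and it tests only file type and permissions, not whether the binary is the intended GAHP.

```shell
# Hypothetical check: report whether "$1" names a regular file that
# the current user can execute. Returns non-zero when it does not.
check_gahp() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1"
        return 1
    fi
}
# On a node with Condor installed, the path can be taken from the config:
#   check_gahp "$(condor_config_val AMAZON_GAHP)"
```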

And then submitted with
universe = grid
grid_resource = amazon https://ec2.amazonaws.com/
periodic_release = NumHolds < 3
+NumHolds = 0
periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
executable = RunEC2VM
amazon_keypair_file = keypair.$(Process)

amazon_ami_id = ami-4ed12d27
amazon_instance_type = m1.large
amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
amazon_private_key = /home/phil/.ec2/pk.pem
amazon_public_key = /home/phil/.ec2/cert.pem

queue 1

as before.

GridManager.log shows
06/23/11 16:21:27 Setting maximum accepts per cycle 4.
06/23/11 16:21:29 [27034] ================================> AmazonJob::AmazonJob 1
06/23/11 16:21:29 [27034] Found job 2.0 --- inserting
06/23/11 16:21:29 [27034] gahp server not up yet, delaying ping
06/23/11 16:21:29 [27034] (2.0) doEvaluateState called: gmState GM_INIT, condorState 1
06/23/11 16:21:29 [27034] GAHP server pid = 27038
06/23/11 16:21:34 [27034] ERROR "Bad AMAZON_VM_STATUS_ALL Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp

From this same node, I can use the EC2 native tools to start, stop, and query
instances, e.g.
$ ec2-describe-instances
RESERVATION     r-ef3f0283      126101316194    default
INSTANCE        i-d91433b7      ami-4ed12d27
ec2-50-17-131-129.compute-1.amazonaws.com
ip-10-110-235-155.ec2.internal      running         0
m1.large        2011-06-23T23:05:41+0000        us-east-1c
aki-e5c1218c       monitoring-disabled      50.17.131.129
10.110.235.155
instance-store                                  paravirtual
xen             sg-427ca02b


and

ec2-terminate-instances i-d91433b7
INSTANCE        i-d91433b7      running shutting-down


-P







On Thu, Jun 23, 2011 at 7:41 AM, Philip Papadopoulos
<philip.papadopoulos@xxxxxxxxx>
wrote:


    I will try that when I get in this AM (I'm on the west coast) and
    report back.
    Thanks,
    Phil

    On Thu, Jun 23, 2011 at 7:34 AM, Timothy St. Clair
    <tstclair@xxxxxxxxxx> wrote:

        You could extract the condor_submit + gridmanager + ec2_gahp..

        Cheers,
        Tim

        On Thu, 2011-06-23 at 07:26 -0700, Philip Papadopoulos wrote:
         > Do I need all of condor 7.7 or can I just extract the ec2_gahp
         > executable from it?
         >
         > Thanks,
         > Phil
         >
         >
         >
         > On Thu, Jun 23, 2011 at 4:56 AM, Matthew Farrellee <matt@xxxxxxxxxx>
         > wrote:
         >
         >         On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:
         >
         >
         >                 Trying out Condor 7.6.1 -- installed via the
         >                 rhap.stripped.tar.gz
         >
         >                 I get the following in my GAHP log.
         >                 06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
         >                 06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
         >                 "End of file or no input: Operation interrupted or timed out"
         >                 Detail: [no detail]
         >
         >                 06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
         >                 06/22/11 09:42:08 EOF reached on pipe 0
         >                 06/22/11 09:42:08 stdin buffer closed, exiting
         >                 06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
         >                 "End of file or no input: Operation interrupted or timed out"
         >                 Detail: [no detail]
         >
         >                 06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
         >                 06/22/11 09:48:33 EOF reached on pipe 0
         >                 06/22/11 09:48:33 stdin buffer closed, exiting
         >                 06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
         >                 "End of file or no input: Operation interrupted or timed out"
         >                 Detail: [no detail]
         >
         >                 06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
         >
         >
         >                 The submission file is simple:
         >                 universe = grid
         >                 grid_resource = amazon https://ec2.amazonaws.com/
         >                 periodic_release = NumHolds < 3
         >                 +NumHolds = 0
         >                 periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
         >                 executable = RunEC2VM
         >                 amazon_keypair_file = keypair.$(Process)
         >
         >                 amazon_ami_id = ami-4ed12d27
         >                 amazon_instance_type = m1.large
         >                 amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
         >                 amazon_private_key = /home/phil/.ec2/pk.pem
         >                 amazon_public_key = /home/phil/.ec2/cert.pem
         >
         >                 queue 1
         >
         >
         >                 And the condor_config_val output (the salient ones, I think):
         >                 $ condor_config_val -dump | grep -i amazon
         >                 AMAZON_GAHP = $(SBIN)/amazon_gahp
         >                 AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
         >                 GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
         >
         >                 and
         >                 $ condor_config_val -dump | grep -i ssl
         >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
         >                 SOAP_SSL_SKIP_HOST_CHECK = True
         >
         >                 I've tried both with and without
         >                 SOAP_SSL_SKIP_HOST_CHECK.
         >                 The SOAP_SSL_CA_FILE file exists.
         >                 If I try WITHOUT
         >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
         >                 then I get
         >                 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
         >                 "SSL_ERROR_SSL error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
         >                 Detail: SSL connect failed in tcp_connect()
         >
         >
         >                 Right now I'm flummoxed.
         >
         >                 Thanks,
         >                 Phil
         >
         >                 --
         >                 Philip Papadopoulos, PhD
         >                 University of California, San Diego
         >
         >                 858-822-3628 (Ofc)
         >                 619-331-2990 (Fax)
         >
         >         Phil,
         >
         >         Assuming you aren't getting those errors 100% of the time, and
         >         you're actually talking to AWS's EC2 service, I've seen similar
         >         intermittent issues in the past. They came and went by days.
         >         After much investigation, I eventually chalked them up to
         >         transient issues with AWS's EC2 SOAP interface. The amazon_gahp
         >         was Condor's first means to interact with EC2 and was written
         >         to the (then popular) SOAP interface. Over the years the EC2
         >         Query interface has apparently taken hold as the interface of
         >         choice, with many EC2 clones not supporting SOAP. In response,
         >         the ec2_gahp, available in 7.7, has been written against the
         >         Query interface. You should try it out, especially on a day
         >         when the SOAP interface is failing, so that we might get a
         >         better handle on whether the issue is truly SOAP vs. Query.
         >
         >         Best,
         >
         >
         >         matt
         >
         >
         >
         > --
         > Philip Papadopoulos, PhD
         > University of California, San Diego
         > 858-822-3628 (Ofc)
         > 619-331-2990 (Fax)
         > _______________________________________________
         > Condor-users mailing list
         > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
         > subject: Unsubscribe
         > You can also unsubscribe by visiting
         > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
         >
         > The archives can be found at:
         > https://lists.cs.wisc.edu/archive/condor-users/





    --
    Philip Papadopoulos, PhD
    University of California, San Diego
    858-822-3628 (Ofc)
    619-331-2990 (Fax)




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)