
[Condor-users] [Fwd: Re: Unable to start EC2 instance]



For completeness: it looks like I sent this only to Phil.
--- Begin Message ---
Universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/

# Executable in this context is just a label for the job
Executable  = my_ec2_test_job
transfer_executable = false
Log=$(cluster).ec2.log
Iwd=/tmp

# input
ec2_ami_id =
ec2_instance_type =
ec2_security_groups =
ec2_access_key_id = <YOUR_LOC>/ec2.aid
ec2_secret_access_key = <YOUR_LOC>/ec2.key
# optional
#ec2_elastic_ip =

# in upstream src only, not yet released
# ec2_ebs_volumes =
# ec2_availability_zone =

# output (keep in a safe location)
ec2_keypair_file = <YOUR_LOC>/test1.pem

Hope this helps, 
Tim
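
As a rough illustration of the template above, here is how the values from
Phil's earlier amazon-style job might be carried over to the ec2 keywords.
This is a sketch under assumptions, not a confirmed fix for the EC2_VM_START
error: the security group name "default" is taken from the
ec2-describe-instances output quoted further down, ec2_user_data is assumed
to be the counterpart of amazon_user_data, and <YOUR_LOC> remains a
placeholder for wherever the credential files live. The ec2 grid type takes
files holding the access key ID and secret access key, rather than the
pk.pem/cert.pem pair used by the SOAP-based amazon grid type.

```
universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/

# Executable is only a label for an EC2 job
executable = RunEC2VM
transfer_executable = false
log = $(cluster).ec2.log

ec2_ami_id = ami-4ed12d27
ec2_instance_type = m1.large
# assumption: group name taken from the ec2-describe-instances output
ec2_security_groups = default
# placeholder paths: files holding the access key ID and secret access key
ec2_access_key_id = <YOUR_LOC>/ec2.aid
ec2_secret_access_key = <YOUR_LOC>/ec2.key
# assumption: ec2_user_data as the counterpart of amazon_user_data
ec2_user_data = condor:landphil.rocksclusters.org:40000:50000
# file into which the generated keypair is written
ec2_keypair_file = <YOUR_LOC>/test1.pem

queue 1
```

A queue statement is included so the fragment can be handed to condor_submit
as-is once the placeholders are filled in; the gridmanager log should then
show whether the job progresses beyond the keypair step.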

On Thu, 2011-06-23 at 19:40 -0700, Philip Papadopoulos wrote:
> Closer, but not quite there.
> 
> [root@vizagra ~]# tail -f /var/opt/condor/log/GridmanagerLog.phil 
> 06/23/11 19:37:22 [25245] Found job 8.0 --- inserting
> 06/23/11 19:37:22 [25245] gahp server not up yet, delaying ping
> 06/23/11 19:37:22 [25245] (8.0) doEvaluateState called: gmState GM_INIT, condorState 1
> 06/23/11 19:37:22 [25245] GAHP server pid = 25247
> 06/23/11 19:37:28 [25245] resource https://ec2.amazonaws.com/ is now up
> 06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
> 06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
> 06/23/11 19:37:29 [25245] (8.0) doEvaluateState called: gmState GM_DESTROY_KEYPAIR_SUBMIT, condorState 1
> 06/23/11 19:37:32 [25245] (8.0) doEvaluateState called: gmState GM_CREATE_KEYPAIR, condorState 1
> 06/23/11 19:37:32 [25245] ERROR "Bad EC2_VM_START Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
> 
> 
> If you can tell me where to put debug statements in the ec2_gahp
> files, I can do that.
> -P
> 
> 
> On Thu, Jun 23, 2011 at 5:36 PM, Matthew Farrellee <matt@xxxxxxxxxx>
> wrote:
>         I believe with the new ec2_gahp you need
>         "grid_resource = ec2 https://ec2.amazonaws.com/"
>         
>         Best,
>         
>         
>         matt
>         
>         
>         
>         On 06/23/2011 07:30 PM, Philip Papadopoulos wrote:
>         
>                 
>                 Still no love....
>                 I git cloned the head of the condor tree, rebuilt, and
>                 copied condor_submit, condor_gridmanager, and ec2_gahp
>                 into bin, sbin, and sbin respectively.
>                 
>                 I changed the condor config to use the new gahp.
>                 $ condor_config_val -dump | grep AMAZON
>                 AMAZON_GAHP = $(SBIN)/ec2_gahp
>                 AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
>                 GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
>                 
>                 And then submitted with
>                 universe = grid
>                 grid_resource = amazon https://ec2.amazonaws.com/
>                 periodic_release = NumHolds < 3
>                 +NumHolds = 0
>                 periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
>                 executable = RunEC2VM
>                 amazon_keypair_file = keypair.$(Process)
>                 
>                 amazon_ami_id = ami-4ed12d27
>                 amazon_instance_type = m1.large
>                 amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
>                 amazon_private_key = /home/phil/.ec2/pk.pem
>                 amazon_public_key = /home/phil/.ec2/cert.pem
>                 
>                 queue 1
>                 
>                 as before.
>                 
>                 GridManager.log shows
>                 06/23/11 16:21:27 Setting maximum accepts per cycle 4.
>                 06/23/11 16:21:29 [27034] ================================> AmazonJob::AmazonJob 1
>                 06/23/11 16:21:29 [27034] Found job 2.0 --- inserting
>                 06/23/11 16:21:29 [27034] gahp server not up yet, delaying ping
>                 06/23/11 16:21:29 [27034] (2.0) doEvaluateState called: gmState GM_INIT, condorState 1
>                 06/23/11 16:21:29 [27034] GAHP server pid = 27038
>                 06/23/11 16:21:34 [27034] ERROR "Bad AMAZON_VM_STATUS_ALL Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
>                 
>                 From this same node, I can use the EC2-native tools to
>                 start, stop, and query instances, e.g.
>                 $ ec2-describe-instances
>                 RESERVATION     r-ef3f0283      126101316194    default
>                 INSTANCE        i-d91433b7      ami-4ed12d27    ec2-50-17-131-129.compute-1.amazonaws.com       ip-10-110-235-155.ec2.internal  running 0       m1.large        2011-06-23T23:05:41+0000        us-east-1c      aki-e5c1218c    monitoring-disabled     50.17.131.129   10.110.235.155  instance-store  paravirtual     xen     sg-427ca02b
>                 
>                 
>                 and
>                 
>                 ec2-terminate-instances i-d91433b7
>                 INSTANCE        i-d91433b7      running shutting-down
>                 
>                 
>                 -P
>                 
>                 On Thu, Jun 23, 2011 at 7:41 AM, Philip Papadopoulos
>                 <philip.papadopoulos@xxxxxxxxx> wrote:
>                 
>                    I will try that when I get in this AM (I'm on the west coast)
>                    and report back.
>                    Thanks,
>                    Phil
>                 
>                    On Thu, Jun 23, 2011 at 7:34 AM, Timothy St. Clair
>                    <tstclair@xxxxxxxxxx> wrote:
>                 
>                        You could extract the condor_submit + gridmanager + ec2_gahp..
>                 
>                        Cheers,
>                        Tim
>                 
>                        On Thu, 2011-06-23 at 07:26 -0700, Philip Papadopoulos wrote:
>                         > Do I need all of condor 7.7 or can I just extract the ec2_gahp
>                         > executable from it?
>                         >
>                         > Thanks,
>                         > Phil
>                         >
>                         > On Thu, Jun 23, 2011 at 4:56 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>                         >
>                         >         On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:
>                         >
>                         >
>                         >                 Trying out Condor 7.6.1 -- installed via the
>                         >                 rhap.stripped.tar.gz
>                         >
>                         >                 I get the following in my GAHP log.
>                         >                 06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
>                         >                 06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
>                         >                 "End of file or no input: Operation interrupted or timed out"
>                         >                 Detail: [no detail]
>                         >
>                         >                 06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
>                         >                 06/22/11 09:42:08 EOF reached on pipe 0
>                         >                 06/22/11 09:42:08 stdin buffer closed, exiting
>                         >                 06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
>                         >                 "End of file or no input: Operation interrupted or timed out"
>                         >                 Detail: [no detail]
>                         >
>                         >                 06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
>                         >                 06/22/11 09:48:33 EOF reached on pipe 0
>                         >                 06/22/11 09:48:33 stdin buffer closed, exiting
>                         >                 06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
>                         >                 "End of file or no input: Operation interrupted or timed out"
>                         >                 Detail: [no detail]
>                         >
>                         >                 06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
>                         >
>                         >
>                         >                 The submission file is simple:
>                         >                 universe = grid
>                         >                 grid_resource = amazon https://ec2.amazonaws.com/
>                         >                 periodic_release = NumHolds < 3
>                         >                 +NumHolds = 0
>                         >                 periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
>                         >                 executable = RunEC2VM
>                         >                 amazon_keypair_file = keypair.$(Process)
>                         >
>                         >                 amazon_ami_id = ami-4ed12d27
>                         >                 amazon_instance_type = m1.large
>                         >                 amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
>                         >                 amazon_private_key = /home/phil/.ec2/pk.pem
>                         >                 amazon_public_key = /home/phil/.ec2/cert.pem
>                         >
>                         >                 queue 1
>                         >
>                         >
>                         >                 And the condor_config_val (the salient ones, I think)
>                         >                 $ condor_config_val -dump | grep -i amazon
>                         >                 AMAZON_GAHP = $(SBIN)/amazon_gahp
>                         >                 AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
>                         >                 GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
>                         >
>                         >                 and
>                         >                 $ condor_config_val -dump | grep -i ssl
>                         >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
>                         >                 SOAP_SSL_SKIP_HOST_CHECK = True
>                         >
>                         >                 I've tried both with and without
>                         >                 SOAP_SSL_SKIP_HOST_CHECK.
>                         >                 The SSL_CA_FILE exists.
>                         >                 If I try WITHOUT the
>                         >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
>                         >                 then I get
>                         >                 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
>                         >                 "SSL_ERROR_SSL
>                         >                 error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
>                         >                 Detail: SSL connect failed in tcp_connect()
>                         >
>                         >
>                         >                 Right now I'm flummoxed.
>                         >
>                         >                 Thanks,
>                         >                 Phil
>                         >
>                         >                 --
>                         >                 Philip Papadopoulos, PhD
>                         >                 University of California, San Diego
>                         >                 858-822-3628 (Ofc)
>                         >                 619-331-2990 (Fax)
>                         >
>                         >         Phil,
>                         >
>                         >         I'm assuming you aren't getting those errors 100% of the
>                         >         time and that you're actually talking to AWS's EC2 service.
>                         >
>                         >         I've seen similar intermittent issues in the past. They came
>                         >         and went from day to day. After much investigation, I
>                         >         eventually chalked them up to transient issues with AWS' EC2
>                         >         SOAP interface. The amazon_gahp was Condor's first means to
>                         >         interact with EC2 and was written to the (then popular) SOAP
>                         >         interface. Over the years the EC2 Query interface has
>                         >         apparently taken hold as the interface of choice, with many
>                         >         EC2 clones not supporting SOAP. In response, the ec2_gahp has
>                         >         been written, available in 7.7, against the Query interface.
>                         >         You should try it out, especially on a day when the SOAP
>                         >         interface is failing, so that we might get a better handle on
>                         >         whether the issue is truly SOAP v Query.
>                         >
>                         >         Best,
>                         >
>                         >
>                         >         matt
>                         >
>                         >
>                         >
>                         > --
>                         > Philip Papadopoulos, PhD
>                         > University of California, San Diego
>                         > 858-822-3628 (Ofc)
>                         > 619-331-2990 (Fax)
>                         >
>                         > _______________________________________________
>                         > Condor-users mailing list
>                         > To unsubscribe, send a message to
>                         > condor-users-request@xxxxxxxxxxx with a
>                         > subject: Unsubscribe
>                         > You can also unsubscribe by visiting
>                         > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>                         >
>                         > The archives can be found at:
>                         > https://lists.cs.wisc.edu/archive/condor-users/
>                 
>                 
>                 
>                 
>                 
>                    --
>                    Philip Papadopoulos, PhD
>                    University of California, San Diego
>                 
>                    858-822-3628 (Ofc)
>                    619-331-2990 (Fax)
>                 
>                 
>                 
>                 
>                 
>                 --
>                 Philip Papadopoulos, PhD
>                 University of California, San Diego
>                 858-822-3628 (Ofc)
>                 619-331-2990 (Fax)
>                 
>         
> 
> 
> 
> -- 
> Philip Papadopoulos, PhD
> University of California, San Diego
> 858-822-3628 (Ofc)
> 619-331-2990 (Fax)

--- End Message ---