
Re: [Condor-users] Unable to start EC2 instance



Closer, but not quite there.

[root@vizagra ~]# tail -f /var/opt/condor/log/GridmanagerLog.phil
06/23/11 19:37:22 [25245] Found job 8.0 --- inserting
06/23/11 19:37:22 [25245] gahp server not up yet, delaying ping
06/23/11 19:37:22 [25245] (8.0) doEvaluateState called: gmState GM_INIT, condorState 1
06/23/11 19:37:22 [25245] GAHP server pid = 25247
06/23/11 19:37:28 [25245] resource https://ec2.amazonaws.com/ is now up
06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
06/23/11 19:37:29 [25245] (8.0) doEvaluateState called: gmState GM_DESTROY_KEYPAIR_SUBMIT, condorState 1
06/23/11 19:37:32 [25245] (8.0) doEvaluateState called: gmState GM_CREATE_KEYPAIR, condorState 1
06/23/11 19:37:32 [25245] ERROR "Bad EC2_VM_START Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
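
For anyone reading a gridmanager log like the one above, here is a minimal Python sketch (not part of Condor) that pulls out the sequence of gmState values per job and any ERROR lines. The regular expressions and the summarize() helper are assumptions based only on the line format shown in this message; other Condor versions may log differently.

```python
import re

# Assumed line layout, taken from the log excerpt in this message:
#   MM/DD/YY HH:MM:SS [pid] message
LINE_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'\[(?P<pid>\d+)\] (?P<msg>.*)$'
)
# Matches messages such as "(8.0) doEvaluateState called: gmState GM_INIT, ..."
STATE_RE = re.compile(
    r'^\((?P<job>\d+\.\d+)\) doEvaluateState called: gmState (?P<state>\w+),'
)

SAMPLE = """\
06/23/11 19:37:22 [25245] Found job 8.0 --- inserting
06/23/11 19:37:22 [25245] (8.0) doEvaluateState called: gmState GM_INIT, condorState 1
06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
06/23/11 19:37:29 [25245] (8.0) doEvaluateState called: gmState GM_DESTROY_KEYPAIR_SUBMIT, condorState 1
06/23/11 19:37:32 [25245] (8.0) doEvaluateState called: gmState GM_CREATE_KEYPAIR, condorState 1
06/23/11 19:37:32 [25245] ERROR "Bad EC2_VM_START Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
"""


def summarize(lines):
    """Return (states, errors).

    states maps a job id to its gmState sequence with consecutive
    repeats collapsed; errors is a list of (time, message) tuples.
    """
    states = {}
    errors = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip wrapped or otherwise unrecognized lines
        msg = m.group('msg')
        s = STATE_RE.match(msg)
        if s:
            seq = states.setdefault(s.group('job'), [])
            if not seq or seq[-1] != s.group('state'):
                seq.append(s.group('state'))
        elif msg.startswith('ERROR'):
            errors.append((m.group('time'), msg))
    return states, errors


if __name__ == '__main__':
    states, errors = summarize(SAMPLE.splitlines())
    for job, seq in states.items():
        print(job, ' -> '.join(seq))
    for when, msg in errors:
        print(when, msg)
```

Run against the excerpt above, it shows job 8.0 reaching GM_CREATE_KEYPAIR just before the EC2_VM_START error is logged.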


If you can tell me where to put debug statements in the ec2_gahp files, I can do that.
-P


On Thu, Jun 23, 2011 at 5:36 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
I believe with the new ec2_gahp you need "grid_resource = ec2 https://ec2.amazonaws.com/"

Best,


matt
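
Given the amazon-versus-ec2 distinction above, a quick way to see which grid type a submit description asks for is sketched below in Python. The grid_type() helper is hypothetical (not a Condor tool) and only looks at the first word after grid_resource; it does not expand macros or handle continued lines.

```python
def grid_type(submit_text):
    """Return the first word of the grid_resource value, lowercased,
    or None if the submit description has no grid_resource line."""
    for raw in submit_text.splitlines():
        line = raw.strip()
        if line.startswith('#') or '=' not in line:
            continue  # comments and non-assignment lines such as "queue 1"
        key, _, value = line.partition('=')
        if key.strip().lower() == 'grid_resource':
            parts = value.split()
            return parts[0].lower() if parts else None
    return None


if __name__ == '__main__':
    old_style = "universe = grid\ngrid_resource = amazon https://ec2.amazonaws.com/\nqueue 1\n"
    new_style = "universe = grid\ngrid_resource = ec2 https://ec2.amazonaws.com/\nqueue 1\n"
    print(grid_type(old_style))
    print(grid_type(new_style))
```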


On 06/23/2011 07:30 PM, Philip Papadopoulos wrote:
Still no love....
I git-cloned the head of the Condor tree, rebuilt it, and
copied condor_submit, condor_gridmanager, and ec2_gahp into bin, sbin, and sbin respectively.

I changed the condor config to use the new gahp.
$ condor_config_val -dump | grep AMAZON
AMAZON_GAHP = $(SBIN)/ec2_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20

And then submitted with
universe = grid
grid_resource = amazon https://ec2.amazonaws.com/
periodic_release = NumHolds < 3
+NumHolds = 0
periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
executable = RunEC2VM
amazon_keypair_file = keypair.$(Process)

amazon_ami_id = ami-4ed12d27
amazon_instance_type = m1.large
amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
amazon_private_key = /home/phil/.ec2/pk.pem
amazon_public_key = /home/phil/.ec2/cert.pem

queue 1

as before.

GridManager.log shows
06/23/11 16:21:27 Setting maximum accepts per cycle 4.
06/23/11 16:21:29 [27034] ================================> AmazonJob::AmazonJob 1
06/23/11 16:21:29 [27034] Found job 2.0 --- inserting
06/23/11 16:21:29 [27034] gahp server not up yet, delaying ping
06/23/11 16:21:29 [27034] (2.0) doEvaluateState called: gmState GM_INIT, condorState 1
06/23/11 16:21:29 [27034] GAHP server pid = 27038
06/23/11 16:21:34 [27034] ERROR "Bad AMAZON_VM_STATUS_ALL Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp

From this same node, I can use the native EC2 tools to start, stop, and
query instances, e.g.
$ ec2-describe-instances
RESERVATION     r-ef3f0283      126101316194    default
INSTANCE        i-d91433b7      ami-4ed12d27
ec2-50-17-131-129.compute-1.amazonaws.com
ip-10-110-235-155.ec2.internal      running         0
m1.large        2011-06-23T23:05:41+0000        us-east-1c
aki-e5c1218c       monitoring-disabled      50.17.131.129
10.110.235.155
instance-store                                  paravirtual
xen             sg-427ca02b


and

ec2-terminate-instances i-d91433b7
INSTANCE        i-d91433b7      running shutting-down


-P







On Thu, Jun 23, 2011 at 7:41 AM, Philip Papadopoulos
<philip.papadopoulos@xxxxxxxxx> wrote:


   I will try that when I get in this AM (I'm on the west coast) and
   report back.
   Thanks,
   Phil

   On Thu, Jun 23, 2011 at 7:34 AM, Timothy St. Clair
   <tstclair@xxxxxxxxxx> wrote:

       You could extract the condor_submit + gridmanager + ec2_gahp..

       Cheers,
       Tim

       On Thu, 2011-06-23 at 07:26 -0700, Philip Papadopoulos wrote:
        > Do I need all of condor 7.7 or can I just extract the ec2_gahp
        > executable from it?
        >
        > Thanks,
        > Phil
        >
        >
        >
        > On Thu, Jun 23, 2011 at 4:56 AM, Matthew Farrellee <matt@xxxxxxxxxx>
        > wrote:
        >
        >         On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:
        >
        >
        >                 Trying out Condor 7.6.1 -- installed via the
        >                 rhap.stripped.tar.gz
        >
        >                 I get the following in my GAHP log.
        >                 06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
        >                 06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
        >                 "End of file or no input: Operation interrupted or timed out"
        >                 Detail: [no detail]
        >
        >                 06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
        >                 06/22/11 09:42:08 EOF reached on pipe 0
        >                 06/22/11 09:42:08 stdin buffer closed, exiting
        >                 06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
        >                 "End of file or no input: Operation interrupted or timed out"
        >                 Detail: [no detail]
        >
        >                 06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
        >                 06/22/11 09:48:33 EOF reached on pipe 0
        >                 06/22/11 09:48:33 stdin buffer closed, exiting
        >                 06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
        >                 "End of file or no input: Operation interrupted or timed out"
        >                 Detail: [no detail]
        >
        >                 06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
        >
        >
        >                 The submission file is simple:
        >                 universe = grid
        >                 grid_resource = amazon https://ec2.amazonaws.com/
        >                 periodic_release = NumHolds < 3
        >                 +NumHolds = 0
        >                 periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
        >                 executable = RunEC2VM
        >                 amazon_keypair_file = keypair.$(Process)
        >
        >                 amazon_ami_id = ami-4ed12d27
        >                 amazon_instance_type = m1.large
        >                 amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
        >                 amazon_private_key = /home/phil/.ec2/pk.pem
        >                 amazon_public_key = /home/phil/.ec2/cert.pem
        >
        >                 queue 1
        >
        >
        >                 And the condor_config_val output (the salient ones, I think):
        >                 $ condor_config_val -dump | grep -i amazon
        >                 AMAZON_GAHP = $(SBIN)/amazon_gahp
        >                 AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
        >                 GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
        >
        >                 and
        >                 $ condor_config_val -dump | grep -i ssl
        >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
        >                 SOAP_SSL_SKIP_HOST_CHECK = True
        >
        >                 I've tried both with and without
        >                 SOAP_SSL_SKIP_HOST_CHECK.
        >                 The SSL_CA_FILE exists.
        >                 If I try WITHOUT the
        >                 SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
        >                 then I get
        >                 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
        >                 "SSL_ERROR_SSL
        >                 error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
        >                 Detail: SSL connect failed in tcp_connect()
        >
        >
        >                 Right now I'm flummoxed.
        >
        >                 Thanks,
        >                 Phil
        >
        >                 --
        >                 Philip Papadopoulos, PhD
        >                 University of California, San Diego
        >
        >                 858-822-3628 (Ofc)
        >                 619-331-2990 (Fax)
        >
        >         Phil,
        >
        >         Assuming you aren't getting those errors 100% of the time, and
        >         you're actually talking to AWS's EC2 service, I've seen
        >         similar intermittent issues in the past. They came and went
        >         by days. After much investigation, I eventually chalked them
        >         up to transient issues with AWS's EC2 SOAP interface. The
        >         amazon_gahp was Condor's first means to interact with EC2 and
        >         was written to the (then popular) SOAP interface. Over the
        >         years the EC2 Query interface has apparently taken hold as
        >         the interface of choice, with many EC2 clones not supporting
        >         SOAP. In response, the ec2_gahp has been written against the
        >         Query interface and is available in 7.7. You should try it
        >         out, especially on a day when the SOAP interface is failing,
        >         so that we might get a better handle on whether the issue is
        >         truly SOAP vs. Query.
        >
        >         Best,
        >
        >
        >         matt
        >
        >
        >
        > --
        > Philip Papadopoulos, PhD
        > University of California, San Diego
        > 858-822-3628 (Ofc)
        > 619-331-2990 (Fax)
        > _______________________________________________
        > Condor-users mailing list
        > To unsubscribe, send a message to condor-users-request@xxxxxxxxedu with a
        > subject: Unsubscribe
        > You can also unsubscribe by visiting
        > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
        >
        > The archives can be found at:
        > https://lists.cs.wisc.edu/archive/condor-users/





   --
   Philip Papadopoulos, PhD
   University of California, San Diego
   858-822-3628 (Ofc)
   619-331-2990 (Fax)




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)



