
Re: [Condor-users] Unable to start EC2 instance



On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:
Trying out Condor 7.6.1 -- installed via the rhap.stripped.tar.gz

I get the following in my GAHP log.
06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got error(code:Client,
msg:End of file or no input: Operation interrupted or timed out
06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault:
SOAP-ENV:Client [no subcode]
"End of file or no input: Operation interrupted or timed out"
Detail: [no detail]

06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got error(code:Client,
msg:End of file or no input: Operation interrupted or timed out
06/22/11 09:42:08 EOF reached on pipe 0
06/22/11 09:42:08 stdin buffer closed, exiting
06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault:
SOAP-ENV:Client [no subcode]
"End of file or no input: Operation interrupted or timed out"
Detail: [no detail]

06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got error(code:Client,
msg:End of file or no input: Operation interrupted or timed out
06/22/11 09:48:33 EOF reached on pipe 0
06/22/11 09:48:33 stdin buffer closed, exiting
06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault:
SOAP-ENV:Client [no subcode]
"End of file or no input: Operation interrupted or timed out"
Detail: [no detail]

06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got error(code:Client,
msg:End of file or no input: Operation interrupted or timed out


The submission file is simple:
universe = grid
grid_resource = amazon https://ec2.amazonaws.com/
periodic_release = NumHolds < 3
+NumHolds = 0
periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
executable = RunEC2VM
amazon_keypair_file = keypair.$(Process)

amazon_ami_id = ami-4ed12d27
amazon_instance_type = m1.large
amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
amazon_private_key = /home/phil/.ec2/pk.pem
amazon_public_key = /home/phil/.ec2/cert.pem

queue 1


And the condor_config_val output (the salient settings, I think):
$ condor_config_val -dump | grep -i amazon
AMAZON_GAHP = $(SBIN)/amazon_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20

and
$ condor_config_val -dump | grep -i ssl
SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
SOAP_SSL_SKIP_HOST_CHECK = True

I've tried both with and without SOAP_SSL_SKIP_HOST_CHECK.
The SOAP_SSL_CA_FILE exists.
If I try WITHOUT the
SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
then I get
  Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
"SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
Detail: SSL connect failed in tcp_connect()


Right now I'm flummoxed.

Thanks,
Phil

--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)

Phil,

Assuming you aren't getting those errors 100% of the time, and that you're actually talking to AWS's EC2 service, I've seen similar intermittent issues in the past. They came and went by the day. After much investigation, I eventually chalked them up to transient problems with AWS's EC2 SOAP interface. The amazon_gahp was Condor's first means of interacting with EC2 and was written against the (then popular) SOAP interface. Over the years the EC2 Query interface has apparently taken hold as the interface of choice, and many EC2 clones do not support SOAP at all. In response, the ec2_gahp, available in 7.7, has been written against the Query interface. You should try it out, especially on a day when the SOAP interface is failing, so that we can get a better handle on whether the issue is truly SOAP vs. Query.
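
One way to check whether the failures really are intermittent is to tally them by hour from the GAHP log. What follows is a rough Python sketch, not part of Condor. It assumes every log entry starts with the MM/DD/YY HH:MM:SS prefix seen in the excerpt above and that each failure is logged with the text "Call to DescribeInstances failed". The function name tally_failures is made up for illustration.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Timestamp prefix as it appears in the quoted log, e.g. "06/22/11 09:38:38".
# Assumption: continuation lines (SOAP-ENV:..., Detail:...) carry no timestamp.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
FAILURE = "Call to DescribeInstances failed"


def tally_failures(lines):
    """Count DescribeInstances failures per date-and-hour bucket."""
    buckets = Counter()
    for line in lines:
        match = STAMP.match(line)
        if match and FAILURE in match.group(2):
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            buckets[when.strftime("%Y-%m-%d %H:00")] += 1
    return buckets


# Optional command-line use: pass the path to the GAHP log as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log:
        for bucket, count in sorted(tally_failures(log).items()):
            print(bucket, count)
```

Run against the file named by AMAZON_GAHP_LOG, this prints one line per hour in which at least one failure was logged. Hours with no failures are absent, so gaps in the output suggest the SOAP calls were getting through then, provided jobs were actually being managed during those hours.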

Best,


matt