
[HTCondor-users] Condor-Annex setup and check-setup report OK as long as CloudFormation entries exist.



Hello all,

I just wanted to report on a weird condor_annex behaviour that has been tripping me up since Friday-ish. The TLDR is that condor_annex remembers configuration state for as long as the CloudFormation stacks exist. If you want to fix an annex misconfiguration, or just experiment with tweaking annex behaviour by starting from scratch, you need to delete the config buckets, the Lambda functions, and the CloudFormation stack entries.
The problem is that `condor_annex -check-setup` will report OK even when the actual config bucket and Lambda functions don't exist, as long as the CloudFormation stacks do. I'm not sure whether this is a bug or intended behaviour, but it's definitely confusing.
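
For reference, this is roughly the cleanup that gave me a clean slate. The two stack names are taken from the GAHP logs further down; I'm assuming they are the only leftovers, so check your account for other HTCondorAnnex-* stacks first:

# Delete the stale CloudFormation stacks. In my case their buckets and
# Lambda functions were already gone, so only the stack entries remained.
aws cloudformation delete-stack --stack-name HTCondorAnnex-ConfigurationBucket
aws cloudformation delete-stack --stack-name HTCondorAnnex-LambdaFunctions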

To illustrate why it's confusing, consider the following walkthrough (if the summary above was clear, you don't need to read the rest):

The annex config files exist and have valid data in them. At this step, user_config contains just the `SEC_PASSWORD_FILE` and `ANNEX_DEFAULT_AWS_REGION`
variables and nothing else:

>>> (lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 ~]$ ls -al .condor/
>>> total 20
>>> drwxrwxr-x.  2 centos centos   96 Apr 12 21:33 .
>>> drwx------. 13 centos centos 4096 Apr 12 20:29 ..
>>> -rw-------.  1 centos centos   30 Apr 12 20:02 condor_pool_password
>>> -rw-------.  1 centos centos   40 Apr 12 21:34 privateKeyFile
>>> -rw-------.  1 centos centos   20 Apr 12 21:34 publicKeyFile
>>> -rw-rw-r--.  1 centos centos   95 Apr 12 21:33 user_config
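
For completeness, at this point user_config holds only the two lines below (reproduced from the cat output further down):

SEC_PASSWORD_FILE=/home/centos/.condor/condor_pool_password
ANNEX_DEFAULT_AWS_REGION=us-west-2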

So let's set up:

>>> (lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 ~]$ condor_annex -setup
>>> Creating configuration bucket (this takes less than a minute).. complete.
>>> Creating Lambda functions (this takes about a minute).. complete.
>>> Creating instance profile (this takes about two minutes).. complete.
>>> Creating security group (this takes less than a minute).. complete.
>>> Setup successful.
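
If you want to see what -setup just recorded on the AWS side, the stacks show up in CloudFormation. This is plain AWS CLI; the HTCondorAnnex- name prefix is my assumption based on the logs below:

aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE \
    --query "StackSummaries[?starts_with(StackName, 'HTCondorAnnex-')].StackName"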

Let's verify the setup is correct:

>>> (lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 ~]$ condor_annex -check-setup
>>> Checking security configuration... OK.
>>> Checking for configuration bucket... OK.
>>> Checking for Lambda functions... OK.
>>> Checking for instance profile... OK.
>>> Checking for security group... OK.
>>> Your setup looks OK.

Wow, so smooth! Or is it?

>>> (lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 ~]$ condor_annex -annex-name test -count 1
>>> Will request 1 m4.large on-demand instance for 0.83 hours. Each instance will terminate after being idle for 0.25 hours.
>>> Is that OK? (Type 'yes' or 'no'): yes
>>> Starting annex...
>>> Failed to check connectivity: 'E_HTTP_RESPONSE_NOT_200 (404)' (1): '{"Message":"Function not found: arn:aws:lambda:us-west-2:SCRUB_ACC_ID:function:HTCondorAnnex-CheckConnectivity","Type":"User"}'.
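
You can confirm that the function behind that ARN really is gone with a plain AWS CLI lookup (this is my own check, not something condor_annex runs):

aws lambda get-function \
    --function-name arn:aws:lambda:us-west-2:SCRUB_ACC_ID:function:HTCondorAnnex-CheckConnectivity
# -> An error occurred (ResourceNotFoundException) when calling the
#    GetFunction operation: Function not found: ...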

None of the Lambda functions or S3 buckets created by the setup actually exist, even though user_config duly records them all:

>>> (lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 ~]$ cat .condor/user_config
>>> SEC_PASSWORD_FILE=/home/centos/.condor/condor_pool_password
>>> ANNEX_DEFAULT_AWS_REGION=us-west-2
>>>
>>> # Generated by condor-annex -setup for region us_west_2.
>>> us_west_2.ANNEX_DEFAULT_ACCESS_KEY_FILE = /home/centos/.condor/publicKeyFile
>>> us_west_2.ANNEX_DEFAULT_ODI_INSTANCE_PROFILE_ARN = arn:aws:iam::SCRUB_ACC_ID:instance-profile/HTCondorAnnex-InstanceProfile-InstanceConfigurationProfile-2T7H88JQRE6H
>>> us_west_2.ANNEX_DEFAULT_ODI_KEY_NAME = HTCondorAnnex-KeyPair
>>> us_west_2.ANNEX_DEFAULT_S3_BUCKET = htcondorannex-configurationbu-configurationbucket-pd6zcveite43
>>> us_west_2.ANNEX_DEFAULT_SECRET_KEY_FILE = /home/centos/.condor/privateKeyFile
>>> us_west_2.ANNEX_DEFAULT_ODI_SECURITY_GROUP_IDS = sg-02178feb804d8007b
>>> us_west_2.ANNEX_DEFAULT_ODI_LEASE_FUNCTION_ARN = arn:aws:lambda:us-west-2:SCRUB_ACC_ID:function:HTCondorAnnex-LambdaFunctions-odiLeaseFunction-AIJOPN5UFOSC
>>> us_west_2.ANNEX_DEFAULT_SFR_LEASE_FUNCTION_ARN = arn:aws:lambda:us-west-2:SCRUB_ACC_ID:function:HTCondorAnnex-LambdaFunctions-sfrLeaseFunction-1U1F6YMQSPKBU

The journalctl logs for both condor and condor-annex-ec2 are clean. There's seemingly nothing in HTCondor's master log that would indicate
the collector keeled over or hit some other error. The AnnexGahp logs are full of messages similar to:

>>> 04/12/21 22:04:32 got stdin: 'CF_CREATE_STACK 2 https://cloudformation.us-west-2.amazonaws.com /home/centos/.condor/publicKeyFile /home/centos/.condor/privateKeyFile HTCondorAnnex-ConfigurationBucket https://s3.amazonaws.com/condor-annex/bucket-9.json CAPABILITY_IAM NULL'
>>> 04/12/21 22:04:32 Sending CF_CREATE_STACK 2 https://cloudformation.us-west-2.amazonaws.com /home/centos/.condor/publicKeyFile /home/centos/.condor/privateKeyFile HTCondorAnnex-ConfigurationBucket https://s3.amazonaws.com/condor-annex/bucket-9.json CAPABILITY_IAM NULL to worker 1
>>> 04/12/21 22:04:32 Request URI is 'https://cloudformation.us-west-2.amazonaws.com'
>>> 04/12/21 22:04:32 Payload is 'Action=CreateStack&...&StackName=HTCondorAnnex-ConfigurationBucket&TemplateURL=https%3A%2F%2Fs3.amazonaws.com%2Fcondor-annex%2Fbucket-9.json&Version=2010-05-15'
>>> 04/12/21 22:04:32 Query did not return 200 (400), failing.
>>> 04/12/21 22:04:32 Failure response text was '<ErrorResponse xmlns="http://cloudformation.amazonaws.com/doc/2010-05-15/">
>>>   <Error>
>>>     <Type>Sender</Type>
>>>     <Code>AlreadyExistsException</Code>
>>>     <Message>Stack [HTCondorAnnex-ConfigurationBucket] already exists</Message>
>>>   </Error>
>>>   <RequestId>a2e7e383-976b-4571-8717-f635dc9b84d8</RequestId>
>>> </ErrorResponse>
>>> '.

or

>>> 04/12/21 22:04:32 got stdin: 'CF_CREATE_STACK 4 https://cloudformation.us-west-2.amazonaws.com /home/centos/.condor/publicKeyFile /home/centos/.condor/privateKeyFile HTCondorAnnex-LambdaFunctions https://s3.amazonaws.com/condor-annex/template-9.json CAPABILITY_IAM S3BucketName htcondorannex-configurationbu-configurationbucket-pd6zcveite43 NULL'
>>> 04/12/21 22:04:32 Sending CF_CREATE_STACK 4 https://cloudformation.us-west-2.amazonaws.com /home/centos/.condor/publicKeyFile /home/centos/.condor/privateKeyFile HTCondorAnnex-LambdaFunctions https://s3.amazonaws.com/condor-annex/template-9.json CAPABILITY_IAM S3BucketName htcondorannex-configurationbu-configurationbucket-pd6zcveite43 NULL to worker 1
>>> 04/12/21 22:04:32 Request URI is 'https://cloudformation.us-west-2.amazonaws.com'
>>> 04/12/21 22:04:32 Payload is 'Action=CreateStack&...htcondorannex-configurationbu-configurationbucket-pd6zcveite43&StackName=HTCondorAnnex-LambdaFunctions&TemplateURL=https%3A%2F%2Fs3.amazonaws.com%2Fcondor-annex%2Ftemplate-9.json&Version=2010-05-15'
>>> 04/12/21 22:04:32 Query did not return 200 (400), failing.
>>> 04/12/21 22:04:32 Failure response text was '<ErrorResponse xmlns="http://cloudformation.amazonaws.com/doc/2010-05-15/">
>>>   <Error>
>>>     <Type>Sender</Type>
>>>     <Code>AlreadyExistsException</Code>
>>>     <Message>Stack [HTCondorAnnex-LambdaFunctions] already exists</Message>
>>>   </Error>
>>>   <RequestId>e57971e0-9e9a-4dd7-8792-d6244f25cbd4</RequestId>
>>> </ErrorResponse>
>>> '.
>>> 04/12/21 22:04:32 CMD("CF_CREATE_STACK 4 https://cloudformation.us-west-2.amazonaws.com /home/centos/.condor/publicKeyFile /home/centos/.condor/privateKeyFile HTCondorAnnex-LambdaFunctions https://s3.amazonaws.com/condor-annex/template-9.json CAPABILITY_IAM S3BucketName htcondorannex-configurationbu-configurationbucket-pd6zcveite43 NULL") is done with result 4 1 E_HTTP_RESPONSE_NOT_200\ (400) <ErrorResponse\ xmlns="http://cloudformation.amazonaws.com/doc/2010-05-15/">\
>>> \ \ <Error>\
>>> \ \ \ \ <Type>Sender</Type>\
>>> \ \ \ \ <Code>AlreadyExistsException</Code>\
>>> \ \ \ \ <Message>Stack\ [HTCondorAnnex-LambdaFunctions]\ already\ exists</Message>\
>>> \ \ </Error>\
>>> \ \ <RequestId>e57971e0-9e9a-4dd7-8792-d6244f25cbd4</RequestId>\
>>> </ErrorResponse>\

which is confusing to say the least, especially because I can explicitly see that these things do not actually exist:

(lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 .condor]$ aws lambda list-functions | grep FunctionName
      "FunctionName": "scrubbed_name_of_function_that_exists_but_isn't_condor",

(lsst-scipipe-0.4.1) [centos@ip-172-31-48-210 .condor]$ aws s3api list-buckets | grep Name
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":     Âbunch of bucket names that are sanitized
      "Name":     Âbecause they are not condor....
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
      "Name":
    "DisplayName": "dirac",

I don't know whether this is a bug or not, but I wanted to point it out in case someone else runs into something similar.
Sorry for the long email.

Regards,
D.