Automated Multi-Region deployments in AWS: Gotchas

"Gotcha" may be a bit over the top; "caveats" is perhaps a better term. Using StackSets on their own can cause some order-of-operations issues, and adding multi-region on top compounds them.

We will discuss these caveats in more depth in other articles, but we wanted to touch on StackSets up front, since they underpin everything we will do.

Applying StackSets to OUs makes automated deployment work like a charm, most of the time. As we laid out in the Intro, we deploy all of our IaC as StackSets targeted at OUs. We do this to automate deployments and to keep every Account for an application consistent.

This also enables us to create private tenants for customers that only they can access, with minimal overhead.

The whole point of our cloud journey is to remove overhead, reduce maintenance needs, and build more awesome things.

StackSet Execution Order

Below we show an example set of StackSets that we deploy into our AWS Organization using Delegated Admin:

  • DSOP Centralized Lambda Code Buckets
    • This is deployed only to the DSOP OU. It creates an S3 bucket that any account in the organization can pull from. Lambda code and Lambda Layers are deployed here.
  • DSOP Centralized Kinesis
    • This is deployed only to our Logging OU. It creates a centralized S3 bucket that Kinesis Firehose streams can write to.
  • DSOP SSM Lookup
    • This is deployed to our Production OU. This enables us to look up SSM values. It replaced the resolve functionality for SSM.
  • DSOP ACM Generator
    • This is deployed to our Production OU. This creates the DNS records our production apps need to create and authorize TLS certificates.
  • DSOP Dependency Analysis
    • This is deployed to our Production OU. This enables us to analyze how dependent StackSets are created and deleted, ensuring we don't try to create or delete resources out of order.
  • DSOP Parameter Lookup
    • This is deployed to our Production OU. This enables us to store configuration data in our Security account and pull that from the Production OU accounts.

Next, per application OU (in this case, the Weather App in the Weather App OU), we generally deploy the following:

  • App Infrastructure StackSet (KMS, S3, VPC, Subnets)
  • App Global Tables StackSet
  • App StepFunction StackSet
  • App Lambda Functions StackSet
  • App API Gateway StackSet

We break apart things like Lambda functions and API Gateway because we want to separate compute from infrastructure. Our desire is to have the App CodePipeline deploy as "thin" a template as possible and keep the "blast radius" of changes restricted to as few items as possible.

Part of the benefit is that we can then deploy other types of compute, like Fargate, with loose coupling.

So, here come the limitations. We are only going to touch on the major one here: StackSets execute all at once and in no particular order. We can't make them run in this order (for deployment):

  • App Infrastructure StackSet
  • App StepFunction StackSet (Depends on App Infrastructure StackSet)
  • App Lambda Functions StackSet (Depends on App Infrastructure StackSet)
  • App API Gateway StackSet (Depends on App Lambda Functions StackSet and App Infrastructure StackSet)

When a new Account is added to the Weather App OU, all of these StackSets run at the same time. The App StepFunction, App Lambda Functions, and App API Gateway StackSets all fail because the App Infrastructure StackSet is still executing. Thus, three of the StackSets we need fail. Getting them to redeploy is a pain as well, since there is no straightforward means to reapply failed stacks.
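To make the failure mode concrete, here is a minimal Python sketch of a simultaneous launch. It is a simplified model, not how CloudFormation works internally: the short names and the rule "a StackSet succeeds only if its prerequisites were already deployed before launch" are assumptions made for illustration.

```python
# Dependencies from the list above; the short names are illustrative only.
DEPENDS_ON = {
    "infrastructure": set(),
    "stepfunction": {"infrastructure"},
    "lambda": {"infrastructure"},
    "api-gateway": {"lambda", "infrastructure"},
}


def launch_all_at_once(depends_on, already_deployed=frozenset()):
    """Start every undeployed StackSet at the same moment.

    Simplified model: a StackSet succeeds only if all of its
    prerequisites were fully deployed before the launch started.
    """
    deployed = set(already_deployed)
    succeeded, failed = [], []
    for name in sorted(depends_on):
        if name in deployed:
            continue
        if depends_on[name] <= deployed:
            succeeded.append(name)
        else:
            failed.append(name)
    return succeeded, failed


def rounds_until_done(depends_on):
    """Count how many launch-everything rounds it takes to deploy it all."""
    deployed, rounds = set(), 0
    while len(deployed) < len(depends_on):
        succeeded, _ = launch_all_at_once(depends_on, deployed)
        if not succeeded:
            raise ValueError("no progress: missing or circular dependency")
        deployed |= set(succeeded)
        rounds += 1
    return rounds
```

Under this model, the first launch into a new account deploys only the infrastructure StackSet, and it takes three full rounds of retries before the API Gateway StackSet can succeed.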

But wait, there's more

StackSet execution order also rears its head when you remove StackSets. Let's say your Weather App OU has 2 accounts in it, and you decide to decommission one and remove it from the OU. If you scope StackSets to that OU for automatic deployment and enforce deletion on removal, the stacks will be removed from that account.

So, now you have the problem above in reverse. How do you delete the AWS Lambda functions defined in your Lambda Functions StackSet before you delete the Infrastructure StackSet that defines the Sub Groups and VPCs?

Well, again there is no easy way to do this.
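The ordering we would like for deletion can be sketched in a few lines of Python. This is only a model of the rule "delete a StackSet once nothing that depends on it is still deployed"; the short names and helper functions are hypothetical.

```python
# Same dependencies as the deployment list above; short names are illustrative.
DEPENDS_ON = {
    "infrastructure": set(),
    "stepfunction": {"infrastructure"},
    "lambda": {"infrastructure"},
    "api-gateway": {"lambda", "infrastructure"},
}


def dependents_of(name, depends_on):
    """Every StackSet that lists `name` as a prerequisite."""
    return {other for other, deps in depends_on.items() if name in deps}


def can_delete(name, still_deployed, depends_on):
    """Safe to delete once no dependent of `name` is still deployed."""
    return not (dependents_of(name, depends_on) & set(still_deployed))


def deletion_rounds(depends_on):
    """Group StackSets into rounds that can be deleted together."""
    remaining = set(depends_on)
    rounds = []
    while remaining:
        ready = sorted(n for n in remaining if can_delete(n, remaining, depends_on))
        if not ready:
            raise ValueError("circular dependency between StackSets")
        rounds.append(ready)
        remaining -= set(ready)
    return rounds
```

For the four StackSets above, this yields the API Gateway and StepFunction StackSets first, then Lambda Functions, and Infrastructure last.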

Tags to the rescue

So, we thought for a while about how to solve this. We iterated on three different solutions and believe we have one that doesn't hard-code dependencies into core templates.

To solve this problem, we've developed yet another AWS CloudFormation Custom Resource. This one tracks dependencies that CloudFormation Templates may have on one another. We do this by attaching tag data to the StackSet Instances themselves.

Thus, we can indicate that our lambda template relies on our infrastructure template, and that our api-gateway template depends on our lambda template.

When we start to tie this together, we can enforce the following order:

Deploy these first:

  • App Infrastructure StackSet (KMS, S3, VPC, Subnets)
  • App Global Tables StackSet

Then, after the infrastructure templates deploy, the dependency checks pass for the following:

  • App StepFunction StackSet
  • App Lambda Functions StackSet

Finally, after the App Lambda Functions StackSet deploys, we can deploy:

  • App API Gateway StackSet

All of this occurs without us having to embed logic to monitor dependencies between StackSets.
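The three waves above are what a topological sort of the dependency graph produces. The following sketch uses Python's standard `graphlib` module (Python 3.9+) to group StackSets into waves; it illustrates the ordering only and is not the Custom Resource's actual implementation. The short names are assumptions.

```python
from graphlib import TopologicalSorter

# Each StackSet maps to the StackSets that must be deployed before it.
DEPENDS_ON = {
    "infrastructure": set(),
    "global-tables": set(),
    "stepfunction": {"infrastructure"},
    "lambda": {"infrastructure"},
    "api-gateway": {"lambda", "infrastructure"},
}


def deployment_waves(depends_on):
    """Group StackSets into waves; each wave only needs earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()   # everything whose prerequisites are done
        waves.append(sorted(ready))
        sorter.done(*ready)          # mark the whole wave as deployed
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(deployment_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Reversing the list of waves gives a valid deletion order.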

We mandated each StackSet apply the following tags:

Tags: 
  - Key: "dsop:stackset:dependson"
    Value: !Sub "${parStackSetApplicationName}-infrastructure"
  - Key: "dsop:stackset:name"
    Value: !Sub "${parStackSetApplicationName}-lambda"
  - Key: "dsop:stackset:application"
    Value: !Ref parStackSetApplicationName
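As a rough illustration of how tags like these could be turned into a dependency graph, here is a small Python sketch. The tag keys come from the snippet above; the function names, the resolved values such as "weather-lambda", and the assumption that each stack carries a single "dependson" value are ours and may not match the real Custom Resource.

```python
DEPENDS_ON_KEY = "dsop:stackset:dependson"
NAME_KEY = "dsop:stackset:name"


def tags_to_dict(tags):
    """Convert CloudFormation-style [{'Key': ..., 'Value': ...}] tags to a dict."""
    return {tag["Key"]: tag["Value"] for tag in tags}


def build_dependency_graph(stack_tag_lists):
    """Map each tagged stack name to the set of names it depends on.

    Assumes one dependency per stack, as in the tag example above.
    """
    graph = {}
    for tags in stack_tag_lists:
        tag_map = tags_to_dict(tags)
        name = tag_map[NAME_KEY]
        graph.setdefault(name, set())
        dependency = tag_map.get(DEPENDS_ON_KEY)
        if dependency:
            graph[name].add(dependency)
            graph.setdefault(dependency, set())  # make sure the target is a node
    return graph


# Hypothetical resolved tags for an application named "weather".
EXAMPLE_TAGS = [
    [
        {"Key": NAME_KEY, "Value": "weather-infrastructure"},
        {"Key": "dsop:stackset:application", "Value": "weather"},
    ],
    [
        {"Key": DEPENDS_ON_KEY, "Value": "weather-infrastructure"},
        {"Key": NAME_KEY, "Value": "weather-lambda"},
        {"Key": "dsop:stackset:application", "Value": "weather"},
    ],
    [
        {"Key": DEPENDS_ON_KEY, "Value": "weather-lambda"},
        {"Key": NAME_KEY, "Value": "weather-api-gateway"},
        {"Key": "dsop:stackset:application", "Value": "weather"},
    ],
]
```

Calling `build_dependency_graph(EXAMPLE_TAGS)` returns a mapping in which "weather-api-gateway" depends on "weather-lambda", which in turn depends on "weather-infrastructure".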

These tags give our templates the context that our Custom Resource needs in order to function. We then add these two resources to our deployed template:

# Waits for stacks that depend on this stack to be cleaned up.
# DependsOn makes CloudFormation create this resource after the role and
# function, so on stack deletion it is removed (and waits) before them.
WaitForDependencyCleanup:
  Type: Custom::WaitForDependencyCleanup
  DependsOn:
    - AppLambdaIAMRole
    - WebAppApiAppDotNetCoreFunction
  Version: '1.0'
  Properties:
    ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:dsop-cloudformation-dependency-analysis"
    StackId: !Ref "AWS::StackId"
    Type: Delete

# Waits for the stacks this stack depends on to be created
WaitForDependencyCreation:
  Type: Custom::WaitForDependencyCreation
  Version: '1.0'
  Properties:
    ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:dsop-cloudformation-dependency-analysis"
    StackId: !Ref "AWS::StackId"
    Type: Create
    Arn:
      - !Sub "arn:${AWS::Partition}:kms:${parGlobalTableRegion2}:${AWS::AccountId}:alias/App/dynamodb-global"
      - !Sub "arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:alias/App/dynamodb-global"

Together, these two resources perform all the dependency analysis we need.
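We have not shown the Lambda behind the Custom Resource, but the core of a "wait for creation" check might look something like the Python sketch below. Everything here is hypothetical: `wait_for_dependencies`, the injected `get_status` callback, and the retry settings are illustrative names, not the real function's interface. The status strings are standard CloudFormation stack statuses.

```python
import time

# Stack statuses we treat as "the dependency is ready" (an assumption).
READY_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def wait_for_dependencies(dependencies, get_status, max_attempts=30,
                          delay_seconds=10, sleep=time.sleep):
    """Poll until every dependency reports a ready status.

    `get_status` is a caller-supplied function taking a dependency name
    and returning its current stack status string. Returns True when all
    dependencies are ready, or False if attempts run out first.
    """
    pending = set(dependencies)
    for attempt in range(max_attempts):
        # Keep only the dependencies that are still not ready.
        pending = {dep for dep in pending if get_status(dep) not in READY_STATUSES}
        if not pending:
            return True
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return False
```

Passing `get_status` and `sleep` in as parameters keeps the polling logic testable without AWS access. A real handler would also have to fit its waiting inside the Lambda timeout and report success or failure back to CloudFormation.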

Now, as our StackSets begin to clean up, they will be forced to wait for the stacks that depend on them to clean themselves up first.

We do suggest performing some additional cleanup after decommissioning an account. For instance, you may still have S3 buckets containing data that you did not mark for automated deletion. Some of this could probably be handled by a bespoke set of Step Functions, but that could introduce risks of its own. Finally, the removed account could have templates that did not finish deleting, so a bit of cleanup must still be performed.

Other Gotchas

There are other gotchas out there, which we will touch on in the follow-on articles listed below: things like DynamoDB Global Tables, S3 bucket replication, and Lambda code locations.

Finally, StackSets have a limit of 100 per admin account, so this strategy may run into limits depending on how large your solution and application stacks grow. If you have 10 apps with 10 StackSet templates each, that alone is 100 StackSets, and that's going to start to create some issues for you.
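A quick back-of-the-envelope check makes the ceiling easy to see. The sketch below uses the limit of 100 quoted above, plus the six shared DSOP StackSets and five per-application StackSets listed earlier in this article; the function names are illustrative.

```python
STACKSET_LIMIT = 100       # per admin account, as quoted above
SHARED_STACKSETS = 6       # the DSOP StackSets listed earlier
STACKSETS_PER_APP = 5      # the per-application StackSets listed earlier


def total_stacksets(apps, per_app=STACKSETS_PER_APP, shared=SHARED_STACKSETS):
    """Total StackSets in the admin account for a given number of apps."""
    return shared + apps * per_app


def max_apps(per_app=STACKSETS_PER_APP, shared=SHARED_STACKSETS,
             limit=STACKSET_LIMIT):
    """Largest number of apps that still fits under the limit."""
    return (limit - shared) // per_app


if __name__ == "__main__":
    print(total_stacksets(10, per_app=10))  # 10 apps x 10 StackSets, plus shared
    print(max_apps())                       # apps that fit at 5 StackSets each
```

With 10 StackSets per app, 10 apps plus the shared StackSets come to 106, which is already over the limit. At 5 StackSets per app, 18 apps fit.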

Shout Out

Big shout out to George for helping me sort out one issue with deploying CloudFormation StackSets in CodePipeline and get over a last hurdle! We'll touch on his help in Part 7.

Next Up in series

The articles in this series are:

  • Part 1: Intro
  • Part 2: Gotchas
  • Part 3: DynamoDB Global Tables
  • Part 4: AWS Lambda (Pending)
  • Part 5: S3 Replication (Pending)
  • Part 6: AWS Fargate (Pending)
  • Part 7: AWS CodePipeline (Pending)