Orchestrating Zero Downtime Updates with Terraform

By Kyle Persohn
September 2, 2022

Terraform is a popular go-to for developers looking to manage their infrastructure as code. It’s powered by HCL, HashiCorp’s declarative configuration language, which in many cases lets the author focus on the “what” (desired state) and leave the “how” of reconciling changes in the live environment to the tool. Sometimes, though, operations personnel require more precise control over how changes get applied in order to minimize impact to live systems. In this article we’ll explore ways to model quasi-imperative workflows with declarative HCL.

Classic Blue/Green

Terraform’s default behavior is to destroy all resources that require replacement, then proceed to create new ones in their place. This is advantageous in many cases: resources often must be uniquely named, so destroying prior to creation avoids naming conflicts between outgoing and incoming resources. This approach isn’t so great, however, if we want to avoid downtime for infrastructure components that are continuously serving clients.
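
For reference, this default corresponds to leaving the lifecycle meta-argument at its implicit setting. The resource below is a minimal, hypothetical sketch just to make that default explicit:

resource "aws_launch_configuration" "example" {
  name_prefix   = "example-"
  image_id      = data.aws_ami.amzn2.id
  instance_type = "t3.nano"

  lifecycle {
    # Terraform's default: destroy the existing resource first, then create its replacement
    create_before_destroy = false
  }
}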

A common approach to avoiding downtime is the so-called Blue/Green Deployment: in essence, we stand up a complete copy of the resources while retaining the existing infrastructure, and only cut over once the new ones are fully available and ready to do work. Blue/Green also has the advantage of simple, expedient rollback: backing out is just a matter of switching back to the old set of resources, so long as we’ve caught the issue before they are torn down.

We can orchestrate a Blue/Green deployment with Terraform by implementing the HCL outlined below. According to Paul Hinze, formerly of HashiCorp, this is even how the creators of Terraform handle production deployments themselves.

resource "aws_launch_configuration" "bluegreen" {
  name_prefix     = "bluegreen-"
  image_id        = data.aws_ami.amzn2.id
  instance_type   = "t3.nano"
  key_name        = local.key_name
  security_groups = [aws_security_group.instance.id]
  user_data       = filebase64("${path.module}/user-data.sh")

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "bluegreen" {
  name                      = aws_launch_configuration.bluegreen.name
  desired_capacity          = local.num_azs
  health_check_grace_period = 60
  health_check_type         = "ELB"
  launch_configuration      = aws_launch_configuration.bluegreen.name
  min_size                  = local.num_azs
  max_size                  = local.num_azs
  target_group_arns         = [aws_lb_target_group.http.arn]
  vpc_zone_identifier       = aws_subnet.public.*.id
  wait_for_elb_capacity     = local.num_azs

  lifecycle {
    create_before_destroy = true
  }
}

These are the key elements of a Blue/Green deployment in Terraform: create_before_destroy = true on both resources, so replacements are built before the originals are torn down; name_prefix on the launch configuration, so the replacement gets a unique name and can coexist with the outgoing one during the switchover; the ASG name interpolating the launch configuration name, so replacing the launch configuration forces an entirely new ASG rather than an in-place update; and wait_for_elb_capacity, so Terraform waits for the new instances to pass their ELB health checks before tearing down the old group.

If something goes wrong and the new instances never come into service because they fail their health checks, Terraform simply times out and leaves the old infrastructure in place with no disruption to end users.

Blue/Green deployments are great when you can fully duplicate a copy of your infrastructure during the deployment. But what if you can’t afford to do that or have other constraints limiting the max number of instances?

Rolling Updates

Instead of replacing all instances at once with full copies, we can alternatively swap out a subset incrementally. This is commonly referred to as a Rolling Deployment. With a Rolling deployment, there is no need to scale up 2x resources like with Blue/Green, so it works well for environments with resource constraints. I’ve also run across vendor applications with licensing restrictions so strict they didn’t allow bursting beyond the host limit during deployment, so we had to orchestrate a rolling update that started by taking an instance out of service so as to never exceed the license count.

The container ecosystem has very mature support for rolling updates, such as those built into Kubernetes. In the IaaS world, though, the paradigm is less than first-class. Notably, Terraform isn’t capable of leveraging the UpdatePolicy attribute of AWS Auto Scaling that its cousin CloudFormation supports.

For many years, the leading option was to wrap a CloudFormation stack with Terraform to get the best of both tools. Check out @endofcake’s detailed write-up on this approach if you’re curious.
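
A rough sketch of that pattern might look like the following, where Terraform still manages the launch configuration but delegates the Auto Scaling Group (and its UpdatePolicy) to an embedded CloudFormation template. The stack name and the skeletal template are illustrative only, not a drop-in implementation:

resource "aws_cloudformation_stack" "asg" {
  name = "rolling-asg"

  # CloudFormation owns the Auto Scaling Group so its UpdatePolicy can drive the rolling update
  template_body = jsonencode({
    Resources = {
      AutoScalingGroup = {
        Type = "AWS::AutoScaling::AutoScalingGroup"
        Properties = {
          MinSize                 = tostring(local.num_azs)
          MaxSize                 = tostring(local.num_azs)
          LaunchConfigurationName = aws_launch_configuration.bluegreen.name
          VPCZoneIdentifier       = aws_subnet.public.*.id
        }
        UpdatePolicy = {
          AutoScalingRollingUpdate = {
            MinInstancesInService = local.num_azs - 1
            MaxBatchSize          = 1
            PauseTime             = "PT5M"
          }
        }
      }
    }
  })
}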

Wrapping CloudFormation with Terraform is not the best user experience, however. Often, the CloudFormation stack fails and gets out of sync with Terraform’s state, making for an unpleasant reconciliation journey. As a user who prefers to drive Terraform from a pipeline as much as possible, I found this solution particularly disappointing because of how often manual intervention is needed to repair the CloudFormation stacks from the AWS console.

Launch Templates with Instance Refresh

Instance Refresh is AWS’s answer to customer complaints that Auto Scaling Groups lacked an easy way to propagate configuration changes to EC2 instances. As of v3.22 of the AWS Provider, we can leverage this capability natively from Terraform as well, without any dependency on CloudFormation or its idiosyncrasies.

First, if you’re still using Launch Configurations as shown in the earlier example, you’ll want to migrate to a Launch Template instead. Unlike Launch Configurations, Launch Templates are mutable and versioned, so we apply changes in-place to the same resource. AWS recommends Launch Templates for all new development since new features are not being backported to Launch Configurations.

Here’s our previous example refactored to use a Launch Template with Instance Refresh:

resource "aws_launch_template" "refresh" {
  name                   = local.env_name
  image_id               = data.aws_ami.amzn2.id
  instance_type          = "t3.nano"
  key_name               = local.key_name
  user_data              = filebase64("${path.module}/user-data.sh")
  vpc_security_group_ids = [aws_security_group.instance.id]

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "${local.env_name}-asg-instance"
    }
  }
}

resource "aws_autoscaling_group" "refresh" {
  name                      = local.env_name
  desired_capacity          = local.num_azs
  health_check_grace_period = 120
  health_check_type         = "ELB"
  max_size                  = local.num_azs
  min_size                  = local.num_azs
  target_group_arns         = [aws_lb_target_group.http.arn]
  vpc_zone_identifier       = aws_subnet.public.*.id
  wait_for_elb_capacity     = local.num_azs

  launch_template {
    id      = aws_launch_template.refresh.id
    version = aws_launch_template.refresh.latest_version
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66
    }
    triggers = ["tag"]
  }
}

Note that the launch_template block references Terraform’s latest_version property of the template resource. Make sure not to use AWS’s internal $Latest alias; otherwise, Terraform will not be aware of new template versions and won’t cascade the change when the version number increases.
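
For contrast, the pattern to avoid looks like this:

  launch_template {
    id      = aws_launch_template.refresh.id
    version = "$Latest" # resolved inside AWS, so Terraform sees no diff and never cascades the change
  }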

Use the instance_refresh block to customize the refresh behavior as desired. By default, a refresh automatically triggers on changes to certain ASG properties, such as the launch template version. It’s also possible to specify additional triggers, like propagating changes to tags in the example above. You can even orchestrate progressive rollouts with the checkpoint parameters to implement a Canary Release.
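
For example, a canary-style rollout could be sketched by adding checkpoints to the preferences block; the percentages and delay below are illustrative:

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66
      checkpoint_percentages = [33, 100] # pause after roughly a third of the group is replaced, then finish
      checkpoint_delay       = 600       # wait 10 minutes at each checkpoint before continuing
    }
    triggers = ["tag"]
  }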

Go ahead and test this out by applying a simple change to the AMI ID in the Launch Template. On the Instance refresh tab of the ASG console, you should see a new entry appear, and a short while later the instances will begin to be replaced.

One key difference between this strategy and the Blue/Green deployment is that Terraform triggers an Instance Refresh asynchronously. Whereas the deployment via create_before_destroy Launch Configurations blocked until the instances were healthy, here Terraform exits and returns success immediately, before the refresh operation finishes.

We see this caveat noted in the Terraform docs as well:

NOTE: Depending on health check settings and group size, an instance refresh may take a long time or fail. This resource does not wait for the instance refresh to complete.

Maybe someday this behavior will be configurable in a manner similar to wait_for_elb_capacity so that Terraform polls for successful completion. In the meantime, though, it’s good to be aware of these nuanced differences and design accordingly.

Waiting for Instance Refresh Completion

Terraform’s treatment of Instance Refresh as an async operation can be problematic when driving Terraform from a CI/CD pipeline. If the new group of instances ultimately fails, the pipeline can still appear successful because Terraform exits with a status of 0 before any issue with the new deployment has a chance to surface.

Nonetheless, it’s simple to work around this minor obstacle by augmenting Terraform with a deployment checkout script. For example, the following Python snippet polls the instance refresh status so the result can be reflected back into the pipeline:

import os
import time
import sys
import boto3


def main():
    asg_name = os.environ.get('ASG_NAME')
    assert asg_name is not None, 'Must set ASG_NAME in environment.'

    autoscaling = boto3.client('autoscaling')

    # Fetch the most recent Instance Refresh for the ASG
    try:
        refresh = autoscaling.describe_instance_refreshes(
            AutoScalingGroupName=asg_name,
            MaxRecords=1
        )['InstanceRefreshes'][0]
    except IndexError:
        print('Trigger at least one Instance Refresh first.')
        sys.exit(os.EX_UNAVAILABLE)

    # Poll until the refresh reaches a terminal state
    while refresh['Status'] not in ['Successful', 'Failed', 'Cancelled']:
        print(
            f"Instance Refresh {refresh['Status']} "
            f"[{refresh.get('PercentageComplete', 0)}%]: "
            f"{refresh.get('StatusReason', '')}"
        )
        time.sleep(5)
        refresh = autoscaling.describe_instance_refreshes(
            AutoScalingGroupName=asg_name,
            InstanceRefreshIds=[refresh['InstanceRefreshId']]
        )['InstanceRefreshes'][0]

    print(f"Instance Refresh {refresh['Status']} at {refresh['EndTime']}")

    # Exit nonzero on anything but success so the pipeline (and local-exec) sees the failure
    if refresh['Status'] != 'Successful':
        sys.exit(1)


if __name__ == '__main__':
    main()

To incorporate the checkout script into our Terraform workflow, wrap it in a local-exec provisioner as follows:

resource "null_resource" "wait_for_refresh" {
  triggers = {
    id      = aws_autoscaling_group.refresh.launch_template[0].id
    version = aws_autoscaling_group.refresh.launch_template[0].version
    tag     = join(",", [for key, value in aws_autoscaling_group.refresh.tag : "${key}=${value}"])
  }
  
  provisioner "local-exec" {
    command = "python3 checkout.py"
    
    environment = {
      ASG_NAME = aws_autoscaling_group.refresh.name
    }
  }
}

The null_resource triggers mirror the triggers that initiate an instance refresh to ensure the checkout runs whenever a refresh occurs. Now, Terraform waits to exit cleanly until the refresh reports successful:

null_resource.wait_for_refresh: Provisioning with 'local-exec'...
null_resource.wait_for_refresh (local-exec): Executing: ["/bin/sh" "-c" "python3 checkout.py"]
null_resource.wait_for_refresh: Still creating... [10s elapsed]
...
null_resource.wait_for_refresh: Still creating... [8m0s elapsed]
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[0%]: Waiting for instances to warm up before continuing. For example: i-0c55bfeedc8dd51a3 is warming up.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[0%]: Waiting for a health check interval to pass before continuing.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[33%]: Replacing 1 instances. For example: i-0c521779f4a587a00 is launching.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[33%]: Waiting for instances to warm up before continuing. For example: i-0c521779f4a587a00 is warming up.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[33%]: Waiting for a health check interval to pass before continuing.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[67%]: Replacing 1 instances. For example: i-0711bea678e0dabb0 is launching.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[67%]: Waiting for instances to warm up before continuing. For example: i-0711bea678e0dabb0 is warming up.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[67%]: Waiting for a health check interval to pass before continuing.
...
null_resource.wait_for_refresh (local-exec): Instance Refresh InProgress[100%]: Waiting for a health check interval to pass before continuing.
null_resource.wait_for_refresh (local-exec): Instance Refresh Successful at 2022-07-22 21:09:49+00:00
null_resource.wait_for_refresh: Creation complete after 8m4s [id=3425485254618271416]

Terraform waits for the local-exec to finish before displaying its output, so it is normal not to see the status streamed to stdout in real time. However, if an error occurs, the detailed progress is very useful for debugging. While it would be nice if Terraform supported a synchronous wait behavior natively, it’s easy enough to work around.

Final Thoughts

In this article, we explored some Terraform patterns for orchestrating more complex rollouts to avoid service downtime. There are tradeoffs to each approach, so it’s good to have a variety of options in your toolbox to satisfy different use cases.

Terraform is an ever-changing ecosystem so be sure to check the docs for the latest best practices. This article is based on functionality available as of Terraform v1.2.2 with AWS Provider v4.20.1. Check out the companion GitHub repository for full code examples and other patterns.