We believe empowering engineers drives innovation.

Why monitoring in the cloud is important, and why it’s important to not be loud

By Kurt Bomya
March 7, 2023

The beauty between noise and silence

We’ve become accustomed to controlling the notifications of our lives. We tap unsubscribe on emails that offer it, we become frustrated with apps demanding too much attention, and we escape our typical work environment to that sweet single-person meeting room to get some uninterrupted work done. We crave peace.

By contrast, however, it’s the very nature of joining society that we must pull up a chair and turn the volume up high enough to catch the next wave. We cannot blindly delete the content in our inboxes; there is ever-present opportunity lying just beneath the surface in the pools of data that gather there. Our inputs hold the potential for job offers, long-awaited leads, friends in need, major life updates, and more.

So what are we to do? We can’t comprehend all the data the world offers, nor can we mute it all! We have to locate that perfect signal-to-noise ratio. It’s time to find that beauty between.

hills-sing-meme.png

Monitoring: where it all begins

The pyramid below is taken from Google’s Site Reliability Engineering online book and simultaneously serves as the inspiration for this blog post. Google outlines the critical need for monitoring as the prerequisite for any other stability-focused activity an engineer can undertake.

google-sre-pyramid.png

Consider now the very next tier above monitoring: Incident Response. This dependency makes monitoring the lifeblood of observability and the most critical component of running a stable service. Successful incident response requires engineers to navigate backward in time by diving into the data emitted in the past, hoping it exists and hoping it’s enough to correct the root cause.

Those Site Reliability Engineers are right! We must be proactive in strengthening our environments!

The monitoring pipeline

The monitoring tier at the bottom of this pyramid handles everything from sensors to alerting. Let’s break down the components of a common monitoring pipeline, keeping in mind that many modern monitoring stacks combine several of these functions into the same tool.

Push vs Pull

In a push model, collectors such as Beats ship their data toward a central endpoint; in a pull model, collectors such as node_exporter expose the data for a scraper further along the pipeline to come grab. Application-level metrics can also be exposed using custom exporters.

push-pull-model.png

While many open source monitoring tools exist, most require heavy tuning to nail that perfect balance we seek. There’s a quicker way to get going in your cloud.

AWS CloudWatch

Enter CloudWatch! Out of the box, AWS provides a monitoring solution called CloudWatch. Many AWS services publish their metrics to CloudWatch; you can view the full list in the AWS CloudWatch documentation.

When launching a new EC2 instance, for example, AWS immediately presents a basic monitoring tab with 14 metrics. While these outputs leave much to be desired, it’s a starting point nonetheless! AWS services use a push model to send their metrics to CloudWatch, assuming metrics are enabled. CloudWatch retains this metric data for 15 months, albeit down-sampled as the data points age.

initial-ec2-monitoring.png
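Those basic metrics arrive at five-minute intervals; enabling detailed monitoring narrows that to one minute (at additional cost). If you manage instances with Terraform, flipping that switch is a single argument. A minimal sketch, with a placeholder AMI ID and instance type:

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"              # placeholder instance type

  # Enable detailed (1-minute) CloudWatch monitoring instead of the
  # default 5-minute basic monitoring. Detailed monitoring is billed.
  monitoring = true
}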

CloudWatch Alerting

Configuring well-tuned CloudWatch alarms in your environment is critical to making sure you aren’t being alerted for non-actionable events. If you acknowledge a page, take no action, and everything remains fine in your environment, then you have too much noise! Examine the aws_cloudwatch_metric_alarm configuration expressed in Terraform below.

How can it be modified to increase our signal-to-noise ratio?

resource "aws_cloudwatch_metric_alarm" "noisy-alarm" {
  alarm_name                = "noisy-alarm"
  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "1"
  metric_name               = "CPUUtilization"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Maximum"
  threshold                 = "50"
  alarm_description         = "This metric monitors ec2 cpu utilization, and is way too noisy!"
  insufficient_data_actions = []
}

The Four Golden Signals

Chapter 6 of the Site Reliability Engineering book, Monitoring Distributed Systems, covers the concept of the four golden signals:

  1. Latency - For each request, how long does it take to get through our service end to end?
  2. Traffic - How many requests over a rolling time window are we handling?
  3. Errors - How many failure events are we seeing over a rolling time window, and how many of them should concern us?
  4. Saturation - How full is our service and underlying systems?

These signals are the most common tipping points in a system. Keeping the golden signals in mind helps us surface metrics worth firing alerts for, because there’s action we can take. For example, if CPU utilization is consistently alerting, it may indicate we need to scale vertically: shut down the machine, migrate the disk to a larger instance, reattach the load balancer, and resume the workload. In environments with higher uptime requirements, we can leverage autoscaling to scale horizontally, adding nodes of the same size to the load balancer pool without firing an alert at all. This evolution is the ultimate goal of an SRE team: the services it supports should be able to grow far faster than its headcount, reducing interrupts while improving overall Service Level Objective health.
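As a sketch of that horizontal path: assuming an Auto Scaling group (named app here purely for illustration) already exists in your Terraform, a target tracking policy keeps average CPU near a target by adding and removing instances on its own, no page required.

resource "aws_autoscaling_policy" "cpu_target_tracking" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name # assumed to exist elsewhere
  policy_type            = "TargetTrackingScaling"

  # Add or remove instances to hold the group's average CPU near 60%.
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60.0
  }
}

With a policy like this in place, CPU saturation becomes something the platform absorbs rather than something a human is paged for.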

Returning now to our CloudWatch alarm: if we aim to categorize CPU utilization, it is a type of fullness, or in golden-signal language, saturation. We want to make sure this alarm doesn’t needlessly fire. Here are a few settings in noisy-alarm that should stand out:

  1. evaluation_periods = "1" - a single two-minute sample is enough to trip the alarm.
  2. statistic = "Maximum" - one momentary spike within the period counts as a breach.
  3. threshold = "50" - half utilization is routine for many workloads and rarely actionable on its own.

Here’s the Terraform code for the CPU utilization alarm, version 2.0! It’s quieter and thus more likely to fire only when a problem is truly cropping up and needs maintainer attention. Each alarm you configure will need careful tuning: don’t back off so far that you’re blind to failure events, and don’t stay so loud that you grow into the habit of snoozing alarms.

resource "aws_cloudwatch_metric_alarm" "quieter-alarm" {
  alarm_name                = "quieter-alarm"
  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "2"
  metric_name               = "CPUUtilization"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Average"
  threshold                 = "80"
  alarm_description         = "This metric monitors ec2 cpu utilization"
  insufficient_data_actions = []
}
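Note that as written, neither alarm notifies anyone when it changes state. To actually page, you point alarm_actions at a notification target. A minimal sketch, assuming your paging tool subscribes to an SNS topic (named pager here purely for illustration):

# A hypothetical notification target that your paging tool subscribes to.
resource "aws_sns_topic" "pager" {
  name = "pager"
}

Adding alarm_actions = [aws_sns_topic.pager.arn] to the quieter-alarm resource (and ok_actions, if you want recovery notices) wires the alarm to that topic.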

Getting one alarm into shape is good, but we want to be great!

Stacking Alarms with CloudWatch Composite

Failure states are often more nuanced than a single metric on a single instance. For example, you may have a service which only fails when multiple conditions occur. We need to relate them!

AWS CloudWatch offers a solution for combining alerts called CloudWatch Composite Alarms. The aws_cloudwatch_composite_alarm resource enables you to combine several alarms into a composite alarm that only fires under certain conditions.

Not that we’d actually want to, but for the sake of demonstration, let’s merge our two previously defined alarms, noisy-alarm and quieter-alarm, into a single composite alarm!

resource "aws_cloudwatch_composite_alarm" "composite-cpu-high-example" {
  alarm_description = "This is a composite alarm!"
  alarm_name        = "cpu-high-example"

  alarm_rule = <<EOF
ALARM(${aws_cloudwatch_metric_alarm.quieter-alarm.alarm_name}) OR
ALARM(${aws_cloudwatch_metric_alarm.noisy-alarm.alarm_name})
EOF
}

composite-alert01.png

Of course, this example is silly; we wouldn’t normally merge two alarms on the same metric with different settings into a composite alarm like this. Let’s reach for some more realistic examples!

resource "aws_cloudwatch_composite_alarm" "disk-rooted-causing-api-failures" {
  alarm_description = "This composite alarm will fire when the disk appears to be causing api failures to increase"
  alarm_name        = "disk-rooted-causing-api-failures"

  alarm_rule = <<EOF
ALARM(${aws_cloudwatch_metric_alarm.disk.alarm_name}) AND
ALARM(${aws_cloudwatch_metric_alarm.http500s.alarm_name})
EOF
}

composite-alert02.png

Above we’ve configured two CloudWatch alarms feeding into one composite alarm.
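The disk and http500s alarms referenced in that alarm_rule aren’t defined in this post. As a rough sketch of what they could look like, here’s one possibility: a disk usage alarm built on the disk_used_percent metric the CloudWatch agent publishes to the CWAgent namespace, and a 5xx alarm built on an Application Load Balancer’s HTTPCode_Target_5XX_Count metric. The thresholds are illustrative, and in practice each alarm would also be scoped with dimensions matching your instance and load balancer.

resource "aws_cloudwatch_metric_alarm" "disk" {
  alarm_name          = "disk-used-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "3"
  metric_name         = "disk_used_percent" # published by the CloudWatch agent
  namespace           = "CWAgent"
  period              = "300"
  statistic           = "Average"
  threshold           = "90"
  alarm_description   = "Root volume is nearly full"
}

resource "aws_cloudwatch_metric_alarm" "http500s" {
  alarm_name          = "api-5xx-elevated"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "2"
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = "300"
  statistic           = "Sum"
  threshold           = "25"
  alarm_description   = "The API is returning an elevated number of 5xx responses"
}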

Composite Alarms as inputs to Composite Alarms

composite-alert03.png

In this last example we’ve configured three CloudWatch alarms (left) feeding into two composite alarms (middle), which in turn feed into yet another composite alarm (right). The various sensors in this final example each provide different context on overall system health. As the inputs change, the alarms adjust accordingly, giving the engineer the right level of awareness to respond.
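CloudWatch allows a composite alarm’s rule to reference other composite alarms, which is how that final tier is built. Here’s a sketch of the outermost alarm, assuming the second mid-tier composite is defined elsewhere in the configuration as backend-degraded (a name invented here for illustration):

resource "aws_cloudwatch_composite_alarm" "service-unhealthy" {
  alarm_description = "Fires only when both mid-tier composite alarms agree something is wrong"
  alarm_name        = "service-unhealthy"

  # Composite alarms can reference other composite alarms by name.
  alarm_rule = <<EOF
ALARM(${aws_cloudwatch_composite_alarm.disk-rooted-causing-api-failures.alarm_name}) AND
ALARM(${aws_cloudwatch_composite_alarm.backend-degraded.alarm_name})
EOF
}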

Incremental Improvement

Site Reliability Engineers need to maintain a near-constant stance of creativity and motivation. In most scenarios, you have the power to improve the situation. You have the ability to tune the alerts in your environment and make these hills sing.

Think through the alerts that are plaguing you today. Produce a report on how frequently your alerts fire. Write a list if it helps your mental state, striking a tally next to each alarm when it goes off. Journal in your post-mortems. What did you have to do? Anything? Did you immediately verify something else in your infrastructure when the alert fired? How can that something else be combined with the alert to improve the system’s noise level?