
Cloud Cost Anomalies: Managing the Unexpected

  • Daniel Fleps
  • Jan 26, 2025
  • 8 min read

Updated: Mar 24, 2025

The cloud scales in seconds at the touch of a button, which is often an advantage when applications reach a short-lived peak in utilization. However, this elasticity also carries risks: incorrect configurations or an unexpectedly high number of requests can quickly cause costs to skyrocket, and in contrast to the classic on-prem world, there are virtually no limits. It is therefore important to keep an eye on the utilization of your cloud resources so that possible anomalies can be identified quickly and their causes rectified.


A cost anomaly occurs when unexpected costs arise, e.g. when the costs for a service suddenly double from one day to the next instead of increasing predictably. This blog post uses a real-life example to illustrate tools, processes, and responsibilities for handling cost anomalies in AWS, along with some additional food for thought on the topic.


How cost anomalies arise 


There are various events that can trigger an anomaly. When dealing with them, it is important to differentiate whether an anomaly is expected or unexpected and whether it is wanted or unwanted. The combination of these factors determines whether there is a need for action or whether the anomaly is merely acknowledged. The possible variations are illustrated in the matrix below.



Anomaly action matrix

When people talk about anomalies, they usually mean the unexpected and unwanted ones. However, there are also often expected and wanted, or in other words, accepted anomalies. These occur, for example, during migrations or when new products are introduced. A popular example is Black Friday, where online stores expect a sharp increase in utilization, but accept this because they also generate more sales. Theoretically, however, it would of course also be possible to take action here, for example by not offering any discounts or even closing the store for the day.


This is of course an unusual scenario and is only intended to illustrate a possible action against an expected anomaly. The following are examples of anomaly triggers. These are limited to unexpected and unwanted as well as expected and wanted, as these are the most common combinations. 


1. Unexpected and unwanted events 


1.1 Misconfigurations  

Configuration errors, e.g. in a cloud service or an application, can result in unwanted costs, for example if requests are made every second instead of every hour because a unit or number has been accidentally swapped. 
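
As a purely hypothetical illustration (the endpoint and constant below are made up), a polling job that was meant to run hourly but whose interval was set in the wrong unit will send 3,600 times more billable requests than intended:

```python
import time
import urllib.request

# Intended: poll once per hour. time.sleep() takes seconds, so this value
# actually polls once per second - 3,600x more requests than planned.
POLL_INTERVAL = 1        # should have been 3600

while True:
    # Hypothetical endpoint; assume every request is billed or triggers billed downstream work.
    urllib.request.urlopen("https://api.example.com/status")
    time.sleep(POLL_INTERVAL)
```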


1.2 Misuse / malicious attacks  

If an attack occurs, e.g. on a cloud application or cloud infrastructure, the costs can rise quickly because either a large number of requests have to be answered, or the cloud application or infrastructure is misused for purposes for which it was not intended. 


2. Expected and wanted events 


2.1 Migration  

Migrating from on-prem to the cloud, or within the cloud, can incur large costs that are limited to a certain period of time. Data traffic often plays a role here, as large amounts of data are moved during the migration.


2.2 High workload 

Anomalies can also occur through peak workloads. Imagine an online shop on Black Friday or when a new product is launched. The increased volume of requests will also increase the cloud bill, but that is to be expected.


2.3 New projects  

When a new project launches and starts using the cloud, there is no reference for its consumption, so you might see an impact on some services even at a global scale. New cloud workloads should be tracked to make sure everything works as intended. As soon as the project runs in a steady state, alarm thresholds can be set. It is also a good idea to give projects that are new to the cloud a basic onboarding; the FinOps team can help with that.
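
Once the project reaches a steady state, a simple cost budget with a notification threshold can serve as a first guardrail. A minimal sketch using boto3 and the AWS Budgets API (account ID, amount, and e-mail address are placeholders):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "new-project-monthly",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},  # steady-state estimate
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,              # alert at 80% of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```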


What is the relevance of anomaly management?  


Anomaly management enables FinOps teams to recognize unexpected cloud costs, investigate them, and get them under control. Expected anomalies can be acknowledged and excluded from alerts. The cloud environment is often already being monitored from a security perspective, and there may already be control mechanisms in place that are also relevant from a cost perspective. Standardized processes for detecting and analyzing anomalies are crucial for dealing with them quickly and reliably. This is the only way to avoid major anomalies with high business impact.


How can anomalies be recognized? 


Such processes are usually supported by automated tools that use machine learning to recognize anomalies. These are offered both by the cloud providers themselves and by third-party providers. The following are some approaches to identifying anomalies.


Option 1: Implement Detection Tools

Use cloud-native or third-party tools to monitor cost and usage patterns. A list of possible tools can be found below:

  • Native tools from cloud providers like AWS Cost Anomaly Detection (see screenshot) 

  • IBM Cloudability  

  • Finout  

  • CloudHealth by VMware 



Tool based anomaly detection based on the example of AWS Cost Anomaly Detection*
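
AWS Cost Anomaly Detection can also be set up programmatically via the Cost Explorer API; a minimal sketch with boto3 that creates a per-service monitor (the monitor name is an arbitrary example):

```python
import boto3

ce = boto3.client("ce")

# Create a dimensional monitor that evaluates spend per AWS service.
response = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "per-service-monitor",
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }
)
monitor_arn = response["MonitorArn"]
print("Created monitor:", monitor_arn)
```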

Option 2: Set Up Alerts

Configure alerts for unusual spending spikes or usage deviations. The granularity and the area to be monitored (e.g. per service or per stage) can be defined individually depending on the tool.  
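
Building on the monitor from the previous sketch, a subscription defines who gets alerted and from which impact onwards. The frequency, threshold value, and address below are placeholders:

```python
import boto3

ce = boto3.client("ce")

# ARN returned by create_anomaly_monitor in the previous sketch (placeholder value).
monitor_arn = "arn:aws:ce::123456789012:anomalymonitor/example"

ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "daily-cost-anomaly-alerts",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "EMAIL", "Address": "finops@example.com"}],
        "Frequency": "DAILY",
        # Only alert when the absolute cost impact of an anomaly is at least 100 USD.
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": ["100"],
            }
        },
    }
)
```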


Option 3: Automate Policies

Enable automated policies for immediate response to anomalies. 
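
What "automated" means in practice depends on the environment and on how much risk you are willing to accept. As a deliberately drastic sketch (the tag name and trigger are assumptions, and stopping resources automatically should be limited to non-production workloads), a Lambda function invoked by the anomaly alert could stop running sandbox instances:

```python
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Invoked by the anomaly alert (e.g. forwarded through an SNS topic).
    # Find running instances tagged as sandbox workloads (hypothetical tag).
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["sandbox"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

    return {"stopped": instance_ids}
```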


Recommended for All Options: Analyze Anomalies

Regularly review and analyze anomalies to understand root causes and mitigate risks. This can also improve your future monitoring.  
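
For the analysis itself, detected anomalies and their root causes can also be pulled via the API. A minimal sketch that lists the anomalies of the last 30 days with their estimated impact (the field selection is illustrative):

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

response = ce.get_anomalies(
    DateInterval={
        "StartDate": (date.today() - timedelta(days=30)).isoformat(),
        "EndDate": date.today().isoformat(),
    }
)

for anomaly in response["Anomalies"]:
    impact = anomaly["Impact"]["TotalImpact"]
    for cause in anomaly.get("RootCauses", []):
        # Root causes point to the service, account, region and usage type involved.
        print(
            f"{impact:>10.2f} USD",
            cause.get("Service"),
            cause.get("LinkedAccount"),
            cause.get("Region"),
            cause.get("UsageType"),
        )
```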


A typical workflow for anomaly monitoring proceeds as follows:  

  1. Configure the monitoring (granularity, area, threshold)  

  2. Get alerted and check unusual spend  

  3. Analyze the root cause to understand where the anomaly is coming from 

  4. Optimize your monitoring and learn from past anomalies 


Who can do what if an anomaly occurs? 


Setting up tools and alerts is a first step towards finding anomalies, but processes and responsibilities need to be defined to deal with them and to improve how they are handled over time. The following describes typical roles and what they can and should do to handle anomalies well.


Cloud Provider (AWS, Azure, GCP)  

Cloud providers can inform you about anomalies based on configured budgets and machine learning algorithms. Their support teams can also help you identify the cause of an anomaly and provide information on how to resolve it.


Cloud Center of Excellence  

Most big companies run a Cloud Center of Excellence (CCoE), a team that manages cloud services and supports internal customers who are using the cloud. As the CCoE has not only insight into all cloud accounts of the organization but also basic control over them, it can inform account owners about anomalies and, in extreme cases with high risk, even shut down a service. It can also request information from account owners to better understand anomalies.


FinOps Team  

The FinOps team is a specialized team, often within the CCoE, that is familiar with how the various cloud services are billed. It can therefore analyze and find even complex causes of anomalies. It can then inform the teams affected by an anomaly and give them context on where it is coming from and what they can do to handle it. The FinOps team can also ask those teams for additional information to gather everything needed for clarification.


Account Owner  

Account owners are responsible for their cloud workloads, so in the first place they should take care to avoid anomalies. When an anomaly occurs, they can do their own analysis, but they can also inform other teams such as the FinOps team to raise awareness and get support when needed. When they plan actions that could foreseeably lead to anomalies, they can inform the FinOps team in advance so that it can better evaluate possible alerts. They can also ask for information if something is unclear to them. As they are in direct control of their cloud account, they can also reconfigure or shut down affected resources.


Security Team 

The security team can and should be involved, especially if anomalies are caused by misuse. If an attack is already known to the security team, it can inform the FinOps team that an increase in cloud costs is to be expected. If an attack only becomes apparent through rising cloud costs, the security team should be called in quickly to resolve the problem. 


Practical example 

The following practical example is intended to show that cost anomalies can be more complex than the cases above and that the cause of an anomaly is not always recognizable at first glance. It also walks through an exemplary monitoring process, including the analysis.

In the example case, AWS Cost Anomaly Detection was used. The tool informed us by email that there was a cost anomaly for the Amazon Elastic Container Service. It was initially unclear whether the change in costs was spread across many accounts or whether certain accounts were particularly affected. The following screenshot shows the cost trend for the service. 


Distribution by Accounts (Group by dimension “Linked account”)

To get a better picture, you should look at different dimensions of the service concerned. Among other things, we decided to check the charge type and obtained the following picture.


Distribution Savings Plan vs. On-Demand usage (Group by dimension "Charge type")

It can be clearly seen here that almost the entire Savings Plan coverage for the service disappears at the start of the anomaly, although nothing had changed in the service itself. Depending on the case, it would also make sense to obtain further information from the teams concerned. In this case, however, the central Savings Plans had been purchased by the Cloud Center of Excellence, so we had insight into the data and were able to determine that no Savings Plan had expired. So why did the coverage dwindle?


Next, we looked at which services still benefit from the Savings Plan and saw the following picture.


Distribution of Savings Plan (Group by dimension "Service", Filter: Charge type = “Savings Plan Covered Usage")

Here you can see that the coverage for Container Services drops to zero, while almost 100% of the coverage goes to EC2. It is important to note that Savings Plans are always allocated to the instances where they achieve the greatest savings potential. Different instances benefit differently from Savings Plans: some are 30% cheaper, for example, while others are 42% cheaper.
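
A simplified way to think about this allocation (the rates and discounts below are made up, and the real mechanism works on an hourly commitment rather than this toy model): the commitment is applied to the usage with the highest discount first, so new usage with a better discount can pull coverage away from an existing workload.

```python
# Toy model of Savings Plan allocation: highest discount percentage is covered first.
usage = {
    "Fargate (ECS) tasks": {"on_demand_per_hour": 4.00, "sp_discount": 0.30},
    "EC2 g5.xlarge":       {"on_demand_per_hour": 6.00, "sp_discount": 0.42},
}
commitment = 3.00  # hypothetical hourly Savings Plan commitment in USD

for name, u in sorted(usage.items(), key=lambda kv: kv[1]["sp_discount"], reverse=True):
    sp_rate = u["on_demand_per_hour"] * (1 - u["sp_discount"])  # discounted hourly rate
    applied = min(commitment, sp_rate)
    commitment -= applied
    print(f"{name}: {applied:.2f} USD/h of commitment applied ({u['sp_discount']:.0%} discount)")
```

With these made-up numbers, the g5.xlarge usage absorbs the entire commitment and the container workload is left uncovered, which mirrors the pattern observed in the screenshots.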


The next task was to find out whether there were new instances to which the Savings Plan had been shifted. Therefore, we tracked the use of different instance types.


Distribution of EC2 Savings Plan usage (Group by dimension "Instance Type", Filter: Service = "EC2", Charge type = “Savings Plan Covered Usage")

At the time of the anomaly, the g5.xlarge instances stand out, which is also reflected in the anomaly's costs. As a final step, we decided to check which accounts use this instance type. We also limited the search to instances covered by the Savings Plan, as only these were relevant for our anomaly anyway.


Increase of g5.xlarge usage (Group by dimension "Linked account", Filter: Service = "EC2", Instance type = "g5.xlarge", Charge type = “Savings Plan Covered Usage")

It is now easy to see that a large number of new instances have been launched, mostly from a single account (in the screenshot above, each color represents an account). We contacted the account owners to inform them about the incident and to discuss how long they expect to use the instances. One possible measure would be to reserve them at account level, so that the capacity of the Master Savings Plan is freed up again for the other services that previously benefited from it.
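
For reference, this last drill-down can also be reproduced with the Cost Explorer API. A minimal sketch (the dates are placeholders, and the exact dimension values, especially the charge type string, should be checked against your own Cost Explorer data):

```python
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-01-26"},  # placeholder date range
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
    Filter={
        "And": [
            {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
            {"Dimensions": {"Key": "INSTANCE_TYPE", "Values": ["g5.xlarge"]}},
            {"Dimensions": {"Key": "RECORD_TYPE", "Values": ["SavingsPlanCoveredUsage"]}},
        ]
    },
)

# Print the daily covered cost per linked account.
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        account_id = group["Keys"][0]
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], account_id, cost)
```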


Conclusion

In the practical example shown, it is easy to see that many factors can be interdependent in cloud environments, and it is not always immediately obvious where an anomaly comes from. A change in one account or service can also affect other accounts or services for which no configuration changes have been made. An example was given showing an anomaly arising from the automatic distribution of savings plans throughout an organization. 


FinOps helps to create transparency and understanding for cloud environments, which simplifies the identification of anomalies. With a deep understanding of cloud costs and services, the benefits and risks of different cloud features such as shared savings plans can be better balanced out and the value of the cloud can be maximized. It is therefore worth taking a closer look not only at anomalies in isolation but also at the cloud environment as a whole in combination with the FinOps practice.



