CloudFix: A FinOps Program to Cut AWS Costs (+Keep Them Down)

by Badri Varadarajan, Portfolio CTO
  • Cloud Computing 2020: Rising Costs
  • AWS Costs Become Problematic
  • Welcome To Hotel California: SMBs and Enterprises
  • Maximum Effort: Rewriting Applications
  • Death By a Thousand Cuts: AWS Cost Data
  • Automated Fixing: The Technical and Organizational Risks
  • AWS Change Manager to the Rescue
  • CloudFix: The Fixer You Needed All Along

Good CFOs worry about the rising costs of cloud computing. Great CFOs ensure their CTO counterparts also worry about them. The question for such CTOs is: should they move away from the public cloud, or launch large-scale app rewriting projects to save money? We’ll argue here that neither solution works or guarantees cost savings. Then, we’ll walk you through a FinOps product called CloudFix that our engineering team built to solve this problem, long-term.

Cloud Computing 2020: Rising Costs

You might remember 2020 for certain other reasons, but it was also a landmark year for computing. Barely a decade after the term was first used, enterprise spending on cloud computing rose higher than spending on on-prem data centers. This should have been both old news and good news.

Worldwide enterprise spending on cloud and data centers graph, 2022

The value of cloud computing had been higher than that of on-prem computing for a few years already. It was easily measurable by the speed at which innovative products could be brought to market, the feature-rich quality of those products, and the sheer planet-scale quantity at which they were used.

AWS Costs Become Problematic

For CFOs and a few industry observers, the cup was half empty. Rather than focus on growth, they focused on the increasing cost of growth.

In the distinctly less sunny days of 2022 and 2023, these voices were bound to get louder. 

Welcome To Hotel California: SMBs and Enterprises

Some industry leaders have proposed the drastic solution of moving away from the public cloud, but this is neither feasible nor effective at scale.

It’s true that some cloud services are egregiously overpriced. But most other services are insanely cost efficient in the public cloud. Any CIO who has ever tried to build their own object store will tell you that S3’s net cost of less than $100 / TB / year with three nines of availability (assuming Intelligent-Tiering) is the stuff their dreams are made of.

Unless a service is self-contained, static and brings in a lot of revenue, the engineering needed to migrate and maintain the service eclipses repatriation savings. Leaving the public cloud doesn’t really save costs for small & medium businesses. Even large enterprises would only save a little, since giant isolated services are rare.

Maximum Effort: Rewriting Applications

A less drastic measure is to stay on the cloud and operate it efficiently. According to the 2022 Flexera State of the Cloud Report, organizations estimate that 32% of their cloud spend is wasted.

Based on our own experience as a team that operates hundreds of products on AWS, we have sometimes cut costs by more than 50% by rewriting applications. Other successful case studies (Airbnb, Spotify) have demonstrated that this method works.

However, application rewrites are not a panacea. There is a reason that ballads and blog posts are written about these success stories - they are unique hero projects. For every one of these stories, there were a myriad of projects that soaked up time and effort without any cost savings at all. 

These projects are the equivalent of someone with a weight problem having surgery or fasting. They work, but the weight comes back eventually and then you’re back to solution hunting. There are better ways to trim the fat than rewriting applications. 

Death By a Thousand Cuts: AWS Cost Data

The core problem is that cloud costs are highly distributed. Our scans of close to a billion dollars in cloud spend, across thousands of AWS accounts, reveal some key insights.

  • Half of all AWS spend is on EC2 already, so it’s not like companies are using cloud native services and paying a premium for them. If anything, we think they should do more of that, but that’s a topic for another post. 
    • As one might expect, this AWS spend is split across hundreds of thousands of EC2 instances, with the average instance costing only a few hundred dollars annually. 
    • Many customers spin EC2 instances up and down over time. So even if an engineer optimizes a particular instance, that optimization will not stick.
  • The rest is split widely across a host of other services. A third of our customers were using at least twenty AWS services, but no single service dominates costs.
CloudFix pie chart on AWS spend by service

To sum up, cloud costs add up in small chunks. There is no one change that would significantly cut costs. Any AWS cost optimization must be done via automation. Just as continuous integration and delivery need DevOps pipelines, continuous cost optimization needs a FinOps pipeline.

Key Facts for a Working FinOps Pipeline

More broadly, any meaningful AWS cost reduction effort must address three points:

  1. Continuous changes: While some initial cost reduction can be a sprint, AWS cost optimization is an ongoing activity. The resources used by products, and the capabilities & costs of those resources continuously change. (Did we mention gp3 volumes?)
  2. Low per-unit savings: While the cumulative savings can be as high as 20%, each individual change is often minuscule. Converting a 100GB EBS volume from gp2 to gp3 saves a whopping $2 every month. It’s just that most large enterprises have thousands of such volumes.
  3. Automated fixing: Central FinOps teams can build service-level expertise and automation, but they lack knowledge of product requirements and operations. Product teams have the reverse problem. 
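
The gp2-to-gp3 arithmetic in point 2 can be sketched as follows. This is a minimal illustration, not CloudFix's own calculation: the per-GB prices are assumed list prices (USD per GB-month, us-east-1) and should be checked against current AWS pricing, the fleet of 5,000 volumes is hypothetical, and the estimate covers storage only (no extra provisioned IOPS or throughput on gp3).

```python
from typing import Iterable

# Assumed list prices in USD per GB-month; verify against current AWS pricing.
GP2_PRICE_PER_GB_MONTH = 0.10
GP3_PRICE_PER_GB_MONTH = 0.08


def monthly_savings_usd(size_gb: float) -> float:
    """Storage-only saving from moving one volume from gp2 to gp3."""
    return round(size_gb * (GP2_PRICE_PER_GB_MONTH - GP3_PRICE_PER_GB_MONTH), 2)


def fleet_annual_savings_usd(volume_sizes_gb: Iterable[float]) -> float:
    """Sum the per-volume monthly savings across a fleet and annualize them."""
    return round(12 * sum(monthly_savings_usd(s) for s in volume_sizes_gb), 2)


if __name__ == "__main__":
    # One 100 GB volume: the "whopping $2 every month" from the text.
    print(monthly_savings_usd(100))
    # Hypothetical fleet of 5,000 such volumes, annualized.
    print(fleet_annual_savings_usd([100] * 5000))
```

Under these assumptions a single volume is not worth an engineer's afternoon, while the fleet total is. That gap is the argument for automation.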

CloudFix: A FinOps Pipeline Built on Native Tools

Our engineering team couldn’t find a pipeline that met all of the above requirements, so we built one called CloudFix.

AWS FinOps pipeline chart

The Core Principles Are: 

  • Collecting data on potential opportunities is simple. AWS makes it easy to get detailed resource configuration data and metrics. You just need to fetch and store them in suitable datastores, along with cost data from your Cost and Usage Report.
  • Finding cost reduction opportunities is a little harder, but can largely be automated given the input data. The main requirement is that the team finding these opportunities needs obsessive and intimate knowledge of AWS services and pricing plans, which can get quite confusing for any individual product team. This is another reason why a purpose-built service for AWS cost optimization is better than asking application developers to cut costs.
    • If anything, finding cost optimization opportunities is too easy. 
    • As operators of a large AWS organization ourselves, we were constantly bombarded with vendors promising large savings on their dashboards, if only we could put engineering resources into executing their recommendations. It only made our leaders feel bad about all the money we were not saving. 
  • Fixing must be automatic. As mentioned above, any individual change may save only a few dollars. Doing them by hand is costly and exposes you to errors. 
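
To make the collect, find and fix stages concrete, here is a minimal sketch of a finder. It is not CloudFix's implementation. It assumes volume records have already been collected and stored as dicts shaped like EC2 DescribeVolumes entries, with a hypothetical "AccountId" key added at collection time. The `Opportunity` class, the "gp2-to-gp3" fixer name and both function names are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Opportunity:
    account_id: str
    resource_id: str
    fixer: str  # hypothetical name of the fixer that would handle this resource


def find_gp2_volumes(volumes):
    """Finder: flag every stored volume record that is still gp2."""
    return [
        Opportunity(v["AccountId"], v["VolumeId"], "gp2-to-gp3")
        for v in volumes
        if v.get("VolumeType") == "gp2"
    ]


def batch_by_account(opportunities):
    """Group flagged resources so each account gets one batch to hand to a fixer."""
    batches = defaultdict(list)
    for opp in opportunities:
        batches[opp.account_id].append(opp.resource_id)
    return dict(batches)


if __name__ == "__main__":
    # Made-up records standing in for collected configuration data.
    stored = [
        {"AccountId": "111111111111", "VolumeId": "vol-a", "VolumeType": "gp2"},
        {"AccountId": "111111111111", "VolumeId": "vol-b", "VolumeType": "gp3"},
        {"AccountId": "222222222222", "VolumeId": "vol-c", "VolumeType": "gp2"},
    ]
    print(batch_by_account(find_gp2_volumes(stored)))
```

The finder only reads stored data and never touches a resource. Keeping it separate from the fixer means findings can be reviewed before anything changes.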

Automated Fixing: The Technical and Organizational Risks

Unsurprisingly, the challenge was fixing, and delivering, those savings. Writing an automated fixer is quite straightforward in most cases.

The key challenges are:

  • Customers are unlikely to let some vendor - or an automation built by another team in their own company - make changes without understanding what the changes are.
  • Even if they did, they would want to control which changes are executed, and when.
  • Crucially, fixers need permission to change resources. Creating an IAM role or service user with such broad powers would violate most customers’ internal access policies, let alone handing that access to a third-party vendor tool.

AWS Change Manager to the Rescue

CloudFix was born when we realized the deep potential in Change Manager, a feature that was added to AWS Systems Manager in December 2020. Change Manager’s feature-set instantly transformed CloudFix into a scalable cost-savings tool. It did this by streamlining the change process and making it more secure. 

Change Manager chart showing automated finding and fixing process

Here is the flow:

  • CloudFix creates Change Templates describing each new type of finder/fixer. Account owners have the opportunity to review and approve the templates before any new Change Request can be made in Change Manager.
  • For each new set of resources to be fixed, a Change Request is created. CloudFix itself only has permission to create a Change Request based on an approved Change Template.
  • Change Requests are automatically executed after the account owner or designated approver approves the change. Designated approvers can be IAM users, roles or AWS SSO users or groups. The designated approvers only need permission to approve change requests and not all of the operations executed during the request.
  • Changes that have no performance impact or risk can be auto approved. The approver still receives notifications when changes are executed.
  • For more complex fixers, workflows with multiple stages of approval, multiple approvers or a group of approvers are supported.
  • Approved changes can be set to execute at specific times.
  • Change Requests can be tracked and aggregated centrally for analytics and reporting. Much to our delight, Change Manager now logs execution events to CloudTrail and makes them visible in the Change Manager console, a feature we have long waited for.
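
The flow above could be driven from code roughly as follows. This sketch only builds the arguments for the Systems Manager StartChangeRequestExecution operation. It is an illustration, not CloudFix's code: the template name, the runbook name and the "VolumeIds" runbook parameter are all hypothetical.

```python
from datetime import datetime, timezone


def build_change_request(template_name, runbook_name, volume_ids,
                         scheduled_time=None, auto_approve=False):
    """Build keyword arguments for SSM StartChangeRequestExecution.

    template_name and runbook_name are hypothetical document names, and the
    "VolumeIds" runbook parameter is an assumption made for this example.
    """
    request = {
        "DocumentName": template_name,  # the already-approved Change Template
        "ChangeRequestName": f"gp2-to-gp3-{len(volume_ids)}-volumes",
        "AutoApprove": auto_approve,  # honored only if the template permits it
        "Runbooks": [
            {
                "DocumentName": runbook_name,  # Automation runbook that applies the fix
                "Parameters": {"VolumeIds": list(volume_ids)},
            }
        ],
    }
    if scheduled_time is not None:
        request["ScheduledTime"] = scheduled_time  # execute at a chosen time
    return request


if __name__ == "__main__":
    request = build_change_request(
        "Example-Gp2ToGp3-Template",   # hypothetical Change Template
        "Example-ConvertGp2ToGp3",     # hypothetical runbook
        ["vol-a", "vol-c"],
        scheduled_time=datetime(2030, 1, 1, 3, 0, tzinfo=timezone.utc),
    )
    print(request)
```

With boto3 installed and credentials that are limited to creating Change Requests, the dict could be passed as `boto3.client("ssm").start_change_request_execution(**request)`. The approver and the runbook's own execution role then carry the remaining permissions, which is the separation the flow above relies on.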

CloudFix: The Fixer You Needed All Along

By automating finders and fixers to identify and execute cost saving opportunities, CloudFix delivers meaningful cost savings to the whole account through a large number of small, targeted cost reductions.

Using AWS-native change management, automated fixing maintains the permissions boundary needed for both SMBs and Enterprises to deploy a tool on their AWS accounts. Full control and complete AWS cost optimization means long-term savings and cloud cost efficiency.

Find out more about CloudFix here.
