$67,949 spent on AWS in November - A full breakdown of ConvertKit's AWS bill

Kris Hamoud

Overview

We spent $67,949.83 on AWS in November. This is flat (0% change) compared to October and is 4.2% of our November MRR. November was a pretty boring month from a billing perspective. We added a few more on-demand instances, but because the month was shorter than October, we spent about the same on EC2. The most noticeable increase comes from our Elastic Load Balancing (EC2-ELB) bill. We added a couple of new load balancers to our infrastructure, and we had a spike in traffic on Cyber Monday. Together, the extra traffic and the new load balancers pushed the ELB bill up.

High-level breakdown:

  1. EC2-Instances - $26,482.83 (+1%)
  2. Relational Database Service - $19,616.17 (-3%)
  3. S3 - $7,163.48 (-4%)
  4. EC2-Other - $5,876.29 (+6%)
  5. Support - $4,383.84 (0%)
  6. EC2-ELB - $1,718.38 (+18%)
  7. CloudWatch - $850.23 (+9%)
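
To put the list above in proportion, here is a minimal Python sketch that computes each category's share of the total and the remainder not covered by the seven listed categories. The dollar figures are copied from this post; the names `CATEGORIES` and `share_of_total` are only illustrative.

```python
# Figures copied from the high-level breakdown above (USD, November).
TOTAL = 67_949.83
CATEGORIES = {
    "EC2-Instances": 26_482.83,
    "Relational Database Service": 19_616.17,
    "S3": 7_163.48,
    "EC2-Other": 5_876.29,
    "Support": 4_383.84,
    "EC2-ELB": 1_718.38,
    "CloudWatch": 850.23,
}


def share_of_total(amount, total=TOTAL):
    """Return amount as a percentage of the total bill, rounded to 0.1."""
    return round(100 * amount / total, 1)


listed = round(sum(CATEGORIES.values()), 2)
unlisted = round(TOTAL - listed, 2)  # services too small to make the list

for name, amount in CATEGORIES.items():
    print(f"{name:<28} ${amount:>10,.2f}  {share_of_total(amount):>5}%")
print(f"{'Everything else':<28} ${unlisted:>10,.2f}  {share_of_total(unlisted):>5}%")
```

The seven categories account for most of the bill; the rest is spread across smaller services that are not broken out in this post.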

EC2-Instances - $26,482.83 (+1%)

Our EC2 bill increased despite November being shorter than October. We had to expand our Elastic Stack storage by increasing our number of reserved i3.2xlarge instances. After doing so, we learned that it is still not enough to handle our current log volume. In the upcoming months, we'll need to find a more storage-optimized instance type for our Elastic Stack if we want to scale our logging infrastructure and stay cost-efficient. We ingest between 300GB and 400GB of logs per day. The i3.2xlarge is more than capable of handling the IO, but we'll need much more storage if we want to maintain at least a 30-day retention window. We're looking at the i3en.2xlarge instance type. It belongs to a similar high-IO instance family, but it also comes with 2 x 2,500GB (5TB total) of SSD storage that we can run in RAID 0. The individual instances are more expensive than the i3.2xlarge, but we would need fewer of them to store all the logs we want.
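
The storage argument can be sketched as back-of-the-envelope arithmetic. The daily log volume, retention window, and i3en.2xlarge disk size come from this post; the 1,900GB figure for the i3.2xlarge is AWS's published spec rather than something stated here. The sketch counts raw disk only and assumes no replicas, no index overhead, and no free-space headroom, so real clusters would need more.

```python
import math

# From the post: daily log volume range and the retention target.
DAILY_LOGS_GB = (300, 400)
RETENTION_DAYS = 30

# Local NVMe SSD per instance, in GB. The i3en.2xlarge figure is from the
# post; the i3.2xlarge figure (1 x 1900 GB) is an assumption taken from
# AWS's published instance specs.
LOCAL_STORAGE_GB = {"i3.2xlarge": 1_900, "i3en.2xlarge": 2 * 2_500}


def instances_needed(daily_gb, storage_gb, days=RETENTION_DAYS):
    """Minimum instances whose combined raw disk holds `days` of logs."""
    return math.ceil(daily_gb * days / storage_gb)


for instance_type, storage_gb in LOCAL_STORAGE_GB.items():
    low, high = (instances_needed(d, storage_gb) for d in DAILY_LOGS_GB)
    print(f"{instance_type}: {low} to {high} instances for {RETENTION_DAYS} days")
```

Under these simplified assumptions the i3en.2xlarge needs well under half as many instances for the same retention window, which is the trade-off described above.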

Our c5 usage also increased this month. We use the compute-optimized instances for compute-heavy workloads, specifically email sending. Black Friday and Cyber Monday are very high volume days for us, so we had to scale out the number of email sending workers in our fleet to handle the volume.

We have also put a lot of work into optimizing CPU consumption. Many instances were sitting idle for most of the day, only to perform minimal amounts of work at random times. We expect to see wins in December because we've increased CPU utilization across the fleet by consolidating our Sidekiq workers to remove dead weight. This has led to a decrease in instance count and an increase in fleet-wide CPU utilization. We also made sure that these changes had no effect on latency: we're getting more performance out of fewer machines at no cost in latency.

Here are a couple of graphs to illustrate the changes.

Below is the CPU use of one worker group over time since we started consolidating jobs.

[Figure: Worker CPU Over Time]

Below is the average CPU consumption across our production account before and after we made the changes.

[Figure: CPU Utilization Before]

[Figure: CPU Utilization After]

Here are our web instance request counts for the past month.

Service breakdown

  1. USE2-HeavyUsage:i3.2xlarge - $7,768.81 (+2%)
    • These are our reserved Cassandra and Elasticsearch clusters.
    • We use Cassandra to store massive amounts of data.
    • We use Elasticsearch to search through massive amounts of data and to store our logs.
    • There is a 2% increase despite November being shorter than October.
    • The increase comes from additional reserved instances we purchased for our Elastic Stack.
    • We need a more storage-optimized instance type to handle our log volume, or this bill will spiral out of control.
  2. USE2-BoxUsage:c5.2xlarge - $3,759.26 (+4%)
    • These are on-demand instances.
    • We did not add any of these instances, but our reservations were being applied to smaller c5 instance types, so more of these hours were billed on demand.
  3. HeavyUsage:i3.2xlarge - $2,773.44 (-3%)
    • These are reserved Cassandra instances.
    • We use them for our secondary Cassandra cluster.
    • The price is lower because November is shorter than October.
  4. USE2-HeavyUsage:c5.2xlarge - $1,866.24 (-3%)
    • These are reserved instances.
    • We use these for our web servers.
    • We'll pay this much until Q3 2020.
  5. USE2-DataTransfer-Out-Bytes - $1,691.23 (+2%)
    • This is the cost of our services sending data out to the internet.
    • Black Friday and Cyber Monday are high-volume days for us, which is why we see the increase here.
    • We'll likely see some billing wins here in the future because we migrated away from our old logging provider.
  6. USE2-BoxUsage:t3.medium - $1,317.64 (-8%)
    • These are on-demand instances.
    • We use these instances for everything from email tracking to Elasticsearch indexing.
    • Despite their high usage, we moved a lot of instance types around in December, so we should see this bill decrease.
      [Figure: t3.medium CPU Usage]
  7. USE2-BoxUsage:c5.xlarge - $1,721.52 (+75%)
    • These are on-demand instances.
    • We use these instances for compute-heavy services such as email sending and event tracking.
    • The increase comes from our increased traffic volume on Black Friday.
    • These instances are heavily used, but we can probably get more performance out of them.
      [Figure: c5.xlarge CPU Usage]
  8. USE2-BoxUsage:t3.xlarge - $1,318.25 (-3%)
    • These are on-demand instances.
    • We use these instances for a variety of jobs that rely on burstable CPU.
    • We got pretty good use out of these instances, but in December we scaled in quite a lot of them, so this bill will be cheaper in the future.
    • The instances that do remain can probably be downsized to t3.large or smaller.
      [Figure: t3.xlarge CPU Usage]
  9. USE2-BoxUsage:t3.large - $903.96 (-14%)
    • These are on-demand instances.
    • We use these for a variety of different jobs.
    • Because these instances didn't get very good usage, we scaled them in, and now we only use them for a small variety of work around our infrastructure.
      [Figure: t3.large CPU Usage]

Relational Database Service - $19,616.17 (-3%)

November was a pretty boring month for our RDS bill. We increased our storage a week before Black Friday and Cyber Monday, but that was the only change. The rest of the line items in this bill will only be different because November is shorter than October.
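
The "shorter month" effect is easy to quantify. Here is a minimal sketch, assuming a charge that accrues at a fixed hourly rate all month (as a reserved instance does). The year 2019 is inferred from dates mentioned elsewhere in this post, and the function name is only illustrative.

```python
import calendar


def day_count_change(year, month, prev_year, prev_month):
    """Percent change in a fixed hourly charge caused only by month length."""
    days = calendar.monthrange(year, month)[1]
    prev_days = calendar.monthrange(prev_year, prev_month)[1]
    return 100 * (days / prev_days - 1)


# November (30 days) compared with October (31 days).
change = day_count_change(2019, 11, 2019, 10)
print(f"November vs October: {change:.1f}%")
```

Going from 31 days to 30 is a drop of roughly 3%, which lines up with the -3% shown on the reserved line items in this post.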

Service breakdown

  1. USE2-HeavyUsage:db.r5.12xl - $4,790.02 (-3%)
    • This instance is reserved.
    • This is our master MySQL database.
    • We'll continue to pay this much until Q3 2020.
  2. RDS:ChargedBackupUsage - $3,512.51 (-3%)
    • These are our disaster recovery backups.
    • We take additional backups and send them to a different region in case of emergencies.
  3. USE2-InstanceUsage:db.r4.8xlarge - $2,764.80 (-3%)
    • This is an on-demand instance.
    • This replica is being kept around because we will need it to maintain a healthy application until the end of 2019.
    • At that time, we can consider getting rid of it or downsizing and reserving it and using it for other purposes within the company.
  4. USE2-RDS:ChargedBackupUsage - $2,517.29 (-6%)
    • These are our normal backups.
    • These have been functioning properly forever.
  5. USE2-RDS:Multi-AZ-GP2-Storage - $1,822.14 (+4%)
    • These are daily charges.
    • The cost increased because we added storage in November in preparation for Black Friday and Cyber Monday.
      [Figure: RDS Multi AZ Storage]
  6. USE2-HeavyUsage:db.r4.8xlarge - $1,596.67 (-3%)
    • This instance is reserved.
    • This is our MySQL replica.
    • We'll continue to pay this much until Q3 2020.
  7. USE2-RDS:GP2-Storage - $1,326.46 (+5%)
    • This is the cost of our storage.
    • It increased because we added storage to prepare for Black Friday and Cyber Monday.
      [Figure: RDS gp2 Storage]

S3 - $7,163.48 (-4%)

S3 continues to be an exciting bill to watch. Since we migrated much of our data transfer to Cloudflare last month, this bill has continued to decrease. Part of the drop is simply November being shorter than October, but our daily spend is also trending down steadily.

Service breakdown

  1. DataTransfer-Out-Bytes - $2,113.34 (-23%)
    • We saw more wins since we migrated our links behind Cloudflare.
    • We should see more wins as we migrate the rest of our legacy links behind Cloudflare.
      [Figure: S3 Data Transfer]
  2. USE2-DataTransfer-Out-Bytes - $1,565.55 (0%)
    • We're probably near the bottom of where this bill can go.
    • The cost was flat and our daily spend was flat.
      [Figure: S3 Data Transfer]
  3. USE2-TimedStorage-ByteHrs - $1,621.54 (+11%)
    • This is the steady growth of our backups coming from our Cassandra and Elasticsearch clusters.
    • This will continue to grow as the amount of data we store grows.
      [Figure: S3 Data Transfer]
  4. TimedStorage-ByteHrs - $1,124.56 (+13%)
    • This increase comes from our secondary Cassandra cluster.
    • It will continue to grow as the amount of data we store grows.
      [Figure: Cassandra Backups]

EC2-Other - $5,876.29 (+6%)

Service breakdown

  1. USE2-DataTransfer-Regional-Bytes - $1,991.94 (+30%)
    • This is the cost of replicating data in our data stores across AWS regions.
    • The bill increased because we had a couple of really high volume email sending days, including Black Friday.
  2. USE2-NatGateway-Bytes - $1,568.79 (-11%)
    • We use a NAT gateway for our services to communicate with the internet.
    • We can expect this cost to drop in the future by a small amount because we got rid of our old logging provider.
  3. USE2-EBS:VolumeUsage.gp2 - $1,367.95 (+1%)
    • This is the cost of having gp2 disks connected to our instances.
    • It increased by 1% just like our EC2 bill did.

Support - $4,383.84 (0%)

  1. 7% of monthly AWS usage from $10K-$80K - $3,014.62 (+3%)
    • This is the cost of only our production account.
  2. 10% of monthly AWS usage for the first $0-$10K - $1,369.22 (-8%)
    • This is the cost of our production account and billing account.
    • We could save money by turning off support for our billing account.
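
The two support line items follow a tiered percentage of usage, which can be sketched as below. The tier boundaries and rates are taken from the line names above; usage above $80K is deliberately not modeled because no such tier appears on this bill. The per-account usage figures are back-calculated from the charges and are inferences, not numbers from the bill itself.

```python
# Tier boundaries and rates as they appear on the two bill lines above.
# Usage above $80K is not modeled because the bill shows no such tier.
TIERS = [
    (0, 10_000, 0.10),
    (10_000, 80_000, 0.07),
]


def support_cost(monthly_usage):
    """Support charge for one account's monthly AWS usage (USD)."""
    if monthly_usage > TIERS[-1][1]:
        raise ValueError("usage above $80K is outside the tiers shown on the bill")
    cost = 0.0
    for lower, upper, rate in TIERS:
        if monthly_usage > lower:
            cost += (min(monthly_usage, upper) - lower) * rate
    return round(cost, 2)


# Inferred usage: the 7% line implies production usage beyond its first $10K,
# and whatever is left of the 10% line is attributed to the billing account.
production_usage = 10_000 + 3_014.62 / 0.07
billing_usage = (1_369.22 - 1_000.00) / 0.10

total = round(support_cost(production_usage) + support_cost(billing_usage), 2)
print(f"production: ${support_cost(production_usage):,.2f}")
print(f"billing:    ${support_cost(billing_usage):,.2f}")
print(f"total:      ${total:,.2f}")
```

If that split is right, the billing account's share of the 10% line is the part that turning off its support would save.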

EC2-ELB - $1,718.38 (+18%)

These are the load balancers that sit in front of our application. They distribute requests across our servers and run health checks to ensure that requests are routed to healthy servers. The majority of this cost comes from data transfer. The more popular our application gets, the more we'll have to pay in load balancer costs. We can decrease this cost in the future as we replace some of our internal load balancers with services in Kubernetes.

CloudWatch - $850.23 (+9%)

We saw an increase in CloudWatch events. As our application gets more popular and more people use the service, this will keep growing. We haven't updated our CloudWatch alarms in a long time, and our autoscaling policies are out of date as well. Once we fix these issues, we should see our CloudWatch bill stabilize and stop increasing.

[Figure: CloudWatch Costs]

Conclusion

It's exciting that nothing exciting happened to our AWS bill. The less exciting our bill is, the more predictable it becomes. Because the bill was so stable in November, we purchased an AWS Savings Plan in December. The Savings Plan will lead to a cheaper EC2 bill for all of 2020. We've gotten to the point where the smaller line items on our bill are more noticeable, which means we've consolidated a lot of our infrastructure. We can see small bills increasing before they become large bills and can work to optimize them before they become problematic.