FinOps became a job title around 2021. By 2023 it was a department. By 2026 most of those departments are still reporting the same 3-5% savings they reported the first quarter, because they've been asked to optimize a bill that's fundamentally a design problem. Rightsizing an EC2 instance is not the answer. It's the rounding error.

The biggest cloud cost wins are architectural. They're decided by engineers at design time, not by finance reviewing cost-center dashboards. What follows is the framework I use — and the order matters.

Rightsizing vs. re-architecting

Rightsizing is table stakes. Every cloud provider has a native tool (AWS Compute Optimizer, Azure Advisor, GCP Recommender) that will tell you which instances are oversized. Apply the recommendations, enforce instance-family standards, autoscale aggressively on stateless tiers. You will save 10-20% on compute. This is the first week's work.
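To see why rightsizing tops out quickly, it helps to put numbers on it. The sketch below sums estimated monthly savings from a list of downsizing recommendations; the instance IDs and hourly rates are illustrative stand-ins, not real Compute Optimizer output.

```python
# Sketch: estimate monthly savings from rightsizing recommendations.
# The instances and hourly rates below are illustrative assumptions,
# not real Compute Optimizer output; plug in your own pricing data.

HOURS_PER_MONTH = 730

def rightsizing_savings(recommendations):
    """Sum estimated monthly savings across downsizing recommendations."""
    total = 0.0
    for rec in recommendations:
        delta = rec["current_hourly"] - rec["recommended_hourly"]
        if delta > 0:  # only count downsizes; ignore already-right-sized
            total += delta * HOURS_PER_MONTH
    return round(total, 2)

recs = [
    {"id": "i-01", "current_hourly": 0.384, "recommended_hourly": 0.192},
    {"id": "i-02", "current_hourly": 0.192, "recommended_hourly": 0.096},
    {"id": "i-03", "current_hourly": 0.096, "recommended_hourly": 0.096},
]
print(rightsizing_savings(recs))
```

A few hundred dollars a month per batch of instances is real money, but it is linear trimming: you save a rate delta, never a whole workload.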

Re-architecting is where the real money lives:

  • A chatty microservice topology crossing AZs on every request — colocate services that call each other frequently in the same AZ, or collapse them.
  • A batch job running on an m5.24xlarge reserved instance 24/7 to handle a 4-hour overnight workload — move it to Spot or Batch.
  • A managed database 4x oversized because you picked a tier once and never revisited it — evaluate Aurora Serverless v2 or equivalent.
  • Lambda cold starts driving a business decision to provision a warm fleet — which then costs more than the equivalent ECS service would.

Rightsizing is trimming. Re-architecting is stopping the bleeding.
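The second bullet above is worth working through. With assumed rates (these are illustrative numbers, not current AWS list prices), a 24/7 reserved m5.24xlarge versus the same four hours a night on Spot looks like this:

```python
# Back-of-envelope for the batch-job example: a 4-hour nightly job
# pinned to a 24/7 reserved m5.24xlarge vs. the same hours on Spot.
# Both hourly rates are illustrative assumptions, not AWS list prices.

HOURS_PER_MONTH = 730
RI_HOURLY = 2.90        # assumed 1-yr RI effective rate
SPOT_HOURLY = 1.40      # assumed average Spot rate, same family

ri_monthly = RI_HOURLY * HOURS_PER_MONTH    # you pay for every hour
spot_monthly = SPOT_HOURLY * 4 * 30         # 4 hours/night, 30 nights

print(f"RI 24/7:  ${ri_monthly:,.0f}/mo")
print(f"Spot 4h:  ${spot_monthly:,.0f}/mo")
print(f"Savings:  {1 - spot_monthly / ri_monthly:.0%}")
```

No rightsizing report finds a ~90% saving. Only looking at the shape of the workload does.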

Commitments: RIs, Savings Plans, Spot

The commitment hierarchy is well-understood but badly executed:

  • Savings Plans (compute) — 1-year, no upfront, 20-30% off list for flexible workloads. This should cover your steady-state baseline. Almost everyone under-commits here.
  • Reserved Instances — for databases, ElastiCache, Redshift, where commitments are instance-specific. Higher discounts but less flexibility. Match them to workloads that won't change shape for 12+ months.
  • Spot instances — 60-90% off for interruptible workloads. Should be the default for CI runners, async batch, stateless autoscaled tiers, Spark workers. The tooling (Karpenter on EKS, Spot.io) is good enough that "we can't risk interruptions" is rarely a valid excuse.

The thing nobody does: continuously re-evaluate. A Savings Plan from 2024 covering workloads that have since moved to Graviton is burning money.
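One way to avoid both under-committing and over-committing is to size the commitment from hourly spend history. The heuristic below (my assumption, not an AWS recommendation) commits near the trough of usage, so the plan runs at roughly full utilization; the spend series is made up for illustration.

```python
# Sketch: pick a Savings Plan commitment from hourly on-demand spend.
# Heuristic (an assumption, not an AWS recommendation): commit near the
# trough of usage so the commitment is ~100% utilized.

def suggest_commitment(hourly_spend, percentile=0.10):
    """Return the spend level exceeded in roughly (1 - percentile) of hours."""
    ordered = sorted(hourly_spend)
    idx = int(len(ordered) * percentile)
    return ordered[idx]

# Illustrative day of hourly compute spend ($/hr): quiet nights, busy days.
spend = [40, 38, 37, 39, 41, 45, 60, 85, 110, 120, 125, 130,
         128, 126, 122, 118, 105, 90, 75, 60, 52, 48, 44, 42]

print(suggest_commitment(spend))
```

Rerun the same calculation quarterly against fresh usage data; that is the re-evaluation step that keeps a 2024-era plan from covering workloads that no longer exist.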

The data transfer bill

Data transfer is the most under-examined line item in the entire cloud bill. It doesn't show up as one category — it's scattered across "EC2 Data Transfer," "S3 Data Transfer," "CloudFront," "Inter-AZ." Add them up and it's often 10-20% of a mature cloud bill.

An architecture that does not treat cross-AZ and cross-region traffic as expensive will get an expensive surprise on the invoice every month forever.
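The arithmetic that makes cross-AZ traffic a design-time concern is simple. At an assumed $0.01/GB in each direction ($0.02/GB combined; check your region's actual rates), even a modest chatty service adds up. All figures below are illustrative.

```python
# Why cross-AZ traffic deserves design-time attention. The $/GB rate
# and the traffic profile are illustrative assumptions.

CROSS_AZ_PER_GB = 0.02      # assumed in + out combined, $/GB

requests_per_sec = 1_000
payload_kb = 10             # request + response crossing an AZ, per call

gb_per_month = requests_per_sec * payload_kb * 86_400 * 30 / 1024 / 1024
print(f"{gb_per_month:,.0f} GB/mo -> ${gb_per_month * CROSS_AZ_PER_GB:,.0f}/mo")
```

That is roughly $500 a month for one moderately busy service pair, forever, for traffic that topology-aware routing could have kept inside one AZ.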

Specific places to look:

  • NAT Gateway egress — $0.045/GB processing plus per-hour cost. A chatty service talking to the internet through a NAT gateway can rack up thousands a month. VPC endpoints for AWS services eliminate most of this.
  • Cross-AZ replication in Kafka, Kubernetes, databases. Topology-aware routing (EKS) and rack-aware Kafka producers help.
  • Logs shipped to a central aggregation account across regions. Compress and batch.
  • S3 cross-region replication set up once for DR and never revisited — check if you still need it for every bucket.

Orphans, showback, and the cultural layer

Every cloud account I've audited has 5-15% in pure orphaned resources: unattached EBS volumes, old snapshots, load balancers pointing to nothing, stopped EC2 instances still paying for EBS, NAT gateways in dev VPCs that no one uses. Automated cleanup (AWS Config rules, custom Lambdas, tools like Cloud Custodian) pays for itself in a month.
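The unattached-volume sweep is the simplest of these to automate. The sketch below runs the filtering logic against a hand-written sample instead of a live `ec2.describe_volumes()` call; in boto3, unattached volumes report `State == "available"`. The gp3 price is an assumed figure.

```python
# Sketch of the orphan-sweep logic, run against a hand-written sample
# shaped like ec2.describe_volumes() output. Unattached volumes report
# State == "available". The $/GB-month rate is an assumption.

GB_MONTH_GP3 = 0.08   # assumed gp3 price, $/GB-month

def orphaned_volume_cost(volumes):
    orphans = [v for v in volumes if v["State"] == "available"]
    monthly = sum(v["Size"] * GB_MONTH_GP3 for v in orphans)
    return [v["VolumeId"] for v in orphans], round(monthly, 2)

sample = [
    {"VolumeId": "vol-aaa", "State": "in-use",    "Size": 100},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500},   # orphan
    {"VolumeId": "vol-ccc", "State": "available", "Size": 1000},  # orphan
]
ids, cost = orphaned_volume_cost(sample)
print(ids, f"${cost}/mo")
```

Wire the same logic into a scheduled Lambda or a Cloud Custodian policy and the cleanup runs itself.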

The cultural layer is where FinOps programs either work or fail:

  • Showback: every team sees what it spends, weekly, via Slack or email. No chargeback, just visibility. Visibility alone typically drives a 5-10% reduction through changed behavior.
  • Chargeback: when cost shows up in the team's own P&L. More powerful, harder to implement fairly, and requires a tagging discipline most orgs don't have.
  • Unit economics: cost per customer, per order, per API call. This is what actually matters — absolute cloud cost is vanity, cost per unit of business output is strategy.
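The unit-economics point deserves a number. In the made-up example below, the absolute bill rises 25% while the business actually gets cheaper to run per order, which is the only version of the story worth reporting.

```python
# Unit economics in one calculation: track cost per unit of business
# output, not the absolute bill. Figures are made up for illustration.

months = [
    {"cloud_cost": 80_000,  "orders": 1_000_000},
    {"cloud_cost": 100_000, "orders": 1_600_000},  # bill up 25%, but...
]

for m in months:
    m["cost_per_order"] = m["cloud_cost"] / m["orders"]

print([round(m["cost_per_order"], 4) for m in months])  # 0.08 -> 0.0625
```

A dashboard that only shows the first column says "spend is up"; one that shows the second says "efficiency improved 22%."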

When cloud is actually more expensive than on-prem

There's a cost crossover that nobody in cloud marketing will acknowledge. Sustained, predictable, high-throughput compute — the kind that runs 24/7 at 80%+ utilization — is often 2-4x more expensive in the cloud than colocated hardware amortized over 4 years. Storage at petabyte scale, same story. If you're running an HPC workload, a video encoding farm, or a petabyte analytics platform with steady utilization, do the math honestly. Companies that moved such workloads back (37signals recently, Dropbox years before them) weren't wrong.
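Doing that math honestly can fit in a few lines. Every number below is a deliberately rough assumption (cloud rate, server capex, colo opex); substitute your own quotes before drawing conclusions.

```python
# Crossover math for a steady 24/7 fleet: cloud vs. colocated hardware
# amortized over 4 years. All rates are rough illustrative assumptions.

HOURS_PER_MONTH = 730
nodes = 50

cloud_node_hourly = 1.50        # assumed effective cloud rate, $/hr
colo_node_capex = 12_000        # assumed server cost, amortized over 48 mo
colo_node_opex_monthly = 120    # assumed power, space, remote hands / node

cloud_monthly = nodes * cloud_node_hourly * HOURS_PER_MONTH
colo_monthly = nodes * (colo_node_capex / 48 + colo_node_opex_monthly)

print(f"cloud ${cloud_monthly:,.0f}/mo vs colo ${colo_monthly:,.0f}/mo "
      f"({cloud_monthly / colo_monthly:.1f}x)")
```

With these assumptions the cloud fleet lands near 3x the colo cost, squarely in the 2-4x range; the model ignores migration cost and staffing, which is exactly why the exercise has to be done with your own numbers.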

The takeaway: If your FinOps program is mostly rightsizing reports and RI purchases, you're harvesting the easy 15% and missing the structural 40%. Design cost in — instance families, regions, transfer topology, commitment strategy — and treat orphans and cross-AZ traffic as bugs. And be honest about the workloads where cloud isn't the right answer in the first place.