Why Your SRE Team Isn't Scaling

Here's the pattern I see in every struggling SRE org: a team of eight is drowning. Leadership approves five more headcount. A year later the team of thirteen is drowning harder, the backlog is bigger, and the best two engineers are updating their LinkedIn. The headcount didn't fix it because headcount was never the lever.

SRE, as Google originally defined it, is software engineering applied to operations. The goal is not to run services; it's to write the systems that let services run themselves. Most SRE teams I see today have drifted back into ops — paged, reactive, tribal. Adding people to that scales the toil, not the solution.

Error budgets are a contract, not a dashboard

The SRE book is clear: a 99.9% availability SLO gives you about 43 minutes of "allowed" downtime in a 30-day month. When you blow through it, you stop shipping risky changes and focus on reliability. That's the deal.
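The arithmetic behind that number is worth having on hand when you negotiate an SLO. Here is a minimal sketch, assuming a time-based availability SLO and a 30-day window; `error_budget_minutes` is just an illustrative name.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO allows over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The budget is whatever share of the window the SLO does not promise.
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo}: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the step from 99.9% to 99.99% is a product decision and not only an engineering one.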

What I see instead: error budgets published on a Grafana dashboard nobody looks at, while product ships anyway because the roadmap said so. If you aren't willing to enforce the budget, you don't have an SLO. You have a wall decoration.

The product VP has to co-sign the SLO. Not the SRE lead alone.
Blowing the budget has to cost something: a feature freeze, a mandatory reliability sprint, a real change to the next PI (program increment) plan.
SLOs should be user-journey-based, not system-component-based. "Checkout succeeds in under 2s at p95" beats "checkout-service CPU under 80%" every time.
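
One way to turn the budget into a contract instead of a chart is to have the deploy pipeline consult it before a risky change goes out. What follows is a rough sketch under my own assumptions, not anything prescribed by the SRE book: `BudgetStatus` and `budget_status` are hypothetical names, and it assumes you already count good and bad user-journey events (say, checkouts that did or did not succeed in under 2s) over the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: float         # bad events the SLO permits in this window
    actual_bad: int            # bad events observed so far
    remaining_fraction: float  # 1.0 = untouched, 0.0 or less = exhausted

    @property
    def freeze(self) -> bool:
        """True when the budget is spent and risky changes should stop."""
        return self.remaining_fraction <= 0.0


def budget_status(slo: float, total_events: int, bad_events: int) -> BudgetStatus:
    """Compare observed bad events with what an event-based SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed = (1.0 - slo) * total_events
    remaining = 1.0 - bad_events / allowed
    return BudgetStatus(allowed, bad_events, remaining)


if __name__ == "__main__":
    # Made-up numbers: 2,000,000 checkouts this window, 2,600 slow or failed.
    status = budget_status(slo=0.999, total_events=2_000_000, bad_events=2_600)
    print("freeze risky deploys" if status.freeze else "ship as planned")
```

The code is the easy part. The point of the co-signed SLO is that nobody gets to override `freeze` because the roadmap said so.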

The toil inventory is the diagnostic

Google's rule of thumb: no SRE should spend more than 50% of their time on toil. Most teams I audit are at 75-85%. Nobody knows because nobody measures.

Run this exercise: for two weeks, every SRE tags every ticket, page, and interrupt with one of four categories: toil, project, oncall, or meeting. At the end you have a real number. The sources of toil that surface are almost always:

Manual cert rotations. Capacity requests handled by ticket. DB failovers that should be automatic. Access requests. Redeploys because something flaked. Runbook execution that should be code.

Every item on that list is a project waiting to be scoped. The team that does those projects stops drowning. The team that adds headcount to work the same list by hand keeps drowning.
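
Getting the "real number" out of the two-week exercise takes very little tooling. This is a minimal sketch, assuming each tagged entry also records roughly how many minutes it took (the minutes are my addition; tags alone only let you count items). `time_share` and the sample log are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# The four tags from the exercise.
CATEGORIES = ("toil", "project", "oncall", "meeting")


def time_share(entries: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return each category's fraction of total logged minutes."""
    minutes: Counter = Counter()
    for category, spent in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        minutes[category] += spent
    total = sum(minutes.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: minutes[c] / total for c in CATEGORIES}


if __name__ == "__main__":
    # Made-up log entries: (category, minutes spent).
    log = [("toil", 240), ("toil", 60), ("project", 60),
           ("oncall", 90), ("meeting", 150)]
    for category, share in time_share(log).items():
        print(f"{category:8s} {share:.0%}")
```

If the toil line comes out above 50%, you have the number to bring to leadership instead of a headcount request.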

Platform engineering is the lever

Somewhere between 2020 and 2023, "platform engineering" stopped being a rebrand and started being the actual answer. A platform team builds paved-road abstractions — internal developer platforms, golden paths — so product teams don't file tickets for routine infrastructure changes.

Concrete examples that work:

A service template (Backstage, or just a Cookiecutter repo) that ships with logging, metrics, tracing, CI, deploy pipeline, and health checks wired up.
Self-service environment provisioning via Crossplane or Terraform modules with guardrails — not via a Jira queue.
Automated cert management via cert-manager and external-dns — not via an annual ceremony.
Database provisioning through an operator with pre-approved tiers and automated backups.
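
The guardrails in that list are mostly plain validation that runs before anything is provisioned. As an illustration only (it does not use the Crossplane or Terraform APIs, and the tier names, limits, and `DatabaseRequest` shape are all hypothetical), a self-service database request could be checked like this:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical pre-approved tiers; real names and limits come from your platform team.
APPROVED_TIERS = {
    "small": {"max_storage_gb": 50},
    "medium": {"max_storage_gb": 500},
    "large": {"max_storage_gb": 2000},
}


@dataclass(frozen=True)
class DatabaseRequest:
    team: str
    tier: str
    storage_gb: int
    backups_enabled: bool = True


def validate(request: DatabaseRequest) -> List[str]:
    """Return guardrail violations; an empty list means the request can be auto-approved."""
    problems: List[str] = []
    tier = APPROVED_TIERS.get(request.tier)
    if tier is None:
        problems.append(f"unknown tier {request.tier!r}")
    elif request.storage_gb > tier["max_storage_gb"]:
        problems.append(
            f"{request.storage_gb} GB exceeds the {request.tier} tier limit "
            f"of {tier['max_storage_gb']} GB"
        )
    if request.storage_gb <= 0:
        problems.append("storage_gb must be positive")
    if not request.backups_enabled:
        problems.append("automated backups cannot be disabled")
    return problems


if __name__ == "__main__":
    print(validate(DatabaseRequest(team="checkout", tier="small", storage_gb=80)))
```

Requests that come back clean go straight to the provisioner. Only the ones with violations ever need a human, which is the opposite of a Jira queue.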

If product teams can't spin up a new service without talking to your SRE team, the SRE team is a bottleneck. The bottleneck doesn't scale by hiring; it scales by eliminating itself.

On-call economics and the "ops with fancier titles" trap

A 24/7 on-call rotation with humane wake-ups needs at least 6-8 people in the rotation. Fewer than that, and you're burning engineers out. The math is unforgiving: the smaller the team, the more sleep they lose, the faster they leave, the smaller the team gets.
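
To see why the math is unforgiving, here is a back-of-the-envelope sketch. It assumes week-long shifts and a single primary on call at a time, which are my simplifications; `weeks_on_call_per_year` is an illustrative name.

```python
WEEKS_PER_YEAR = 52


def weeks_on_call_per_year(rotation_size: int, people_per_shift: int = 1) -> float:
    """Average weeks per year each engineer carries the pager."""
    if people_per_shift < 1 or rotation_size < people_per_shift:
        raise ValueError("rotation must have at least as many people as each shift needs")
    return WEEKS_PER_YEAR * people_per_shift / rotation_size


if __name__ == "__main__":
    # Walk the rotation down from eight people to three to watch the load climb.
    for size in range(8, 2, -1):
        load = weeks_on_call_per_year(size)
        print(f"{size} people: {load:.1f} weeks on call each per year")
```

Every departure pushes the load up for everyone who stays, and the jump gets bigger as the rotation shrinks. That is the spiral in numbers.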

The worst failure mode is what I call "ops with fancier titles": you hired SREs, gave them the title, then assigned them the ops backlog — capacity, patches, tickets, keeping the lights on. They don't write software. They don't reduce toil. They are sysadmins with better salaries and worse morale.

The test: if two engineers left your SRE team next quarter, would anything still be measurably better 12 months from now because of code they wrote? If the answer is no, you don't have an SRE team.

The takeaway: Measure toil, enforce error budgets, and invest in platform abstractions before you invest in headcount. An SRE team that isn't shipping software that eliminates its own work isn't SRE. It's ops with a rebrand, and it will not scale.