Every vendor deck in my inbox over the last year has included an ROI slide. The slide is almost always wrong. Not fabricated. Wrong in a specific way: it attributes a productivity delta to the tool, and it assumes that productivity delta translates to dollars. Neither half of that sentence survives contact with the actual finance team.

Here is what the real economics look like after two years of running this.

Per-token costs are the visible part

Claude Sonnet runs in the low single digits of dollars per million input tokens and a few times that for output. GPT-4o is in the same neighborhood. A self-hosted Llama, amortized over the GPU cluster, is roughly an order of magnitude cheaper per token at scale. These numbers have fallen by about 10x in two years and will keep falling.

For the workloads we run, token spend is maybe 15 percent of the total cost of the AI program, which means optimizing it is a real but secondary game.
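A back-of-envelope version of that split. Every number here is an illustrative assumption, not a quoted rate or our actual volume:

```python
# Token spend as a share of total program cost. All figures hypothetical.
PRICE_PER_M_INPUT = 3.00    # $/1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $/1M output tokens (assumed)

monthly_input_tokens = 2_000e6  # 2B input tokens/month (assumed workload)
monthly_output_tokens = 200e6   # 200M output tokens/month (assumed)

token_spend = (monthly_input_tokens / 1e6) * PRICE_PER_M_INPUT \
            + (monthly_output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# If token spend is ~15% of the program, the rest is people and plumbing.
total_program_cost = token_spend / 0.15

print(f"token spend:   ${token_spend:,.0f}/mo")    # $9,000/mo
print(f"program total: ${total_program_cost:,.0f}/mo")  # $60,000/mo
```

At those assumed volumes, a 50 percent cut in token prices moves the total program cost by about 7 percent. That is the sense in which it is a secondary game.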

The invisible costs are larger

What actually adds up:

  • Prompt engineering maintenance. Every prompt is a small program. It has bugs. It regresses when the model version changes. Someone has to own it. Call this 0.2 to 0.5 FTE per meaningful integration.
  • Evaluation infrastructure. If you cannot tell whether your system got better or worse after a change, you will not ship changes confidently. Build or buy evals. Budget engineering time accordingly.
  • Data plumbing. The RAG corpus, the retrieval layer, the metadata, the freshness jobs. This is where the actual value lives and where 60 percent of the engineering effort goes.
  • Integration into workflows. The AI part is a week. The integration into ServiceNow, Slack, Jira, and however users actually work is three months.
  • Governance and audit. Security review, legal review, data use documentation, model cards, change logs. Not glamorous. Non-negotiable in regulated environments.
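A rough tally of those line items in FTE terms. Only the 0.2 to 0.5 prompt-maintenance range comes from the list above; the other FTE counts and the loaded cost are placeholders for illustration:

```python
# Invisible costs per meaningful integration, in FTEs. Hypothetical figures.
LOADED_FTE_COST = 200_000  # $/yr fully loaded engineer (assumed)

invisible_ftes = {
    "prompt_maintenance": 0.35,    # midpoint of the 0.2-0.5 range above
    "eval_infrastructure": 0.5,    # assumed
    "data_plumbing": 1.5,          # assumed; the biggest line item
    "workflow_integration": 0.75,  # assumed
    "governance_audit": 0.25,      # assumed
}

annual_invisible_cost = sum(invisible_ftes.values()) * LOADED_FTE_COST
print(f"invisible cost: ${annual_invisible_cost:,.0f}/yr per integration")
```

Even with generous haircuts on those placeholders, the people cost dwarfs a typical token bill, which is the point of the 15 percent figure above.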

Context window economics

Long context windows are a trap if you do not think about them. A two-million-token prompt to Gemini costs real money and takes real time. Most of the time you do not need it. Better retrieval narrows the context to what is actually relevant, and narrower contexts generally produce better answers anyway. "Just stuff everything in" is a non-solution that gets expensive fast.
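The arithmetic, with an assumed input price and query volume:

```python
# "Stuff everything in" vs. retrieval-narrowed context. Price is assumed.
PRICE_PER_M_INPUT = 2.50  # $/1M input tokens (hypothetical)

def daily_query_cost(context_tokens: int, queries_per_day: int) -> float:
    """Daily input-token cost for a given context size."""
    return context_tokens / 1e6 * PRICE_PER_M_INPUT * queries_per_day

full_dump = daily_query_cost(context_tokens=2_000_000, queries_per_day=1_000)
retrieved = daily_query_cost(context_tokens=8_000, queries_per_day=1_000)

print(f"2M-token prompts: ${full_dump:,.0f}/day")  # $5,000/day
print(f"8k retrieved:     ${retrieved:,.0f}/day")  # $20/day
```

A 250x cost gap per query, before counting latency, and the narrower context usually answers better too.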

Cached vs. live inference

Most providers now offer prompt caching. If your system prompt is 2,000 tokens and you send it with every query, caching that portion can cut costs by 50 to 90 percent on the cached section. For us, moving our RAG system to use prompt caching for the stable instruction portion cut total inference spend by roughly 40 percent with zero quality impact. If you are not using it and your provider supports it, you are paying a tax for no reason.
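The savings math, under assumed rates. The cached-read discount varies by provider; 10 percent of the live price is a placeholder:

```python
# Prompt-caching savings sketch. Rates and volumes are hypothetical.
PRICE_LIVE = 3.00      # $/1M input tokens, uncached (assumed)
CACHE_DISCOUNT = 0.10  # cached reads billed at 10% of live (assumed)

system_prompt_tokens = 2_000  # stable instruction portion, sent every query
user_tokens = 500             # variable portion (assumed)
queries = 1_000_000           # per billing period (assumed)

def spend(cache_system_prompt: bool) -> float:
    sys_rate = PRICE_LIVE * (CACHE_DISCOUNT if cache_system_prompt else 1.0)
    sys_cost = system_prompt_tokens / 1e6 * sys_rate * queries
    user_cost = user_tokens / 1e6 * PRICE_LIVE * queries
    return sys_cost + user_cost

before, after = spend(False), spend(True)
print(f"uncached: ${before:,.0f}  cached: ${after:,.0f}  "
      f"savings: {1 - after / before:.0%}")
```

The larger and more stable the shared prefix relative to the per-query portion, the closer the total savings get to the cached-section discount itself.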

Automation versus augmentation

The ROI slides always assume automation: the AI does a task, a human does not, savings equal the human's cost. In practice, most deployments are augmentation: the human still does the task, just faster or with fewer errors. The economics of augmentation are real but indirect. A 20 percent productivity lift on a ten-person team does not mean you can fire two people. It means you can absorb 20 percent more work, or the same team can take on higher-value work, or quality improves. The CFO needs to understand this because the default financial model is wrong.
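The two models side by side, with made-up numbers. Note that the headline figure can be identical while meaning very different things:

```python
# Automation model vs. augmentation reality. All figures illustrative.
team_size = 10
loaded_cost_per_person = 150_000  # $/yr (assumed)
productivity_lift = 0.20

# Vendor-slide automation model: lift * headcount = people you can cut.
naive_savings = productivity_lift * team_size * loaded_cost_per_person

# Augmentation reality: same headcount, 20% more capacity. The value is
# avoided future hiring, and it only materializes if demand grows.
avoided_hires = productivity_lift * team_size
avoided_hiring_cost = avoided_hires * loaded_cost_per_person

print(f"naive 'savings': ${naive_savings:,.0f}/yr")
print(f"avoided hiring:  ${avoided_hiring_cost:,.0f}/yr (contingent on demand)")
```

Same $300k on both lines, but one is a cost you can cut today and the other is a cost you avoid later, conditional on growth. A CFO prices those very differently.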

Productivity versus throughput

This is the subtle one. You can measure a developer generating code 40 percent faster with Copilot. That is a productivity gain. Whether the team ships 40 percent more features depends on code review capacity, testing capacity, deployment capacity, and meetings. Often the bottleneck is not code generation. In those cases, the productivity gain exists and the throughput gain is negligible. That is not a failure. It just means the ROI story is more subtle than "multiply the hourly rate."
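Throughput is a min over pipeline stages, which a few lines make concrete. All capacities are hypothetical:

```python
# Team throughput is set by the narrowest stage, not by code generation.
# Capacities in features/week, all assumed for illustration.
capacity = {
    "code_generation": 14.0,  # after a 40% Copilot lift (was 10.0)
    "code_review": 10.0,
    "testing": 11.0,
    "deployment": 12.0,
}

throughput = min(capacity.values())
bottleneck = min(capacity, key=capacity.get)
print(f"throughput: {throughput} features/week (bottleneck: {bottleneck})")
# Generation got 40% faster; shipped features did not move, because
# review was already the constraint.
```

Until review, testing, or deployment capacity moves, the generation speedup shows up as slack, not shipped features.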

How I budget

Roughly a third for inference spend. A third for platform engineering. A third for integration, governance, and training. And I assume the first year is mostly learning. The economic case gets real in year two when the platform is stable and you can roll out the same patterns across additional workflows cheaply. That compounding is the actual ROI. Skip the vendor's spreadsheet.