Every few months a new model tops the public leaderboards, and the question lands in my inbox: should we switch? The honest answer is almost always "probably not, and anyway you are asking the wrong question." The model you pick for an enterprise IT workload has almost nothing to do with the number at the top of the leaderboard and almost everything to do with constraints the leaderboard does not measure.

Start with the constraints, not the benchmark

Before any capability question, answer these:

  • Data residency. Where can this data legally be processed? For European or healthcare data you often cannot send it to a US inference endpoint at all, regardless of contractual assurances.
  • Latency budget. Is this interactive (sub-second)? Near-real-time (a few seconds)? Batch (minutes)? That choice kills half the options immediately.
  • Volume. Ten queries a day or ten million? At scale, a 10x cost difference per token is the difference between a pilot and a budget memo.
  • Tolerance for errors. What does a wrong answer cost? A summary is cheap. A ticket routed to the wrong team is more expensive. A command executed against production is career-ending.

Open versus closed

This debate is more religion than engineering. The pragmatic take: closed models (Claude, GPT-4o, Gemini) are currently the best at the highest-difficulty tasks. Open-weight models (Llama, Qwen, Mistral, Gemma) have closed most of the gap on mid-difficulty tasks at a fraction of the cost and with full control.

Run the closed model for the hard 20 percent of your volume. Run the open model on your own infrastructure for the easy 80 percent. A small router model decides which is which. Most of our workload now runs on a Llama variant hosted on vLLM. Claude handles the hard cases. Costs went down by roughly 70 percent without a measurable quality regression.
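The routing pattern can be sketched in a few lines. The dispatch targets and the difficulty signals below are illustrative placeholders, not real APIs or a trained router; in production the "which is which" decision is usually a small classifier trained on your own traffic.

```python
# Minimal cost-routing sketch. The signal list is a stand-in for a
# trained router model; the returned labels are placeholders for
# calls to your self-hosted and frontier endpoints.

HARD_SIGNALS = ("multi-step", "reconcile", "root cause", "legal")

def is_hard(query: str) -> bool:
    """Cheap heuristic stand-in for a small router model."""
    return len(query) > 500 or any(s in query.lower() for s in HARD_SIGNALS)

def route(query: str) -> str:
    # Hard ~20 percent -> frontier model; easy ~80 percent -> open model.
    return "frontier" if is_hard(query) else "self-hosted"
```

The point is not the heuristic; it is that the router sits in front of both endpoints, so you can tighten or retrain it without touching either model.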

Small specialized beats giant generic

The most counterintuitive result of the last two years: for a narrowly defined task, a small fine-tuned model frequently beats a giant general-purpose one. A 7B-parameter model fine-tuned on your ticket-classification data will outperform GPT-4o at that specific task, at one-hundredth the cost and with lower latency. It will be dumber at everything else. For that one task, you do not care.

This is the question I would answer before assuming you need a frontier model: can you define the task narrowly enough to fine-tune a small specialist?

Fine-tuning economics

Fine-tuning is worthwhile when you have clear task specifications, a few thousand high-quality examples, stable requirements, and volume that justifies it. It is a trap when any of those are missing. Most teams that asked me about fine-tuning in 2024 should have been asking about better prompts or better retrieval instead.

If you do fine-tune, budget for re-tuning. Your task definition will drift. The base model will get a new version. You will do this more than once.
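Most of the real work in fine-tuning is preparing those few thousand examples. A sketch of what that looks like for the ticket-classification case, assuming a generic prompt/completion JSONL format; the exact field names vary by provider, and the tickets here are invented for illustration:

```python
import json

# Hypothetical labeled tickets; real data comes from your ticketing system.
tickets = [
    {"text": "VPN drops every 10 minutes on the office network", "team": "network"},
    {"text": "Need a license for the new accounting software", "team": "procurement"},
]

def to_jsonl(rows):
    """Format labeled examples as prompt/completion JSONL.

    Field names and prompt shape are illustrative; check your
    provider's fine-tuning documentation for the required schema.
    """
    lines = []
    for r in rows:
        lines.append(json.dumps({
            "prompt": f"Route this ticket to a team:\n{r['text']}\nTeam:",
            "completion": " " + r["team"],
        }))
    return "\n".join(lines)
```

Keep this script under version control next to the data: when you re-tune, the prompt template is part of the task definition that drifted.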

Quick sketch of the frontier options

Claude (Anthropic). Best instruction-following in my evals. Strong on long-context reasoning and structured extraction. Competitive pricing. Good enterprise contract terms.

GPT-4o / o-series (OpenAI). Broadest ecosystem. Best-in-class tool calling. The reasoning models are genuinely useful for planning tasks. Pricing is tier-dependent.

Gemini (Google). The best long-context window in the business if you actually need two-million-token prompts. Tight integration with Google Cloud and Workspace. Quality is competitive if uneven.

Llama (Meta) and friends. The default for self-hosted. The 70B and larger variants are respectable on most tasks. The real win is control, not raw quality.

How I actually decide

Write down the task. Define three to five evaluation examples that any acceptable model must answer correctly. Run them against two or three candidate models. Measure cost, latency, and quality on your actual data, not on MMLU. Pick the cheapest option that clears the quality bar with margin. Revisit in six months, because the landscape will have moved.