The first time I watched a junior engineer paste a Grafana screenshot into ChatGPT at 2 a.m. and ask "what do I do," I knew two things. One, the runbook we had spent three months writing was never going to be read. Two, if I didn't give them a sanctioned path, they would keep doing this with unsanctioned tools.
Eighteen months later, LLM-augmented runbooks are one of the most useful things we have deployed. But almost none of the early patterns worked. Here is what stuck.
Stop treating the LLM as an answer engine
The failure mode in the first three months was always the same. An engineer asks the model "the payments service is throwing 5xx, what should I do." The model confidently recommends a restart procedure that would have been correct in 2023 but is wrong for our current topology. The engineer, tired, executes it. We learn a new way to make things worse.
The fix is structural. The LLM does not get to answer operational questions from its training weights. It gets to retrieve from our runbook corpus, summarize what it found, and cite the source. If there is no source, the answer is "I do not have documented guidance for this, escalate." That last sentence is the most important feature.
Retrieval-augmented, evidence-first
Our stack is unglamorous. Runbooks live in Confluence and a Git repo. A nightly job embeds them with a small open model, stores vectors in Postgres with pgvector, and attaches metadata (service, last-reviewed date, author). At query time, Claude Sonnet gets the top chunks, the engineer's question, and a system prompt that boils down to: answer only from these documents, quote the exact steps, show the source link, and if the evidence is thin say so.
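For concreteness, the query-time step looks roughly like this. In production it is a pgvector query; the in-memory ranking below shows the same top-k cosine logic in plain Python so the data shape (embedding plus metadata) is visible. The table and column names in the SQL string are illustrative, not our schema.

```python
import math

# Illustrative pgvector query; <=> is pgvector's cosine distance operator.
PGVECTOR_QUERY = """
SELECT chunk_text, service, last_verified, source_url
FROM runbook_chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=5):
    """chunks: dicts with an 'embedding' list plus metadata fields."""
    return sorted(chunks,
                  key=lambda c: cosine(query_vec, c["embedding"]),
                  reverse=True)[:k]
```

The metadata riding along with each chunk matters as much as the ranking: the service tag scopes retrieval, and last_verified feeds the staleness badge described below.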
The "show the source link" part is the difference between a tool on-call trusts and a tool on-call ignores. If the engineer can one-click into the actual runbook, they can sanity-check the summary against the full context. If they cannot, they assume the model hallucinated and type the command into Google anyway.
Handling stale docs
Half of any runbook corpus is wrong the day you write it. The other half goes stale inside a quarter. We attach a last_verified date to every chunk and surface it in the response. A runbook last verified fourteen months ago gets flagged in the UI as "stale, confirm before acting." This has done more for docs hygiene than any Confluence campaign I have run in ten years. Engineers now update runbooks because the stale badge is embarrassing.
Where it works
Two use cases carried the ROI.
On-call triage summarization. When an alert fires, a small agent pulls the last thirty minutes of logs, recent deploy metadata, and matching runbooks. It writes a six-line summary into the incident channel: what is firing, what changed recently, suggested first steps with runbook links, and a confidence rating. Engineers still do the triage. They just start from minute two instead of minute fifteen. Time spent on on-call handoffs dropped by roughly 30 percent.
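The six-line shape is the real contract with the incident channel, so it is worth pinning down in code. A sketch; the alert, deploy, and runbook inputs here stand in for our monitoring and retrieval APIs and their field names are hypothetical:

```python
def format_triage_summary(alert, recent_deploys, runbook_hits, confidence):
    """Render the fixed six-line summary posted to the incident channel."""
    deploys = "; ".join(recent_deploys) or "no deploys in window"
    links = ", ".join(h["source_url"] for h in runbook_hits) or "none found"
    step = runbook_hits[0]["first_step"] if runbook_hits else "escalate"
    return "\n".join([
        f"Firing: {alert['name']} ({alert['severity']})",
        "Window: last 30 minutes of logs attached",
        f"Recent changes: {deploys}",
        f"Suggested first step: {step}",
        f"Runbooks: {links}",
        f"Confidence: {confidence}",
    ])
```

Keeping the format fixed means engineers can scan it in seconds, and a missing runbook link degrades to "escalate" rather than an invented step.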
Change summarization for CAB. Generating a human-readable summary from a merged PR plus the linked ticket is the kind of task LLMs are genuinely good at. Our change advisory board meetings got shorter because people actually read the pre-reads.
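The prompt for that task is deliberately boring. A sketch, assuming hypothetical pr and ticket dicts from our Git and ticketing APIs:

```python
def cab_preread_prompt(pr: dict, ticket: dict) -> str:
    """Compose a change-summary request from a merged PR and its ticket."""
    return (
        "Write a short change summary for a change advisory board: what is "
        "changing, why, blast radius, and rollback plan. Plain language.\n\n"
        f"PR title: {pr['title']}\n"
        f"PR description: {pr['description']}\n"
        f"Diff stats: {pr['diff_stats']}\n"
        f"Ticket: {ticket['key']} - {ticket['summary']}"
    )
```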
Where we stopped
We tried giving the agent execution rights. We stopped. The failure modes are not the ones the demos show. The agent does not go rogue. It does something subtly wrong, confidently, and nobody notices for an hour. For anything that writes to production, the human is the gate. Full stop.
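The gate is structural, not a prompt instruction. A minimal sketch, assuming a hypothetical Action type; the point is that anything writing to production cannot execute from agent output alone, only with an explicit operator approval:

```python
from dataclasses import dataclass

@dataclass
class Action:
    command: str
    writes_to_prod: bool

def execute(action: Action, human_approved: bool = False) -> str:
    """Write actions are blocked unless a human has explicitly approved."""
    if action.writes_to_prod and not human_approved:
        return f"BLOCKED (needs human approval): {action.command}"
    # In real life this dispatches to the executor; here we just echo.
    return f"would run: {action.command}"
```

Note that the default is the safe path: forgetting to pass the approval flag blocks the action rather than running it.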
The ceiling of this pattern is a sharp one: it makes good runbooks great and makes bad runbooks dangerous. Invest in the corpus first. The model is the cheap part.