If I had to name the most common feature of vendor pitch decks this year, it would be a slide with "AI-powered" in front of a product that last year was simply powered. That framing obscures a real question: where does generative AI actually make an operations team measurably better, and where is it a marketing layer on top of a tool that already existed?

I have spent the last eighteen months evaluating, deploying, and in some cases quietly retiring AI-assisted tooling across infrastructure and security functions. This post is my honest field report. Not dismissive, not evangelical. Specific.

What is actually working in production

These are use cases where I have seen consistent, measurable value.

LLM-assisted triage of tier-one SOC alerts

Narrow usage: enrichment, context, and summary. The model is excellent at pulling together the first five tabs an analyst would open for a given alert — asset owner, recent changes, related tickets, external reputation data. The analyst still makes the call. Time-to-triage drops meaningfully. False-positive fatigue drops more.
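The enrichment pattern is simple enough to sketch. Everything below is hypothetical — the lookup names, the context fields, the helper functions — but it shows the shape that works: assemble context, summarize it, and stop short of any close/escalate decision.

```python
from dataclasses import dataclass, field

@dataclass
class AlertContext:
    """The 'first five tabs' an analyst would open, bundled for one alert."""
    alert_id: str
    asset_owner: str = "unknown"
    recent_changes: list = field(default_factory=list)
    related_tickets: list = field(default_factory=list)
    reputation: dict = field(default_factory=dict)

def enrich_alert(alert_id: str, lookups: dict) -> AlertContext:
    """Run each lookup (CMDB, change system, ticketing, reputation feed --
    all stubs here). A failed lookup degrades to its default instead of
    blocking triage; the analyst still sees a partial bundle."""
    ctx = AlertContext(alert_id=alert_id)
    for field_name, fn in lookups.items():
        try:
            setattr(ctx, field_name, fn(alert_id))
        except Exception:
            pass  # missing context is visible via the default, not fatal
    return ctx

def analyst_summary(ctx: AlertContext) -> str:
    """In production this is where the model writes the prose summary.
    Note what is absent: no close/escalate decision is made here."""
    return (f"Alert {ctx.alert_id}: owner={ctx.asset_owner}, "
            f"{len(ctx.recent_changes)} recent change(s), "
            f"{len(ctx.related_tickets)} related ticket(s)")
```

The important design choice is in what the functions do not return: there is no `verdict` field anywhere, so the decision cannot quietly migrate into the tooling.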

Where it goes wrong is the moment someone tries to let the model decide. Tier-one decisions require knowing when to escalate, and the model has no calibrated sense of organizational context. It will close things it should escalate and escalate things a senior analyst would close in ten seconds.

Change ticket summarization

Most organizations have change advisory boards that read between forty and two hundred tickets per week. A well-prompted summarizer condenses each ticket to a four-line executive brief: what, why, risk, rollback. Board meetings get shorter and the discussion gets sharper. This is low-risk, high-leverage, and an easy early win.
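One way to keep those briefs honest is to template the prompt and validate the output shape before anything reaches the board pack. The template wording and field names below are my own sketch, not a standard:

```python
BRIEF_FIELDS = ("what", "why", "risk", "rollback")

PROMPT_TEMPLATE = (
    "Summarize the change ticket below as an executive brief.\n"
    "Respond with exactly four lines, each formatted 'field: text',\n"
    "for the fields what, why, risk, rollback. No other text.\n\n"
    "Ticket:\n{ticket}"
)

def parse_brief(model_output: str) -> dict:
    """Validate the model's brief: reject any summary with a missing or
    empty field rather than letting a silent hole into the board pack."""
    brief = {}
    for line in model_output.strip().splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in BRIEF_FIELDS and value.strip():
            brief[key] = value.strip()
    missing = [f for f in BRIEF_FIELDS if f not in brief]
    if missing:
        raise ValueError(f"brief is missing fields: {missing}")
    return brief
```

A rejected brief goes back through the model or to a human, but it never reaches the board half-finished — which is the failure mode that erodes trust in the tool fastest.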

Policy document drafting and review

Writing a first draft of an acceptable-use policy, an incident response plan, or a vendor risk questionnaire from a clean sheet is slow. Producing a first draft with an LLM, then having a human subject matter expert edit it, is dramatically faster and typically more complete than starting from scratch. The reviewer still does the thinking. The model does the typing.

Runbook drafting from post-mortems

Feed the model the post-mortem document and ask it to draft a runbook that would have prevented or shortened the incident. The output is uneven but frequently catches steps a tired human writer forgets. The engineer still has the last word.

Code and IaC review

As a first-pass reviewer on Terraform, Bicep, or infrastructure scripts, the model catches a meaningful number of simple mistakes: unpinned versions, missing tags, overly permissive IAM, forgotten backups. It is not a substitute for a human reviewer. It is a useful pre-filter.
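Some of that pre-filter does not even need a model. A hedged sketch of the deterministic half, catching the mistakes named above — the regex patterns and required tag names are illustrative, not a standard:

```python
import re

def iac_prefilter(hcl_text: str, required_tags=("owner", "cost_center")):
    """Cheap deterministic checks run before (or alongside) the model's
    first pass on a Terraform snippet. Returns human-readable findings."""
    findings = []
    # Unpinned versions: ">=" and "latest" accept whatever ships next.
    if re.search(r'version\s*=\s*"(?:>=|latest)', hcl_text):
        findings.append("unpinned provider/module version")
    # Wildcard IAM actions, in either HCL or inline JSON policy form.
    if re.search(r'(?:actions?\s*=\s*\[\s*"\*"|"Action"\s*:\s*"\*")', hcl_text):
        findings.append("overly permissive IAM action (*)")
    for tag in required_tags:
        if tag not in hcl_text:
            findings.append(f"missing required tag: {tag}")
    return findings
```

The model earns its keep on the judgment calls these patterns cannot express; the regexes earn theirs by never hallucinating a finding.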

Documentation maintenance

The single most underrated win. Keeping internal documentation in sync with actual infrastructure is something no team genuinely does well. Generative AI, pointed at the delta between infrastructure state and the docs, produces pull requests that humans then approve. Documentation quality improves. Engineer time spent on documentation drops.

What is hype

And here is where I will be unpopular with some vendors.

"Fully autonomous SOC"

No. Not at the current state of the technology, and not within the next two years by any reasonable read of the trajectory. Humans remain essential for anything above tier-two. Adversaries adapt. Models drift. The accountability question — who is responsible when the autonomous SOC makes the wrong call during a real incident — has not been answered, and until it is, autonomous SOC claims should be read as marketing.

AI-written code pushed to production without human review

No. Not in regulated industries. Not in infrastructure. Not anywhere that cares about correctness more than velocity. The model is a force multiplier for a human reviewer, not a replacement for one.

"AI-powered" where AI is a thin LLM wrapper

If the core value of the product is the same as it was before the AI label was added, the label does not add value. Ask the vendor what specific decisions the AI makes, on what inputs, with what evaluation. If they cannot answer precisely, it is branding.

Predictive outage claims

"Our AI predicts outages before they happen" is, in most cases, a lightly repackaged anomaly detection product. Real predictive value is possible but rare, and it typically requires years of clean telemetry and a sufficiently narrow domain. Evaluate these claims the way you would evaluate a weather forecast: over a long time window, against a ground truth, with false positives counted honestly.
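Forecast-style scoring does not need a platform; it fits in one function. A sketch, with all numbers in the usage below invented:

```python
def score_outage_predictions(predicted_days, actual_days, window_days):
    """Score an outage-prediction claim like a weather forecast.
    Inputs are sets of day indices within an evaluation window of
    window_days days."""
    predicted, actual = set(predicted_days), set(actual_days)
    hits = predicted & actual
    false_alarms = predicted - actual
    misses = actual - predicted
    return {
        "precision": len(hits) / len(predicted) if predicted else 0.0,
        "recall": len(hits) / len(actual) if actual else 1.0,
        # A model that alerts every day scores perfect recall and is
        # still useless; the false-alarm rate exposes that.
        "false_alarm_rate": len(false_alarms) / max(window_days - len(actual), 1),
        "missed_outages": len(misses),
    }
```

For example, a vendor pilot that flagged days 12, 88, and 200 in a year where outages actually hit days 88 and 301 scores precision 1/3, recall 1/2, and one missed outage — numbers you can put in a renewal conversation.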

The governance problems nobody wants to own

This is where the real work lives, and where most organizations have not yet done the thinking.

  • Prompt injection. If your AI agent reads ticket content, and a ticket contains attacker-controlled text, that text can rewrite the agent's instructions. This is not theoretical. It is well-documented.
  • Data leakage to foundation model providers. Sending production data to external APIs is a data handling decision that most organizations have not formally classified.
  • Model provenance. Which model, from which provider, at which version, with which training cutoff, evaluated by whom. None of these are trivial in a regulated environment.
  • Auditability. If an AI-assisted decision shows up in a compliance audit, can you reconstruct what the model saw, what it said, and what the human did with it?

These are not blockers. They are homework. Organizations that skip them are accumulating policy debt they will have to pay back later, usually after an incident.
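For the auditability item in particular, the homework can be small. A sketch of a per-decision audit record — the field names and action vocabulary are hypothetical, and the hashes are meant to tie each record to the full prompt and response transcripts stored in your existing log system:

```python
import hashlib
import time
from dataclasses import dataclass

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class AIDecisionRecord:
    """One auditable row per AI-assisted decision: what the model saw,
    what it said, and what the human did with it."""
    model_id: str         # provider/model/version, incl. training cutoff if known
    prompt_sha256: str    # hash of the full input the model saw
    response_sha256: str  # hash of the model's output
    human_action: str     # "accepted" | "edited" | "rejected"
    actor: str
    timestamp: float

def record_decision(model_id, prompt, response, human_action, actor):
    if human_action not in ("accepted", "edited", "rejected"):
        raise ValueError(f"unknown human_action: {human_action}")
    return AIDecisionRecord(model_id, _sha256(prompt), _sha256(response),
                            human_action, actor, time.time())
```

Note that the model provenance question from the list above collapses into a single required field: if you cannot fill in `model_id`, you have found a governance gap before the auditor does.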

Where I see real leverage in the next twelve to twenty-four months

  1. Automated change documentation. High leverage, low risk. This should be widespread by the end of the year.
  2. Post-incident timeline construction. Pulling a coherent chronology out of logs, chat, and ticket systems is slow human work and a natural fit for a model.
  3. Compliance evidence gathering. Collecting and formatting evidence for audits is an enormous annual tax on security and IT teams. It is exactly the kind of structured-but-boring work that LLMs accelerate.
  4. Threat intelligence summarization. Condensing twenty vendor reports and blog posts into a single weekly digest, with sources cited. Low stakes, high time-saving.
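Item 2 illustrates where the model belongs in the pipeline: assembling the chronology from logs, chat, and tickets is deterministic plumbing, and the model's job is only to turn the merged list into prose afterward. A minimal sketch, assuming each source yields (ISO timestamp, origin, text) tuples:

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge (iso_timestamp, origin, text) events from any number of
    sources -- logs, chat exports, ticket updates -- into one sorted
    chronology. Summarizing the result is the model's job, not this."""
    events = [e for source in sources for e in source]
    events.sort(key=lambda e: datetime.fromisoformat(e[0]))
    return [f"{ts} [{origin}] {text}" for ts, origin, text in events]
```

Keeping the merge out of the model's hands also keeps the timeline reproducible, which matters when the post-mortem gets contested.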

Notice what these have in common. They are all cases where the model drafts and a human decides. That is the pattern that works. The cases that fail are the ones where the model is the decision-maker.

A short operating rule

AI is a tool, not a strategy. Use it where it demonstrably saves time and preserves human judgment. Avoid it where it replaces judgment. When you cannot tell which side of the line a proposal falls on, that is the proposal to slow down on.

The organizations I see getting the most out of this technology are not the ones with the most ambitious deployments. They are the ones who picked three or four narrow use cases, integrated them carefully, measured the time saved, and only then expanded. That is unsexy. It is also how every productive wave of operational technology has played out, from virtualization to cloud. There is no reason to believe this one is different.