The demos are incredible. The agent watches a bug report, clones the repo, reproduces the issue, writes a fix, opens a PR. Twenty seconds, end to end. Everyone in the room leans forward.

Then you deploy it in production. The agent works beautifully on the happy path. On step four of a twelve-step task, it makes a small mistake. It does not notice. It proceeds on the assumption that step four succeeded. By step twelve, it has created a pull request against the wrong branch, modified unrelated files, and closed the original ticket with a confident summary of something that did not happen.

Agentic AI is real. It is also, in early 2026, much more fragile than the demos suggest. Here is how I think about where it earns its keep and where it burns you.

The core problem: error cascades

A one-shot LLM call has one chance to be wrong. An agent making twenty tool calls in sequence has twenty chances. If each step is 95 percent reliable (which is optimistic for anything non-trivial), the end-to-end success rate is 0.95 to the twentieth power, about 36 percent. The math is unkind and the math does not care how good the underlying model is.
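The compounding is easy to check yourself. A quick sketch (step counts and per-step reliabilities here are illustrative, not measurements):

```python
# End-to-end success of a sequential agent: every step must succeed,
# so per-step reliability compounds multiplicatively.
def end_to_end_success(per_step_reliability: float, steps: int) -> float:
    return per_step_reliability ** steps

print(round(end_to_end_success(0.95, 20), 2))  # 0.36
print(round(end_to_end_success(0.99, 20), 2))  # 0.82
```

Even a jump to 99 percent per-step reliability still loses roughly one run in five over twenty steps.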

Models do not recover well from their own mistakes. If the agent called a tool with the wrong argument and got a cryptic error, the typical response is to try a variation, get another cryptic error, and eventually hallucinate a workaround that is worse than both. Watching this happen in real time is educational in a way that no vendor demo is.

Toolchain complexity

Every tool you add to an agent's toolbelt is another surface for errors. The tool's schema has to match what the model expects. The tool's error messages have to be intelligible to a model. The tool's side effects have to be reversible or gated. Most internal APIs were not designed for any of this. You end up writing thin wrappers per tool, and those wrappers become the thing you maintain.
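What those thin wrappers look like in practice: validate arguments before the call, and translate raw failures into short, model-readable strings instead of stack traces or HTML. A minimal sketch (the tool name, fields, and response shape are invented for illustration):

```python
import json

def create_ticket_wrapper(raw_api_call, payload: dict) -> str:
    """Wrap a hypothetical ticket-creation API so the model always
    gets back compact JSON, never a raw exception or an HTML page."""
    required = {"title", "queue"}
    missing = required - payload.keys()
    if missing:
        # Catch schema mismatches before they hit the real API.
        return json.dumps({"ok": False, "error": f"missing fields: {sorted(missing)}"})
    try:
        result = raw_api_call(payload)
    except Exception as exc:
        # Truncate and name the failure so the model can reason about it.
        return json.dumps({"ok": False, "error": f"{type(exc).__name__}: {exc}"[:200]})
    return json.dumps({"ok": True, "result": result})
```

The wrapper is boring code, which is the point: it is the one place you can make a tool's failure modes legible to the model.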

Frameworks like LangChain and LlamaIndex help with the glue. They do not help with the fact that your internal ServiceNow instance returns HTML error pages to API calls under certain conditions, or that your ticketing system has rate limits the agent will happily blow through.

Scoped agents versus open-ended

The agent deployments that work have a narrow, well-defined scope. "Given an alert, gather these five enrichment signals, write a triage summary, and set one of four labels." That is an agent. It has a bounded action space, predictable inputs, and clear success criteria.
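One way to keep the action space bounded is to make it a type rather than a convention: the agent's output is coerced into one of the allowed labels, and anything else falls back to a safe default. A sketch, with label names invented for illustration:

```python
from enum import Enum

class TriageLabel(Enum):
    BENIGN = "benign"
    NEEDS_REVIEW = "needs_review"
    ESCALATE = "escalate"
    DUPLICATE = "duplicate"

def apply_label(raw_model_output: str) -> TriageLabel:
    """Coerce free-form model output into one of four allowed labels."""
    try:
        return TriageLabel(raw_model_output.strip().lower())
    except ValueError:
        # Unexpected output routes to a human instead of an action.
        return TriageLabel.NEEDS_REVIEW
```

The enum is the success criterion made executable: the agent cannot emit an action you did not enumerate.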

"Be my IT assistant" is not an agent. It is a vibe. Open-ended agents fail in ways that are hard to debug because the failure mode is the combinatorial explosion of the action space. Narrow the scope. Narrow it again. When you think you have narrowed it enough, narrow it once more. That version might ship.

Audit trails are not optional

If an agent can take actions that affect systems, every action needs to be logged with enough context to reconstruct what happened. The prompt at each step, the tool call made, the response returned, the model's reasoning if you can extract it. When something goes wrong, and it will, you need to walk back through the decision tree.
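A per-step trace record does not need to be elaborate; it needs to be complete. One possible shape, written as append-only JSON lines (the field names are one reasonable choice, not a standard):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class AgentStep:
    step: int
    prompt: str           # the prompt the model saw at this step
    tool_call: str        # the tool invocation, serialized
    tool_response: str    # what came back, verbatim
    reasoning: str        # model's reasoning, or "" if unavailable

def log_step(record: AgentStep, sink) -> None:
    # One JSON object per line: trivial to grep, trivial to replay.
    sink.write(json.dumps({"ts": time.time(), **asdict(record)}) + "\n")
```

Append-only lines mean the trace survives even if the agent crashes mid-run, which is exactly when you need it.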

This is also what makes incident response on agentic systems tractable. A developer can look at the trace, identify the step that went sideways, and either fix the prompt, fix the tool, or add a guard. Without the trace, you are guessing.

Human gates for blast radius

Agents should be allowed to act autonomously on reversible, low-stakes actions. Commenting on a ticket, tagging a message, querying a read-only API. Anything that cannot be easily undone needs a human in the loop. Creating tickets, modifying records, running scripts, sending external email.
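The gate itself can be a few lines: a whitelist of reversible actions runs autonomously, and everything else blocks on a human. A minimal sketch (the action names and approval mechanism are placeholders for whatever your systems use):

```python
# Reversible, low-stakes actions the agent may take on its own.
REVERSIBLE = {"comment_on_ticket", "tag_message", "read_only_query"}

def execute(action: str, run, request_human_approval) -> str:
    """Run reversible actions directly; gate everything else on a human."""
    if action in REVERSIBLE:
        return run(action)
    if request_human_approval(action):
        return run(action)
    return "blocked: awaiting human approval"
```

The important property is the default: an action not explicitly marked reversible is gated, so forgetting to classify a new tool fails safe.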

This is boring, and it is also the reason the systems we run have not had a public incident. The overhead of a human clicking confirm is small. The cost of an autonomous agent modifying the wrong customer record is not small.

When agents are worth the complexity

Three conditions need to hold: the task is repetitive enough that automating it saves real time, the scope is narrow enough that the failure modes are understandable, and the stakes are low enough that an occasional failure is recoverable. Triage summarization, routine enrichment, first-pass classification. The space where agents earn their keep today is unglamorous.

The more ambitious stuff will get there. Probably not this year. Definitely not this quarter. If a vendor is selling you an "autonomous IT operations agent" that replaces your ops team, ask to see their incident log. If they do not have one, they do not have production.