If you have to pick one AI pattern to deploy in an enterprise this year, pick retrieval-augmented generation over your internal documentation. It does not demo well. Nobody will write a Medium post about it. It will pay for itself inside a quarter and keep doing so.
I have now deployed three variants of this in different contexts. The failure modes are predictable. The wins are durable.
Pick the embedding model second
Engineers reach for the embedding model first. It is the wrong starting point. The right starting point is the corpus. What documents? How are they chunked? What metadata do you have? If you cannot answer those, no embedding model will save you.
That said: for most enterprise English-language corpora, a mid-sized open model like BGE or one of the instruction-tuned variants gets you within a few percent of the closed-source options at a fraction of the cost and with full data residency control. Use OpenAI's or Voyage's embeddings if you are already paying for that ecosystem. Do not agonize.
Chunking is where the accuracy lives
Naive fixed-size chunking (say, 512 tokens with 50 overlap) is the default in every LangChain tutorial and it is the default for a reason. It works okay. Okay is not great.
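The tutorial default fits in a few lines. A stdlib sketch over a pre-tokenized list; the 512/50 numbers are the tutorial defaults, not a recommendation:

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Naive fixed-size chunking with overlap: slide a window of `size`
    tokens forward by (size - overlap) each step."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Note what this ignores: headings, paragraphs, sentences. A chunk boundary lands wherever the arithmetic says it lands.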
What moved the needle for us was respecting document structure. Split on headings first, paragraphs second, sentences third. Keep the heading breadcrumb in the chunk metadata so "Step 3" under "Rollback Procedure" is not mistaken for "Step 3" under "Deployment." A chunk without its hierarchical context is frequently nonsense.
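A minimal sketch of heading-first splitting with breadcrumbs, assuming markdown-style `#` headings; real corpora also need the paragraph and sentence fallbacks for oversized sections:

```python
import re

def chunk_by_headings(markdown_text):
    """Split on headings, carrying the heading breadcrumb with each chunk
    so 'Step 3' keeps the name of the section it lives under."""
    breadcrumb = {}          # heading level -> heading text
    chunks, current = [], []

    def flush():
        text = "\n".join(current).strip()
        if text:
            trail = " > ".join(breadcrumb[k] for k in sorted(breadcrumb))
            chunks.append({"breadcrumb": trail, "text": text})
        current.clear()

    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # entering a new heading invalidates any deeper breadcrumb entries
            breadcrumb = {k: v for k, v in breadcrumb.items() if k < level}
            breadcrumb[level] = m.group(2).strip()
        else:
            current.append(line)
    flush()
    return chunks
```

The breadcrumb goes into chunk metadata and, in our setup, gets prepended to the chunk text before embedding, so the vector itself knows which section it came from.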
The other trick: for long structured documents (runbooks, policy manuals), index at two granularities. Small chunks for retrieval precision, then fetch the parent section for context at generation time. LlamaIndex calls this "parent document retrieval." It is worth the complexity.
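The two-granularity idea is simpler than it sounds. This sketch fakes similarity with word overlap so it stays self-contained; `build_two_level_index` and `retrieve_parent` are illustrative names, not LlamaIndex's API:

```python
def build_two_level_index(sections, child_size=40):
    """sections: {section_id: text}. Emit small child chunks, each
    pointing back at its parent section."""
    children = []
    for sid, text in sections.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append({"parent": sid,
                             "text": " ".join(words[i:i + child_size])})
    return children

def retrieve_parent(children, sections, query):
    """Score the small chunks (word overlap stands in for vector
    similarity), but hand the full parent section to the generator."""
    q = set(query.lower().split())
    best = max(children,
               key=lambda c: len(q & set(c["text"].lower().split())))
    return sections[best["parent"]]
```

Retrieval precision comes from the small chunks; generation context comes from the parent. You pay for a second lookup per query and a slightly fatter index, which is the complexity that is worth it.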
Metadata filtering is the underrated superpower
A user asks "how do I provision a new VPN account?" If your index is the entire company knowledge base, you will retrieve the provisioning doc, the decommissioning doc, the 2019 legacy VPN doc, and a marketing blog post. Vector similarity does not distinguish between them well.
Tag every chunk with source system, document type, last-reviewed date, audience, and region. Filter before you rank. A query from a US employee about current VPN procedures should never see the 2019 doc or the EU-only policy. This single change improved our answer quality more than any embedding model upgrade.
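Filter-before-rank is mechanically simple. A hedged sketch, with made-up metadata fields (`region`, `doc_type`, `last_reviewed`) standing in for whatever your pipeline tags:

```python
from datetime import date

def filter_then_rank(chunks, query_vec, score_fn, *,
                     region, doc_type=None, max_age_days=None):
    """Apply hard metadata filters first; rank only the survivors.
    Each chunk is a dict: {"meta": {...}, "vec": ...}."""
    today = date.today()

    def allowed(meta):
        if meta.get("region") not in (region, "global"):
            return False
        if doc_type and meta.get("doc_type") != doc_type:
            return False
        if max_age_days is not None and \
                (today - meta["last_reviewed"]).days > max_age_days:
            return False
        return True

    survivors = [c for c in chunks if allowed(c["meta"])]
    return sorted(survivors,
                  key=lambda c: score_fn(query_vec, c["vec"]),
                  reverse=True)
```

Most vector databases expose this as a pre-filter on the similarity search itself, which is faster than filtering client-side; the sketch just shows the ordering of operations that matters.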
What "accuracy" means in RAG
Accuracy is not a single number. It decomposes into retrieval quality (did we find the right chunk) and generation quality (did we use it correctly). Measure them separately. A system that retrieves well and generates badly needs different fixes than the reverse.
We maintain a test set of roughly 200 hand-labeled question-and-expected-source pairs. Every embedding or prompt change runs against it. The number we optimize for is "did the correct source appear in the top 5 retrieved chunks." Generation quality we sample manually. LLM-as-judge evals are tempting, and we do use them, but for a system that holds up over quarters, humans look at a sample of real production queries every week.
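The retrieval metric itself is a dozen lines. A sketch, assuming `retrieve` returns chunks ranked best-first, each carrying a `source` field:

```python
def top_k_hit_rate(test_set, retrieve, k=5):
    """test_set: list of (question, expected_source_id) pairs.
    Returns the fraction of questions whose expected source
    appears in the top-k retrieved chunks."""
    hits = 0
    for question, expected in test_set:
        top = retrieve(question)[:k]
        if any(chunk["source"] == expected for chunk in top):
            hits += 1
    return hits / len(test_set)
```

The discipline is not the metric, it is running it on every change and refusing to ship regressions.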
Stale docs will eat you
Your corpus is 30 percent wrong the day you build the index. Within six months it is closer to 50 percent. No vector database solves this. The only defense is surfacing freshness: date every chunk, show the date in the answer, flag old content. Pair that with a weekly report of "most retrieved docs that have not been edited in a year" and hand it to the teams that own them. The model becomes a lever for documentation hygiene.
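The weekly report is a counter plus a date cutoff. A sketch, assuming you log retrieved doc ids and know each doc's last-edited date (docs with no known edit date are treated as stale):

```python
from collections import Counter
from datetime import date, timedelta

def stale_doc_report(retrieval_log, last_edited, top_n=10, max_age_days=365):
    """retrieval_log: list of doc ids, one per retrieval event.
    last_edited: {doc_id: date}. Returns (doc_id, retrieval_count)
    for the most-retrieved docs not edited within max_age_days."""
    cutoff = date.today() - timedelta(days=max_age_days)
    counts = Counter(retrieval_log)
    stale = [(doc, n) for doc, n in counts.most_common()
             if last_edited.get(doc, cutoff) <= cutoff]
    return stale[:top_n]
```

Sorting by retrieval count is the point: a stale doc nobody retrieves is a nuisance, a stale doc retrieved fifty times a week is an incident waiting to happen.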
When you do not need RAG
Sometimes the right answer is "give them better search." If your users are technical, know what they are looking for, and just need to find a document quickly, RAG adds latency, cost, and a hallucination risk over what a well-tuned Elasticsearch would give you. The LLM earns its keep when users cannot articulate the query precisely, when answers span multiple documents, or when synthesis matters. If those conditions do not hold, ship search and move on.