Build a document agent in a Friday afternoon tutorial and it works. You point LangChain at five PDFs, drop the chunks into Chroma, wrap it in a chat UI, and your demo answers questions about your sample data with confidence. The CTO is impressed. The board is impressed. Two weeks later you point it at the production document store and the wheels come off.
This is the post we wish someone had handed us before our first production RAG build. Six things you will hit, in roughly the order you will hit them, and what to do about each one.
1. Your documents are not documents
The first surprise. The "document store" you've been told is the source of truth is rarely a set of clean PDFs. In production it is, in roughly equal measure: PDFs from the actual policy team, Word docs that have been saved as PDF, Word docs that were never saved as PDF, scanned faxes from 2017, photos of printouts taken on a phone, screenshots of WhatsApp messages, and email threads where the actual content is in an attachment that is itself an image.
Before you chunk anything, you need a pipeline that classifies what each file actually is and routes it. Native-text PDFs go to text extraction. Scanned PDFs go to OCR (we use Azure Document Intelligence or AWS Textract; in our experience the open-source OCR options are roughly a year behind on real-world quality). Image attachments take the same OCR path. Word docs get their own path, because converting them to PDF loses formatting that often carries the actual structure of the document.
This sounds boring. It is boring. It is also where most production RAG systems fail before they start, because the team built the retrieval layer first and tried to backfill the ingestion layer when the demo broke.
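A minimal sketch of that routing step, in Python. The route names, the suffix lists, and the `has_text_layer` flag are illustrative assumptions: in practice the flag would come from a separate probe that tries to extract text from the first few pages of a PDF.

```python
from enum import Enum
from pathlib import Path


class Route(Enum):
    TEXT_EXTRACTION = "text_extraction"
    OCR = "ocr"
    WORD_PARSER = "word_parser"
    MANUAL_REVIEW = "manual_review"


# Hypothetical suffix lists; extend them to match your own document store.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
WORD_SUFFIXES = {".doc", ".docx"}


def route_file(filename: str, has_text_layer: bool = False) -> Route:
    """Pick an ingestion path for one file.

    has_text_layer is only consulted for PDFs. It is assumed to be
    computed elsewhere, before this function is called.
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return Route.TEXT_EXTRACTION if has_text_layer else Route.OCR
    if suffix in IMAGE_SUFFIXES:
        return Route.OCR
    if suffix in WORD_SUFFIXES:
        return Route.WORD_PARSER
    # Anything unrecognised (email threads, archives) is held back
    # rather than silently chunked.
    return Route.MANUAL_REVIEW
```

The fallback route is deliberate: a file nobody classified should stop the pipeline, not end up in the index as noise.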
2. Chunking strategy is the whole game
Tutorials tell you to chunk at 512 tokens with 50-token overlap. That is fine for Wikipedia. It is wrong for client data.
A typical fintech policy document has structure: a table of contents, numbered sections, sub-clauses, defined terms. The semantic unit of "what's the procedure for adverse media review?" is a labelled sub-clause that might be 80 tokens or 1,200 tokens. Cutting it at 512 splits the answer.
What works in production: structure-aware chunking. Use a parser that respects headings, lists, and tables. Chunk at the smallest unit that's semantically complete (often a sub-clause). Include the full breadcrumb of parent headings in the chunk's metadata so the retriever sees "Section 4.3.2: Source of Funds Verification" attached to the body. Pass that breadcrumb to the re-ranker along with the body, so the heading context counts when candidates are scored against the query.
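As a rough illustration of chunking on structure, the sketch below splits plain text on numbered headings and attaches the parent headings to each chunk. It assumes headings look like "4.3.2 Title" on their own line, which is a simplification: a body line that happens to start with a number would be misread as a heading, so a real parser needs layout or style information as well.

```python
import re
from dataclasses import dataclass

# Assumed heading shape: dotted section number, whitespace, title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


@dataclass
class Chunk:
    breadcrumb: list[str]  # parent headings, outermost first
    body: str


def chunk_by_headings(text: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    stack: list[tuple[int, str]] = []  # (depth, heading) for open sections
    body_lines: list[str] = []

    def flush() -> None:
        # Emit the body collected under the current heading, if any.
        body = "\n".join(body_lines).strip()
        if body:
            chunks.append(Chunk([heading for _, heading in stack], body))
        body_lines.clear()

    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()
            number, title = match.groups()
            depth = number.count(".") + 1
            # Close sections at the same or deeper level.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, f"{number} {title}"))
        else:
            body_lines.append(line)
    flush()
    return chunks
```

Each chunk is as long as its sub-clause, whether that is 80 tokens or 1,200, and the breadcrumb can be stored as metadata or prepended to the text that gets embedded.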
For tables: extract them separately, store them as JSON, and let the retrieval layer fetch a whole table when the question is about a number or a threshold. The worst answers we see in production are from systems that chunked a regulatory table mid-row.
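One possible shape for a stored table, sketched in Python. The field names and the sample values are made up for illustration; the point is that every row keeps its column headers and the table is stored and fetched as one unit.

```python
import json


def table_to_record(table_id, breadcrumb, header, rows):
    """Turn an extracted table into a single JSON-serialisable record."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match header {header!r}")
    return {
        "id": table_id,
        "type": "table",
        "breadcrumb": breadcrumb,
        # Each row carries its own column names, so no cell loses its meaning.
        "rows": [dict(zip(header, row)) for row in rows],
    }


# Hypothetical table; the values are placeholders, not real thresholds.
record = table_to_record(
    "tbl-example-1",
    ["Example Policy", "Thresholds"],
    ["risk_tier", "review_days"],
    [["low", 30], ["high", 5]],
)
stored = json.dumps(record)
```

Rejecting ragged rows at ingestion time is a cheap guard against the mid-row splits described above.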
3. Embeddings are not magic
"Just use OpenAI text-embedding-3-large" is the default, and for English-language general content it works. For specialised domains it under-performs in specific, predictable ways.
Legal and regulatory text uses defined terms ("the Account Holder," "the Responsible Person," "Eligible Customer") that mean the same thing across documents but are worded differently from one document to the next. Pure embedding retrieval tends to miss the exact defined term. Pure keyword search misses the variants.
The fix is hybrid retrieval. Run BM25 (or equivalent lexical) and embedding similarity in parallel. Merge with a re-ranker. Cohere Rerank is the obvious commercial option; cross-encoders from sentence-transformers are the open-source path. Either way, the re-ranker is what saves you from confident-but-wrong retrieval.
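The merge step can be sketched as follows, assuming the lexical and embedding searches have already run and each returned a list of chunk texts. The `rerank_score` parameter is a hypothetical hook where a real re-ranker would plug in; `token_overlap` is only a toy stand-in so the example runs without any model.

```python
from typing import Callable, Sequence


def hybrid_retrieve(
    query: str,
    lexical_hits: Sequence[str],
    embedding_hits: Sequence[str],
    rerank_score: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Merge two candidate lists, then order them with a re-ranker."""
    # Union of both lists, first-seen order, duplicates dropped.
    candidates = list(dict.fromkeys([*lexical_hits, *embedding_hits]))
    ranked = sorted(
        candidates,
        key=lambda chunk: rerank_score(query, chunk),
        reverse=True,
    )
    return ranked[:top_k]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer: share of query tokens found in the chunk.

    Not a real re-ranker. In production this is where a cross-encoder
    or a hosted re-rank service would be called.
    """
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)
```

Keeping the scorer as a parameter means the re-ranker can be swapped and evaluated without touching the retrieval code.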
For Afrikaans, isiZulu, or mixed-language content (common in South African fintech), test embedding quality before you commit. Multilingual embedding models from providers like Cohere typically outperform a generic English-first embedder on non-English content. Always benchmark on a sample of your own data.
4. Contradictions are the rule, not the exception
This is the one nobody warns you about. When you index the real client document store, you will find that the same question has different answers in different documents. The 2019 policy says one thing. The 2022 amendment says another. The internal procedure document says a third thing. The training deck the team uses says a fourth.
Your retriever will dutifully return chunks from all four. Your LLM will, depending on the day, pick one and confidently report it. Production users will notice within a week.
The fix is in the index, not the model. Tag every chunk with its source document's metadata: effective date, document type, version, status (current/superseded/draft). At query time, filter or rank by these. A simple heuristic that works: prefer current over superseded, prefer most-recent effective date, prefer policy over procedure over training material, and surface conflicts to the user when they exist instead of resolving silently.
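A sketch of that heuristic as a sort key. Two things here are assumptions rather than part of the heuristic as stated: where "draft" sits relative to the other statuses, and the order in which the three preferences are applied (status first, then document type, then recency). Both are easy to change.

```python
from dataclasses import dataclass
from datetime import date

# Lower rank wins. The position of "draft" is an assumption.
STATUS_RANK = {"current": 0, "draft": 1, "superseded": 2}
DOC_TYPE_RANK = {"policy": 0, "procedure": 1, "training": 2}


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    document: str
    doc_type: str
    status: str
    effective: date


def precedence_key(chunk: ChunkMeta) -> tuple[int, int, int]:
    # Negated ordinal so that more recent dates sort first.
    return (
        STATUS_RANK[chunk.status],
        DOC_TYPE_RANK[chunk.doc_type],
        -chunk.effective.toordinal(),
    )


def rank_by_precedence(chunks: list[ChunkMeta]) -> list[ChunkMeta]:
    return sorted(chunks, key=precedence_key)


def competing_documents(chunks: list[ChunkMeta]) -> list[str]:
    """Distinct source documents among the candidates, best first.

    More than one entry means the answer draws on several documents,
    which is the cue to show them side by side instead of picking one.
    """
    ranked = rank_by_precedence(chunks)
    return list(dict.fromkeys(chunk.document for chunk in ranked))
```

This does not detect whether two documents actually disagree. It only makes sure the user sees that more than one source was in play and which one the system preferred.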
5. Citations are not optional
A document agent without citations is a hallucination machine with a search bar. Every answer the agent gives must come back with the source chunks it used, the document name, and the page or section reference. Click-through to the highlighted source in the original document is the production-grade bar.
This matters for three reasons. Users learn to trust the system by spot-checking citations. Compliance teams can audit decisions after the fact. When the agent is wrong (and it will be), the citation tells you whether the retrieval failed or the generation failed, which determines what you fix.
Build citations into the system from day one. Retrofitting them is painful because it touches both the retriever (you need source metadata on every chunk) and the generator (you need a prompt that forces structured output with cited spans).
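One way to make citations checkable is to have the generator return a structured answer and validate it against what the retriever actually returned. The sketch below assumes a hypothetical output shape (`Answer` with a list of `Citation` objects, each quoting a span) and a `retrieved` mapping of chunk id to chunk text; none of these names come from a specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    document: str
    section: str
    quote: str  # the span the answer claims to rely on


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple[Citation, ...]


def validate_answer(answer: Answer, retrieved: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the citations hold up."""
    problems: list[str] = []
    if not answer.citations:
        problems.append("answer has no citations")
    for citation in answer.citations:
        source = retrieved.get(citation.chunk_id)
        if source is None:
            # The model cited something the retriever never returned.
            problems.append(f"{citation.chunk_id}: not in retrieved set")
        elif citation.quote not in source:
            # The chunk exists but the quoted span is not in it.
            problems.append(f"{citation.chunk_id}: quoted span not found in chunk")
    return problems
```

An answer that fails validation can be retried or shown with a warning. The two failure messages also hint at where to look: a missing chunk points at the prompt or the generator, and a missing span points at how the model is quoting its sources.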
6. Evaluation is what makes it production-ready
"It works on the demo questions" is not a production bar. Real evaluation in production document AI looks like this: a golden set of 50 to 200 questions, each labelled with the correct answer and the correct source. The questions are written by domain experts (your compliance team, your customer-ops lead), not the engineering team. They include adversarial cases, ambiguous cases, and the boring 80% of routine questions.
Every change to the system (new chunking strategy, new embedding model, new prompt, new re-ranker, new LLM version) gets evaluated against this set before it ships. You measure retrieval recall ("did we get the right chunks?"), answer accuracy ("did the LLM say the right thing given the chunks?"), and citation correctness ("are the cited sources actually the ones it used?").
Without this, you are flying blind. Every time the LLM vendor changes a model in the background (which happens, sometimes without notice), you find out from production users instead of from a regression test. That is not a place you want to be at month four.
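The three measurements can be computed with very little code once the golden set exists. This is a sketch under a few assumptions: answer correctness is judged elsewhere (by a domain expert or a grader) and arrives as a boolean, every golden case lists at least one expected chunk, and citation correctness is read here as the share of cited chunks that are expected sources, which is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_chunk_ids: frozenset[str]  # assumed non-empty


@dataclass(frozen=True)
class RunResult:
    retrieved_chunk_ids: tuple[str, ...]
    cited_chunk_ids: tuple[str, ...]
    answer_correct: bool  # judged outside this function


def evaluate(cases: list[GoldenCase], results: list[RunResult]) -> dict[str, float]:
    """Average retrieval recall, answer accuracy, and citation precision."""
    if not cases or len(cases) != len(results):
        raise ValueError("need exactly one result per golden case")
    recall = accuracy = citation = 0.0
    for case, result in zip(cases, results):
        expected = case.expected_chunk_ids
        retrieved = set(result.retrieved_chunk_ids)
        cited = set(result.cited_chunk_ids)
        recall += len(expected & retrieved) / len(expected)
        accuracy += 1.0 if result.answer_correct else 0.0
        # An answer with no citations scores zero on citations.
        citation += len(cited & expected) / len(cited) if cited else 0.0
    n = len(cases)
    return {
        "retrieval_recall": recall / n,
        "answer_accuracy": accuracy / n,
        "citation_precision": citation / n,
    }
```

Run it on every change and store the numbers alongside the commit, so a drop after a model or prompt update shows up as a failed check and not as a user complaint.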
Key takeaways
- Ingestion is the hardest part. Real document stores are heterogeneous and need routing, OCR, and format-aware extraction before chunking.
- Chunk on document structure, not token count. Keep section breadcrumbs in metadata.
- Hybrid retrieval (BM25 plus embeddings) plus a re-ranker is the production baseline.
- Production data contains contradictions. Solve this in the index with date and status metadata, not in the prompt.
- Citations are non-negotiable. Build them in from day one.
- A golden eval set written by domain experts is what separates a demo from a production system.
Document agents are one of the highest-ROI AI builds available right now. Done well they collapse hours of analyst reading into minutes of review. Done badly they create new compliance liabilities and a chatbot the team learns not to trust. The difference is six layers of engineering most tutorials skip.