A notebook demo proves the model can do the task once. Production proves it can do the task reliably, observably, and within latency and cost budgets. The gap between those two things is where most LLM projects quietly die, and it is mostly evaluation and observability work, not modelling work. This post is about the discipline that closes the gap.
Why the notebook-to-production gap is where LLM projects die
The pattern is familiar. A team ships an impressive demo. Stakeholders sign off. The build moves from a Jupyter notebook to a service behind an API, and a different set of problems shows up. The model sometimes hallucinates customer-specific facts. P99 latency spikes the moment a user uploads a long document. Token bills are double what the proposal said. A weekend regression silently breaks a feature that worked on Monday. None of these are modelling failures. They are infrastructure failures that the notebook never had to handle.
The teams that ship cleanly treat the move to production as its own project with its own deliverables: an evaluation harness, an observability layer, a deployment strategy, and a documented compliance posture. Skip any one of those and you ship a product you cannot defend when something goes wrong.
Golden eval sets and regression suites that actually catch regressions
Start small and hand-curated. Hamel Husain's evaluation guide recommends a golden set on the order of 100 examples, covering core features, prior bug regressions, and known edge cases, with deterministic assertions on the hot path wherever possible. Binary pass/fail scoring is preferred over Likert scales because it forces clearer thinking, yields more consistent labels, and matches the binary nature of the decision in front of you: ship or fix.
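A golden set with binary assertions can be very small in code. The sketch below is illustrative only: `GoldenCase`, `run_golden_set`, and the canned `fake_generate` function are hypothetical names standing in for your own harness and model call, and the three cases stand in for the hundred or so you would curate by hand.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic, binary assertion on the output


def run_golden_set(generate: Callable[[str], str], cases: list) -> dict:
    """Run every case through the model and report binary pass/fail results."""
    failures = [c.name for c in cases if not c.check(generate(c.prompt))]
    passed = len(cases) - len(failures)
    return {
        "passed": passed,
        "total": len(cases),
        "pass_rate": passed / len(cases),
        "failures": failures,
    }


# Hypothetical stand-in for the real model call, so the sketch runs offline.
def fake_generate(prompt: str) -> str:
    canned = {
        "Refund policy for order 123?": "Refunds are available within 30 days.",
        "Cancel my subscription": "I have escalated this to a human agent.",
    }
    return canned.get(prompt, "")


CASES = [
    # Core feature
    GoldenCase("core_refund_window", "Refund policy for order 123?",
               lambda out: "30 days" in out),
    # Prior bug regression: the service once returned an empty reply
    GoldenCase("regression_no_empty_reply", "Cancel my subscription",
               lambda out: len(out.strip()) > 0),
    # Known edge case: unparseable input should produce an apology
    GoldenCase("edge_unknown_prompt", "???",
               lambda out: "sorry" in out.lower()),
]

report = run_golden_set(fake_generate, CASES)
print(report)
```

Each check returns a plain boolean, so a failing case names itself in `failures` and there is no score to argue about.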
Counter-intuitively, a 100% pass rate is usually bad news. Husain points out that a perfect score often means the test suite is too easy and the harness is not actually stressing the system. Something closer to 70% is a healthier signal that the cases on the edge of capability are being measured. Aim for a harness that hurts.
OpenAI's official evaluation guidance pushes the same idea further into production. Continuously evaluate. Log production traces. Sample real user interactions for human review. Grow the eval set over time rather than freezing it at launch. Every escaped defect is a new test case. The set is a living artefact, not a sign-off document.
Where LLM-as-judge helps, and where it quietly lies to you
LLM-as-judge is useful when prompt engineering has stopped yielding gains and you need a way to grade open-ended outputs at volume. It is dangerous when treated as ground truth. Peer-reviewed work on judge models shows systematic position bias: in pairwise evaluations, simply swapping the order of two candidate answers can flip the verdict, demonstrated across 15 judge models and over 150,000 evaluation instances on MTBench and DevBench. Judges also exhibit self-preference (rating their own family of models higher) and drift over time.
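Position bias is cheap to measure on your own judge. A minimal sketch, assuming a pairwise judge exposed as a function that returns `"first"` or `"second"`: ask twice with the answers swapped and count how often the winner changes. The two toy judges here are hypothetical and exist only to make the sketch runnable; a real judge would wrap a model call.

```python
from typing import Callable, List, Tuple

# Assumed interface: judge(question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def position_consistent(judge: Judge, question: str, a: str, b: str) -> bool:
    """True if the judge prefers the same underlying answer in both orderings."""
    forward = judge(question, a, b)  # a is shown first
    reverse = judge(question, b, a)  # b is shown first
    winner_forward = a if forward == "first" else b
    winner_reverse = b if reverse == "first" else a
    return winner_forward == winner_reverse


def flip_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose verdict changes when the answer order is swapped."""
    flips = sum(not position_consistent(judge, q, a, b) for q, a, b in pairs)
    return flips / len(pairs)


# Toy judges for illustration only.
def always_first_judge(question: str, x: str, y: str) -> str:
    return "first"  # maximally position-biased


def longer_answer_judge(question: str, x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"  # order-independent here


PAIRS = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "detailed reply here", "ok"),
]

biased_rate = flip_rate(always_first_judge, PAIRS)
stable_rate = flip_rate(longer_answer_judge, PAIRS)
print(biased_rate, stable_rate)
```

A non-trivial flip rate on your own data is a reason to judge both orderings and discard or escalate the pairs that disagree.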
The practical rule. Reach for LLM-as-judge only after binary assertions and prompt fixes have run out, and only after you have calibrated the judge against a human-labelled subset. Log every disagreement between the judge and your humans, monitor that disagreement rate over time, and never let the judge be the only signal in front of a release gate. A judge can be a helpful instrument. It cannot be the referee.
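Calibration against a human-labelled subset can start as a few lines. The sketch below assumes binary pass/fail labels from both the judge and the humans, and reports raw agreement, Cohen's kappa (agreement corrected for chance), and the indices of disagreeing examples to log for review. The `calibrate` function and the sample labels are illustrative, not taken from any particular tool.

```python
from typing import List


def calibrate(judge_labels: List[bool], human_labels: List[bool]) -> dict:
    """Compare judge verdicts with human verdicts on the same examples."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human_labels)
    disagreements = [
        i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h
    ]
    observed = 1 - len(disagreements) / n

    # Chance agreement, from each rater's share of "pass" labels.
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)

    # Kappa is undefined when chance agreement is 1 (both raters gave a single,
    # identical label throughout); report 1.0 in that degenerate case.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    return {"agreement": observed, "kappa": kappa, "disagreements": disagreements}


# Made-up labels for eight examples (True = pass).
human = [True, True, False, False, True, False, True, True]
judge = [True, True, False, True, True, False, False, True]

result = calibrate(judge, human)
print(result)
```

Re-running the same comparison on a fresh human-labelled sample each month gives you the disagreement trend the rule above asks for.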
Observability: traces, token accounting, latency P95 and P99, cost per call
Production observability for LLMs is its own discipline. Trace every call. The OpenTelemetry GenAI semantic conventions standardise attributes such as gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons, so traces, token accounting, and finish reasons can be captured consistently across vendors. Adopt those names from day one and your dashboards survive a model swap.
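To make the naming concrete, here is a minimal sketch that builds a vendor-neutral attribute dictionary using the four attribute names above. It deliberately uses a plain dict rather than a tracing SDK so it runs anywhere; in a real service you would attach these key/value pairs to whatever span your tracer creates. The `genai_attributes` helper and the model name are hypothetical.

```python
from typing import Any, Dict, List


def genai_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: List[str],
) -> Dict[str, Any]:
    """Map one LLM call onto OpenTelemetry GenAI attribute names."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # The convention uses a list because one response can hold several choices.
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


# Placeholder values; "example-model-v1" is not a real model name.
attrs = genai_attributes("example-model-v1", 1200, 350, ["stop"])
for key, value in attrs.items():
    print(f"{key} = {value}")
```

Keeping the mapping in one function means a provider swap changes only the code that extracts the four values, not the dashboards that read them.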
Track P95 and P99 latency separately, not just averages, and instrument time-to-first-token as its own metric (a streamed reply that takes eight seconds to start feels broken even if the total response time is fine). Account for input, output, and cached tokens distinctly because their pricing differs by an order of magnitude. Anthropic's prompt caching, for example, prices cache reads at 0.1x the base input token rate (a 90% discount) and 5-minute cache writes at 1.25x, which is the documented mechanism teams use to bring production cost-per-call down on long, repeated context.
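Both ideas reduce to a few lines of arithmetic. The sketch below computes nearest-rank percentiles over a latency sample and prices a single call using the 0.1x cache-read and 1.25x cache-write multipliers quoted above. The per-million-token rates are placeholders chosen for round numbers, not a quote of any vendor's price list, and the function names are illustrative.

```python
import math
from typing import List

# Hypothetical base rates in dollars per million tokens. Substitute your own.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00
CACHE_READ_MULTIPLIER = 0.10   # cache reads at 0.1x the base input rate
CACHE_WRITE_MULTIPLIER = 1.25  # 5-minute cache writes at 1.25x


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_cost(uncached_input: int, cache_read: int, cache_write: int, output: int) -> float:
    """Dollar cost of one call, with each token class priced separately."""
    input_cost = (
        uncached_input * INPUT_RATE
        + cache_read * INPUT_RATE * CACHE_READ_MULTIPLIER
        + cache_write * INPUT_RATE * CACHE_WRITE_MULTIPLIER
    )
    return (input_cost + output * OUTPUT_RATE) / 1_000_000


# Synthetic latency sample: 1 ms through 100 ms.
latencies_ms = [float(i) for i in range(1, 101)]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# The same 21,000-token prompt with and without a 20,000-token cached prefix.
cached = call_cost(uncached_input=1_000, cache_read=20_000, cache_write=0, output=500)
uncached = call_cost(uncached_input=21_000, cache_read=0, cache_write=0, output=500)
print(p95, p99, round(cached, 4), round(uncached, 4))
```

With these placeholder rates the cached call costs under a quarter of the uncached one, a difference that disappears if cached tokens are lumped in with ordinary input.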
Compute cost per successful task, not cost per call. A retry that fixes a bad answer is not free. An agent that consumes more tokens to get the right result is doing useful work. Anthropic's writeup on their multi-agent research system is blunt about this: their Opus 4 lead with Sonnet 4 subagents outperformed a single-agent Opus 4 by 90.2% on their internal research evaluation, while multi-agent systems of this kind use roughly 15x more tokens than a chat interaction. That trade is only legible if you have the accounting in place to see it.
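The difference between the two metrics shows up as soon as retries and failures are counted. A minimal sketch, assuming each attempt is logged with a task id, a dollar cost, and a success flag (the `Attempt` record and the figures are made up for illustration):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    task_id: str
    cost: float     # dollars spent on this attempt
    success: bool   # did this attempt produce an accepted result?


def cost_per_call(attempts: List[Attempt]) -> float:
    return sum(a.cost for a in attempts) / len(attempts)


def cost_per_successful_task(attempts: List[Attempt]) -> Optional[float]:
    """Total spend, including retries and failures, divided by tasks that succeeded."""
    total = sum(a.cost for a in attempts)
    succeeded = {a.task_id for a in attempts if a.success}
    if not succeeded:
        return None  # nothing succeeded, so the ratio is undefined
    return total / len(succeeded)


LOG = [
    Attempt("t1", 0.02, False),  # first try failed
    Attempt("t1", 0.02, True),   # retry fixed it
    Attempt("t2", 0.03, True),
    Attempt("t3", 0.01, False),  # never succeeded
]

per_call = cost_per_call(LOG)
per_success = cost_per_successful_task(LOG)
print(round(per_call, 4), round(per_success, 4))
```

Here the per-call figure is half the per-successful-task figure, because failed attempts are paid for but deliver nothing.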
Choosing a stack: LangSmith, Langfuse, Arize Phoenix and Weights and Biases
You do not need a bespoke platform. Four mature options cover most teams. LangSmith pairs naturally with LangChain and LangGraph but is closed-source, with self-hosting limited to its enterprise plan. Arize Phoenix is a well-regarded open-source tracing and evaluation library, lightweight to self-host, with strong notebook ergonomics. Weights and Biases extends its ML experiment-tracking heritage into LLM observability via Weave. Langfuse is the open-source contender that closed most of its feature gap in June 2025 by open-sourcing its previously commercial LLM-as-a-judge evaluations, annotation queues, prompt experiments, and Playground under MIT.
The right pick depends on data residency rules, team size, and whether you want self-host or SaaS. For South African builds where customer data cannot leave the country, self-hosted Langfuse or Phoenix tends to win. For a small team optimising for speed-to-dashboards, hosted LangSmith or Langfuse Cloud is faster to stand up. Whatever you pick, the OpenTelemetry GenAI conventions above make migration far less painful when you outgrow the first choice.
Progressive rollout: shadow mode, canary, A/B against a golden set
Do not flip the switch. Ship in three stages. First, shadow mode: the new system runs alongside the old (or alongside no system at all) and writes its outputs to logs without showing them to users. You harvest disagreement, mistakes, and latency outliers without any user impact. Second, a small canary: a fraction of real traffic gets the new system. Define the kill switch and the rollback criteria before you turn it on. Third, an A/B against the golden eval set in CI on every change, so a quiet regression in week six gets caught the same day it lands.
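The canary stage can be sketched in a few lines. This example assumes users are bucketed deterministically by hashing their id, so the same user always sees the same variant, and that rollback criteria are written down as explicit thresholds before launch. The function names and the threshold values are hypothetical; pick numbers that match your own budgets.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically place a user in the canary for `percent` of traffic (0-100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


def route(user_id: str, percent: float, kill_switch: bool) -> str:
    """Return which system serves this request."""
    if kill_switch:
        return "stable"  # the kill switch overrides everything
    return "canary" if in_canary(user_id, percent) else "stable"


def should_roll_back(
    error_rate: float,
    p99_latency_s: float,
    max_error_rate: float = 0.02,  # hypothetical threshold
    max_p99_s: float = 10.0,       # hypothetical threshold
) -> bool:
    """Rollback criteria, agreed before the canary is switched on."""
    return error_rate > max_error_rate or p99_latency_s > max_p99_s


# Roughly 5% of a synthetic user population should land in the canary.
canary_users = sum(in_canary(f"user-{i}", 5) for i in range(10_000))
print(canary_users, route("user-42", 5, kill_switch=False))
```

Hashing rather than random sampling keeps each user's experience consistent across requests and makes canary incidents reproducible from the user id alone.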
Anthropic's engineering writeup on their multi-agent research system reports that adding full production tracing was the change that unlocked diagnosing agent failures, and that they use rainbow deployments to shift traffic gradually so long-running agents are not disrupted mid-task. The lesson is the same for smaller systems. Long-running LLM calls do not survive blue-green flips cleanly. Gradual is correct.
Shipping safely in South Africa: POPIA, FSCA expectations and OWASP LLM Top 10
The regulatory layer is not optional and not retro-fittable. Three pieces matter most for SA builds.
POPIA section 71. Decisions based solely on automated processing (including profiling) that have legal or material effects on a data subject are restricted. The Act requires mechanisms for the person to make representations or get human review. If your LLM system declines a loan, prices a policy, or rejects a claim, section 71 applies directly, and "the model said no" is not a defensible answer on its own.
The FSCA and Prudential Authority joint AI report (November 2025) found 52% of South African banks and 50% of payment providers were actively using AI, drawing on more than 2,100 survey responses. It called for stronger model risk management, board-level oversight, and explainability techniques such as SHAP and LIME. If you sell into financial services here, your buyers will be asking for this evidence inside the next twelve months.
OWASP Top 10 for LLM Applications 2025. Prompt Injection remains LLM01. Sensitive Information Disclosure moved up to LLM02. Two new categories landed: System Prompt Leakage and Vector and Embedding Weaknesses (the latter is directly relevant to anyone running RAG). Treat the list as the minimum security review for any LLM feature, not as nice-to-have reading.
For broader risk framing, the NIST Generative AI Profile (NIST AI 600-1), released 26 July 2024, defines 12 GenAI-specific risk areas including Confabulation, Data Privacy, Information Integrity, and Information Security, mapped onto the AI RMF Govern, Map, Measure, Manage functions. It is the cleanest free framework to anchor a board-level AI risk policy against.
Key takeaways
- The notebook-to-production gap is mostly evaluation and observability work, not modelling work. A demo proves the task is possible once. Production proves it works reliably, within latency and cost budgets, and with the audit trail to defend.
- Start with a small, hand-curated golden set of around 100 examples. Prefer binary pass/fail assertions. Reach for LLM-as-judge only after prompt fixes stop helping, and calibrate the judge against human labels before trusting it.
- Treat LLM-as-judge as a flawed instrument. Position bias is documented across 15 judge models and 150,000+ evaluation instances, and judges also show self-preference and scoring drift. Log disagreements with humans, monitor drift, and never let the judge be the only signal in front of a release gate.
- Production observability is its own discipline. Adopt the OpenTelemetry GenAI conventions, track P95 and P99 latency and time-to-first-token, account for input, output, and cached tokens separately, and compute cost per successful task, not cost per call.
- Ship with shadow mode first, then a small canary, then a CI A/B against the golden set. Bake in the South African regulatory layer (POPIA section 71, the FSCA and PA AI report, OWASP LLM Top 10 2025) from day one rather than retro-fitting it after launch.
An LLM build that lacks the eval harness, the trace dashboard, and the rollout plan is a demo wearing a service uniform. Putting those pieces in is not glamorous work, but it is the difference between a system you operate with confidence and a system you hope no one looks at too closely.