The hardest decision in building a production AI decision system is not which model to use or how to prompt it. It is where in the workflow the human goes. Get that wrong and you ship one of two failure modes: an automated system that the team has lost confidence in and routes around, or a "decision support" tool the team uses as a fancy autocomplete while still doing the work the old way.
This is a working post about the three patterns we ship most often. We'll cover what each one looks like, when each one breaks, and the three questions we ask before choosing.
Pattern 1: Auto-decide with audit (the green path)
The AI makes the decision. The decision is logged with the inputs, the model's reasoning, and the outcome. A human reviews a sample of decisions periodically (daily for high-risk flows, weekly for low-risk).
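A minimal sketch of the log-and-sample step, assuming a simple record shape and hash-based sampling. The names `DecisionRecord`, `to_log_line`, and `in_audit_sample` are hypothetical, and the field layout is an illustration rather than a required schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical audit record: inputs, model reasoning, and outcome."""
    decision_id: str
    inputs: dict
    model_reasoning: str
    outcome: str


def to_log_line(record: DecisionRecord) -> str:
    # One JSON object per line keeps the log easy to append to and to query.
    return json.dumps(asdict(record), sort_keys=True)


def in_audit_sample(decision_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of decisions for review.

    Hashing the ID (instead of calling random()) means the same decision is
    always in or out of the sample, so the review queue can be rebuilt later.
    """
    digest = hashlib.sha256(decision_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


record = DecisionRecord(
    decision_id="ticket-1001",
    inputs={"subject": "Refund not received"},
    model_reasoning="Mentions a refund, so this is a billing issue.",
    outcome="billing_queue",
)
log_line = to_log_line(record)
needs_review = in_audit_sample(record.decision_id, sample_rate=0.05)
```

A higher `sample_rate` for high-risk flows and a lower one for low-risk flows would mirror the daily and weekly cadence described above.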
This is the right pattern when three conditions hold. The decision is reversible, the volume is high enough that human review of every case is prohibitive, and the error cost is bounded. Examples: routing inbound support tickets to the right queue, categorising transactions for a personal-finance app, flagging duplicate invoices for review, auto-responding to FAQ-shaped queries.
It breaks when one of those conditions stops being true. A model that handled ticket routing well at 1,000 a day starts mis-routing as your product range grows and the team doesn't notice for three weeks because nobody is reviewing the queue. Or the "low risk" categorisation turns out to feed into a regulatory report and now the auditor wants to know who signed off on every decision.
The fix is in the audit layer, not the model. The audit has to be active, not passive. A weekly digest the team actually opens. Drift detection on the model's confidence distribution. An on-call who looks at flagged cases within 24 hours. Without those, "auto-decide with audit" becomes "auto-decide" and you find out about the failure mode from a customer complaint.
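One way to make the drift check concrete is a Population Stability Index (PSI) over binned confidence scores. PSI is a technique we are swapping in for illustration, not something this post prescribes. The sketch assumes scores in the range 0 to 1, and the function name `confidence_psi` is hypothetical.

```python
import math
from typing import List, Sequence


def confidence_psi(baseline: Sequence[float], current: Sequence[float],
                   bins: int = 10) -> float:
    """Population Stability Index between two sets of confidence scores.

    Assumes every score lies in [0, 1]. Returns 0.0 for identical
    distributions and grows as the two distributions diverge.
    """
    if not baseline or not current:
        raise ValueError("both score sets must be non-empty")

    def proportions(scores: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1  # put 1.0 in the last bin
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(scores), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Illustrative data: last month mostly high confidence, this week a 50/50 split.
last_month = [0.95] * 80 + [0.65] * 20
this_week = [0.95] * 50 + [0.65] * 50
drift = confidence_psi(last_month, this_week)
```

The alert threshold is a choice to tune per workflow. A common rule of thumb treats a PSI above roughly 0.25 as a shift worth a human look, but that figure is a convention, not a guarantee.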
Pattern 2: Recommend and route (the yellow path)
The AI proposes a decision with reasoning. The human reviews and confirms, edits, or rejects. The system records the human's verdict and (if you're disciplined about it) feeds that back as training signal.
This is the workhorse pattern for regulated workflows. KYC adjudication, underwriting reviews, claims triage, compliance escalations. The model does the reading, the lookup, the cross-reference. It produces a draft decision and a justification. The human reviews and decides.
When it works it delivers genuine productivity gains. A well-designed recommend-and-route system can collapse the time per case substantially, because most analyst time goes on gathering inputs and writing justifications, not on deciding.
It breaks in two ways. The first is automation bias. If the model is right 90% of the time, humans learn to confirm without reading. The 10% of wrong decisions go through with a human's name on them. You can detect this in the data (review time drops below the time it physically takes to read the draft) but only if you instrument for it.
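The review-time check can be sketched as below. The reading speed of 250 words per minute is an assumption to calibrate against your own reviewers, and `Review`, `min_read_seconds`, and `rubber_stamp_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

ASSUMED_WORDS_PER_MINUTE = 250  # assumption: calibrate against real reviewers


@dataclass(frozen=True)
class Review:
    reviewer: str
    draft_words: int       # length of the model's draft and justification
    review_seconds: float  # time between opening the case and confirming


def min_read_seconds(draft_words: int,
                     wpm: int = ASSUMED_WORDS_PER_MINUTE) -> float:
    """Shortest plausible time to read a draft of this length."""
    return draft_words / wpm * 60


def rubber_stamp_rate(reviews: Iterable[Review]) -> Dict[str, float]:
    """Share of each reviewer's cases confirmed faster than they could be read."""
    totals: Dict[str, int] = {}
    too_fast: Dict[str, int] = {}
    for r in reviews:
        totals[r.reviewer] = totals.get(r.reviewer, 0) + 1
        if r.review_seconds < min_read_seconds(r.draft_words):
            too_fast[r.reviewer] = too_fast.get(r.reviewer, 0) + 1
    return {name: too_fast.get(name, 0) / n for name, n in totals.items()}


rates = rubber_stamp_rate([
    Review("analyst_a", draft_words=500, review_seconds=20),   # under 120 s
    Review("analyst_a", draft_words=500, review_seconds=150),
    Review("analyst_b", draft_words=250, review_seconds=90),
])
```

A rising rate for one reviewer is a prompt for a conversation, not proof of rubber-stamping: short review times can also mean the reviewer already knew the case.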
The second is the opposite: the human ignores the recommendation entirely and redoes the work. Usually that is because the recommendation is unreliable, the UI doesn't surface the reasoning well, or the team stopped trusting the system after a bad rollout. Either way, you've paid to build a tool nobody uses.
The mitigations: build the review UI for skimming the reasoning, not just the verdict. Track override rates per reviewer and per decision type. A very low override rate is more likely a signal of automation bias than of perfect accuracy. A very high override rate means the model isn't earning its keep and needs retraining, scope narrowing, or removal.
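A sketch of the override tracking, assuming each verdict arrives as a `(reviewer, decision_type, overridden)` tuple where `overridden` is true for an edit or a reject. The thresholds and the minimum case count are illustrative assumptions, not recommended values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed thresholds for illustration; tune them per workflow.
LOW_OVERRIDE_RATE = 0.02
HIGH_OVERRIDE_RATE = 0.40
MIN_CASES = 50  # skip groups too small to say anything about

Verdict = Tuple[str, str, bool]  # (reviewer, decision_type, overridden)


def override_flags(verdicts: Iterable[Verdict]) -> Dict[Tuple[str, str], str]:
    """Flag reviewers and decision types whose override rate looks off."""
    counts = defaultdict(lambda: [0, 0])  # key -> [overrides, total]
    for reviewer, decision_type, overridden in verdicts:
        for key in (("reviewer", reviewer), ("decision_type", decision_type)):
            counts[key][0] += int(overridden)
            counts[key][1] += 1

    flags: Dict[Tuple[str, str], str] = {}
    for key, (overrides, total) in counts.items():
        if total < MIN_CASES:
            continue
        rate = overrides / total
        if rate <= LOW_OVERRIDE_RATE:
            flags[key] = "possible automation bias"
        elif rate >= HIGH_OVERRIDE_RATE:
            flags[key] = "model not earning its keep"
    return flags


verdicts = (
    [("ann", "kyc", False)] * 100
    + [("bob", "claims", True)] * 60
    + [("bob", "claims", False)] * 40
    + [("cy", "kyc", True)] * 3
)
flags = override_flags(verdicts)
```

Grouping by both reviewer and decision type matters because the two failure modes can coexist: one analyst can rubber-stamp while the model underperforms on a single case type.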
Pattern 3: Surface and stop (the red path)
The AI does not propose a decision at all. It surfaces the relevant information, the relevant precedents, the conflicting policies, and stops. The human decides cold.
This is the right pattern for decisions where the cost of a wrong decision is very high, where the law or the regulator requires explicit human accountability, or where the inputs are too varied for the model to be reliably better than chance. Examples in fintech: regulated advice under the FAIS Act (where a licensed FSP must remain accountable for the recommendation), declining a loan application, suspicious-activity reporting decisions, anything that ends in a customer-facing letter signed by a person.
The mistake here is treating it as "just a search tool." A good surface-and-stop system is genuinely useful: it pulls together five things the human would have pulled together themselves in 15 minutes. It tells the human what's missing. It flags inconsistencies in the inputs. It does the boring work and leaves the judgement.
It breaks when the team treats the AI's selection of "relevant information" as the decision. You see this when a surface-and-stop tool starts being used as an "auto-decide" tool by tired analysts on busy days. The fix is process and design: make the human's reasoning a required field, make the surfaced information clearly labelled as "inputs to your decision" rather than "the answer," and audit the reasoning fields, not just the decisions.
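The "required reasoning field" idea can be enforced at the data layer as well as in the UI. This is a minimal sketch: `HumanDecision`, `SurfacedInput`, and the 40-character minimum are hypothetical, and a length check only catches empty or token answers, which is why the reasoning still needs a human audit.

```python
from dataclasses import dataclass, field
from typing import List

MIN_REASONING_CHARS = 40  # assumed floor; it only rules out empty answers


@dataclass(frozen=True)
class SurfacedInput:
    """Something the AI surfaced, labelled as an input, never as the answer."""
    label: str
    content: str


@dataclass
class HumanDecision:
    case_id: str
    decided_by: str
    outcome: str
    reasoning: str  # required: the reviewer's own words
    inputs_reviewed: List[SurfacedInput] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.reasoning.strip()) < MIN_REASONING_CHARS:
            raise ValueError(
                f"case {self.case_id}: reasoning is required "
                f"(at least {MIN_REASONING_CHARS} characters)"
            )


decision = HumanDecision(
    case_id="case-42",
    decided_by="analyst_a",
    outcome="escalate",
    reasoning="Income documents conflict with the bank statements for March.",
    inputs_reviewed=[SurfacedInput("inputs to your decision", "Payslip, March")],
)
```

Because the record cannot be constructed without reasoning, the audit can sample the `reasoning` fields directly instead of inferring intent from the outcome.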
How we choose between them
Three questions, asked in order.
1. Is the decision reversible? A wrong ticket routing is reversible in minutes. A wrong underwriting decision can take months. A wrong FAIS advice decision can land you in front of the FSCA. The less reversible, the more human the loop needs to be.
2. What is the cost of a wrong decision, and to whom? Customer cost, regulatory cost, financial cost, reputational cost. Sum them up. Compare to the cost of human review at the relevant volume. If wrong-decision cost is high and volume is moderate, pattern 2 or 3. If wrong-decision cost is low and volume is huge, pattern 1.
3. Does the law or the regulator require human accountability? For FAIS-regulated advice, yes (a licensed FSP carries the regulatory liability for the advice, regardless of how it was produced). For credit decisions, the National Credit Act's affordability assessment requirements apply whether the decision is human-made or automated, and regulators are increasingly scrutinising AI explainability in credit. For internal ops decisions, usually no. Don't try to engineer around this question. It is a constraint, not a preference.
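The three questions can be sketched as a function that starts at pattern 1 and only ever moves towards more human involvement. This simplifies the judgement above: the boundary between patterns 2 and 3 in particular is a call for people to make, and every field name and figure below is a hypothetical placeholder.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    AUTO_DECIDE_WITH_AUDIT = 1
    RECOMMEND_AND_ROUTE = 2
    SURFACE_AND_STOP = 3


@dataclass(frozen=True)
class Workflow:
    reversible: bool
    monthly_volume: int
    error_rate: float            # assumed share of wrong decisions
    cost_per_error: float        # customer + regulatory + financial + reputational
    review_minutes: float        # human time to review one case
    reviewer_hourly_cost: float
    human_accountability_required: bool


def monthly_error_cost(w: Workflow) -> float:
    return w.monthly_volume * w.error_rate * w.cost_per_error


def monthly_review_cost(w: Workflow) -> float:
    return w.monthly_volume * (w.review_minutes / 60) * w.reviewer_hourly_cost


def choose_pattern(w: Workflow) -> Pattern:
    level = 1
    if not w.reversible:                                 # question 1
        level = max(level, 2)
    if monthly_error_cost(w) >= monthly_review_cost(w):  # question 2
        level = max(level, 2)
    if w.human_accountability_required:                  # question 3: a constraint
        level = 3
    return Pattern(level)


ticket_routing = Workflow(True, 30000, 0.02, 5.0, 2.0, 30.0, False)
underwriting = Workflow(False, 400, 0.05, 2000.0, 30.0, 40.0, False)
regulated_advice = Workflow(False, 100, 0.05, 50000.0, 45.0, 60.0, True)
```

The result is a starting point for the design conversation. Treat the cost inputs as rough estimates, and rerun the comparison when volume or error rates change.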
Key takeaways
- Three patterns: auto-decide with audit, recommend and route, surface and stop. Each works in a specific zone.
- Auto-decide breaks when the audit becomes passive. The audit has to be a real, instrumented part of the system.
- Recommend-and-route breaks via automation bias (humans rubber-stamp) or under-trust (humans re-do the work). Track override rates to detect both.
- Surface-and-stop is the right pattern for high-stakes, low-volume, regulator-facing decisions. Design it so the reasoning is the artefact.
- Choose based on reversibility, error cost, and regulatory constraint, in that order.
- Most production decision systems are pattern 2. Pattern 1 is rarer than vendors imply. Pattern 3 is what keeps you out of trouble.
The interesting question in production AI is rarely "can we automate this?" It is "where, in this workflow, does a human's judgement add value the model can't, and how do we build the system so the human can actually exercise that judgement instead of rubber-stamping or being bypassed?" Get that right and you ship something the team uses for years. Get it wrong and you ship a tool the team works around within months.