The first time a fintech actually needs its disaster recovery plan is usually the first time anyone reads it past page two. By that point the database is corrupt, the on-call engineer is improvising, and the executive on the call is asking questions that should have had answers six months ago. This post is about how to avoid being that team.
Two documents, two audiences
A working DR programme has two artefacts that often get confused. The first is the DR plan: a board-facing document that describes scope, objectives (RTO and RPO), governance, roles, and the testing schedule. It's what your auditor and the FSCA Joint Standard 2 of 2024 want to see. The second is the runbook: an operational document with step-by-step instructions, contact lists, command snippets, and decision trees. It's what the engineer reads at 2 a.m. while the production database is on fire.
Most SA fintechs we audit have the first document, though many of those plans have not been updated in over a year. Few have the second, and the runbooks that do exist are usually a Confluence page describing the system as it was at the time of the last reorganisation.
RTO and RPO: the numbers everyone agrees to and forgets
RTO (Recovery Time Objective) is the maximum acceptable time the system can be down. RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time. RTO 4 hours means "we must be back up within 4 hours of the incident starting." RPO 15 minutes means "we can tolerate losing up to 15 minutes of data."
For each critical system, you need a defensible RTO and RPO that the business has actually signed off on. "ASAP" is not an RTO. "Zero" is not an RPO unless you have synchronous multi-region replication and an entire budget conversation behind it.
For a typical SA fintech, sensible targets might be:
- Customer-facing transaction systems: RTO 1 hour, RPO 5 minutes.
- Internal admin systems: RTO 8 hours, RPO 1 hour.
- Reporting and analytics: RTO 24 hours, RPO 24 hours.
Don't apply the strictest target to everything; you'll spend a fortune backing up data that doesn't need it.
The numbers drive everything else: how often you snapshot, how you architect backups, what your hot-standby looks like, what you put on retainer with vendors. Get the numbers wrong and you build the wrong system.
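To make the link between the targets and the backup schedule concrete, here is a minimal sketch. It uses the example targets from this post; the tier names, `RecoveryTarget`, and `backup_interval_meets_rpo` are hypothetical names chosen for illustration, and the check assumes worst-case data loss is roughly one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Business-approved recovery objectives for one tier of systems."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


# Illustrative tiers using the example targets from the text (assumed names).
TARGETS = {
    "customer_transactions": RecoveryTarget(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    "internal_admin": RecoveryTarget(rto=timedelta(hours=8), rpo=timedelta(hours=1)),
    "reporting_analytics": RecoveryTarget(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def backup_interval_meets_rpo(tier: str, backup_interval: timedelta) -> bool:
    """Return True if the backup interval does not exceed the tier's RPO.

    Assumption: in the worst case you lose everything written since the
    last backup, so the interval is an upper bound on data loss.
    """
    return backup_interval <= TARGETS[tier].rpo


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    for tier in TARGETS:
        print(tier, backup_interval_meets_rpo(tier, hourly))
```

Under these assumptions an hourly snapshot is enough for internal admin and reporting, but not for customer-facing transactions, which need something closer to continuous log shipping or replication.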
Backup architecture that actually survives
Three rules that resolve 80% of the design conversations:
3-2-1. At least three copies of important data, on at least two different storage media, with at least one off-site. Still the correct baseline some two decades after photographer Peter Krogh popularised it.
Immutability. At least one backup must be write-once-read-many for your retention period. If ransomware reaches your live system, it cannot reach your immutable backup. AWS S3 Object Lock and equivalent features at Azure and GCP are the production-grade way to achieve this.
Test restore, not just backup. A backup you have never restored is a backup that probably does not work. Run restore drills on a defined cadence, into a clean environment, and verify that the restored data is queryable and complete.
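The first two rules can be checked mechanically against an inventory of where your data lives. The sketch below is illustrative only: `BackupCopy` and `check_backup_rules` are hypothetical names, the inventory is assumed to be maintained by hand or exported from your tooling, and the live production copy is counted as one of the three copies, which is the usual reading of 3-2-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str    # hypothetical label, e.g. a region or data centre name
    medium: str      # e.g. "block-volume", "object-storage", "tape"
    offsite: bool    # stored away from the primary site
    immutable: bool  # write-once-read-many for the retention period


def check_backup_rules(copies: list[BackupCopy]) -> list[str]:
    """Return the rules the inventory violates; an empty list means it passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than two storage media")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems


if __name__ == "__main__":
    inventory = [
        BackupCopy("primary-site", "block-volume", offsite=False, immutable=False),
        BackupCopy("primary-site", "block-volume", offsite=False, immutable=False),
    ]
    print(check_backup_rules(inventory))
```

A check like this only validates the inventory you give it. It says nothing about whether the copies can actually be restored, which is what the third rule is for.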
The most common failure pattern in fintech DR audits: backups are running, snapshots are accumulating, the dashboards are green, and the team has never restored a full system end-to-end. When the day comes, they discover the backup excluded the WAL files, or the encryption key rotation broke the older snapshots, or the restore takes 14 hours and the RTO is 4.
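A restore drill is only useful if its result is compared against the target. As a rough sketch, assuming you record how long the restore took and a row count captured at backup time, the comparison could look like this. `RestoreDrill` and `evaluate_drill` are hypothetical names, and a row count is just one simple completeness check among many.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RestoreDrill:
    system: str
    restore_duration: timedelta  # measured wall-clock time of the full restore
    expected_rows: int           # row count recorded when the backup was taken
    restored_rows: int           # row count queried from the restored copy


def evaluate_drill(drill: RestoreDrill, rto: timedelta) -> list[str]:
    """Return findings from one drill; an empty list means it met the target."""
    findings = []
    if drill.restore_duration > rto:
        findings.append(
            f"{drill.system}: restore took {drill.restore_duration}, RTO is {rto}"
        )
    if drill.restored_rows != drill.expected_rows:
        findings.append(
            f"{drill.system}: restored {drill.restored_rows} rows, "
            f"expected {drill.expected_rows}"
        )
    return findings


if __name__ == "__main__":
    # The scenario from the text: a 14-hour restore against a 4-hour RTO.
    drill = RestoreDrill("orders-db", timedelta(hours=14), 1_000, 1_000)
    for finding in evaluate_drill(drill, rto=timedelta(hours=4)):
        print(finding)
```

The point of recording findings rather than a pass/fail flag is that each one becomes an action item: a slow restore is an architecture problem, a row mismatch is a backup-scope problem.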
The runbook nobody wants to write
This is the document that turns a DR plan from a compliance exercise into an operational capability. A good runbook has, for each defined incident scenario:
- Detection: how does someone know this is happening?
- Triage: what's the first thing the on-call does to confirm and bound the problem?
- Escalation: who do they call, in what order, with what authority?
- Decision points: the explicit moments where someone has to choose between options, and who gets to make the call.
- Recovery steps: the specific commands, in order.
- Verification: how do you know the recovery worked?
- Communication: what do customers, regulators, and the press get told, and when?
The runbook is opinionated and specific. "Restore the database" is not a runbook step. "Restore from the most recent RDS snapshot via this AWS console URL, wait for status=available, run this SQL script to reindex the orders table, and validate that the row count matches the value posted by incident-bot in the Slack channel #ops-restore" is a runbook step.
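One way to hold steps to that standard is to store them as structured data and reject any step that lacks a concrete command, a completion condition, or a verification. This is a sketch of the idea, not a prescribed format: `RunbookStep` and `missing_fields` are hypothetical names, and the angle-bracket placeholders stand in for the real URL and script path, which only your own runbook can supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookStep:
    action: str          # what the engineer does, in one sentence
    command: str = ""    # exact command, script path, or console URL
    done_when: str = ""  # observable condition that ends the step
    verify: str = ""     # how to confirm the step worked


def missing_fields(step: RunbookStep) -> list[str]:
    """Return the names of the fields that are empty, i.e. left to improvisation."""
    required = ("command", "done_when", "verify")
    return [name for name in required if not getattr(step, name).strip()]


if __name__ == "__main__":
    vague = RunbookStep(action="Restore the database")
    specific = RunbookStep(
        action="Restore from the most recent RDS snapshot",
        command="<AWS console URL from your runbook>",
        done_when="instance status=available",
        verify="row count matches the value posted in #ops-restore",
    )
    print(missing_fields(vague))
    print(missing_fields(specific))
```

Running a check like this in CI against the runbook source is a cheap way to stop vague steps from creeping back in, although it cannot tell you whether the commands themselves still work.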
Tabletop exercises and why they matter
A tabletop exercise is a structured walkthrough of an incident scenario with the team that would actually respond. Two to three hours, a facilitator, a written scenario that unfolds as the exercise progresses, no real systems touched. Each team member describes what they would do, who they would call, what they would say.
The tabletop surfaces three things that no document can. Gaps in the runbook (steps that turn out to be ambiguous, decisions that have no owner, contact details that are out of date). Coordination failures (the security lead thinks legal is calling the regulator, legal thinks the CEO is, and the CEO is on a flight). Authority gaps (the system that requires the CTO to approve a restore, when the CTO is the one whose laptop just got encrypted).
Twice a year minimum. The FSCA Joint Standard 2 of 2024 expects evidence of testing and rehearsal; "we have a plan" is not enough.
The SA-specific exposures
Two failure modes we see disproportionately in South African fintechs:
Single-AZ deployments. Cost pressure pushes teams to run production in a single availability zone, with the assumption that "AWS Cape Town will be fine." It usually is. When it isn't, the cost saved on multi-AZ replication is dwarfed by the cost of being down for hours. For any system with RTO under 4 hours, multi-AZ is the floor, not the upgrade path.
Single-cloud-region with no out-of-region escape. Multi-AZ within Cape Town protects against AZ failure. It does not protect against a region-wide failure, a regulatory action, or a billing dispute that locks your account. For a fintech holding regulated data, an out-of-region cold standby (Frankfurt or Dublin are common) is the right level of paranoia.
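Both exposures are easy to flag from a simple description of each deployment. The sketch below assumes you keep such a description somewhere, whether by hand or exported from infrastructure-as-code. `Deployment` and `exposures` are hypothetical names, and the 4-hour threshold is the rule of thumb from this post rather than any regulatory figure.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Deployment:
    name: str
    rto: timedelta
    primary_region: str                      # e.g. "af-south-1" (Cape Town)
    availability_zones: tuple[str, ...]      # zones production actually runs in
    standby_regions: tuple[str, ...] = ()    # regions holding a cold standby
    holds_regulated_data: bool = False


def exposures(d: Deployment) -> list[str]:
    """Flag the two failure modes described above for one deployment."""
    findings = []
    if d.rto < timedelta(hours=4) and len(set(d.availability_zones)) < 2:
        findings.append("single-AZ with RTO under 4 hours")
    out_of_region = [r for r in d.standby_regions if r != d.primary_region]
    if d.holds_regulated_data and not out_of_region:
        findings.append("no out-of-region standby for regulated data")
    return findings


if __name__ == "__main__":
    payments = Deployment(
        name="payments",
        rto=timedelta(hours=1),
        primary_region="af-south-1",
        availability_zones=("af-south-1a",),
        holds_regulated_data=True,
    )
    print(exposures(payments))
```

Adding a second zone clears the first finding, and adding a standby in, say, `eu-west-1` (Dublin) or `eu-central-1` (Frankfurt) clears the second.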
The other recurring failure: assuming the cloud provider's SLA is your DR plan. The SLA gives you credits. Credits do not restore a corrupted production database. Your DR plan is yours, not your vendor's.
Key takeaways
- A working DR programme has two artefacts: a plan (for the auditor) and a runbook (for the engineer at 2 a.m.). You need both.
- RTO and RPO are business decisions that drive architecture. "ASAP" is not an RTO, and "zero" is not an RPO without synchronous multi-region replication and the budget to match.
- 3-2-1, immutability, and tested restores. Most DR failures are restore failures, not backup failures.
- The runbook has to be specific enough that the on-call engineer can execute it cold. "Restore the database" is not a runbook step.
- Tabletop exercises surface the gaps no document catches. Twice a year minimum; FSCA Joint Standard 2 of 2024 expects evidence of testing.
- For SA fintechs: single-AZ is not DR. Single-region without an out-of-region cold standby is not DR for regulated data.
The teams that recover well from incidents are not the teams with the prettiest plans. They are the teams that have rehearsed the response, kept the runbook current, and tested the restores. Everything else is paperwork.