AI Engineering ← Writing

AI cost optimization for South African businesses: what actually moves the bill

ZAR revenue, USD-priced tokens. Here's how South African businesses cut LLM spend without cutting quality: model tiering, prompt caching, the Batch API, semantic caches, and the governance that keeps finance and POPIA both happy.

Most South African AI bills are not big because the work is hard. They are big because nobody mapped the workloads to the right tier, nobody turned on prompt caching, and nobody questioned whether every request needed to be synchronous. This post is the version of the conversation we have with finance teams in Johannesburg and Cape Town every week. No vendor switching, no theatre, just the levers that actually shift the line item.

Why ZAR revenue and USD-priced tokens change the maths

Every major LLM API prices in US dollars. Your revenue, if you are a South African business, lands in rands. That gap is the part nobody flags until the month-end invoice comes through and the rand has moved a few percent the wrong way. A workload that looked fine at R0.40 to the cent last quarter is suddenly 8 percent more expensive without anyone touching the code.

This means two things in practice. First, your unit economics need to be modelled in USD at the workload level, then translated, not the other way round. If you only see the ZAR total, you cannot tell whether your spend moved because traffic grew, because token usage per request grew, or because the rand weakened. Second, the cost levers that compound (caching, batching, tiering) matter more in a weak-currency context than they do in San Francisco, because every percent you save is also a percent of FX exposure removed.

Token accounting basics every SA finance team should run

Before you optimise anything, you need three numbers per workload. Input tokens per request. Output tokens per request. Requests per day. That is it. Multiply those by the model's input and output prices, sum across workloads, and you have a defensible bill projection in USD. Everything else (caching, batching, tiering) is a modifier on top.

The mistake most teams make is treating the LLM bill as one number. It is not. It is the sum of distinct workloads (customer support chat, document summarisation, internal search, drafting, classification) with very different tolerances. A customer-facing chat needs latency. A nightly bulk classification does not. A weekly compliance review can wait twelve hours. Each of those constraints maps to a different pricing path, and the savings come from refusing to pay synchronous, top-tier prices for asynchronous, simple work.

Model tiering: Haiku, Sonnet, Opus, GPT 4o mini, GPT 4o

Tiering is the single largest lever. Anthropic publishes Claude Haiku 4.5 at $1 per million input tokens and $5 per million output tokens, Claude Sonnet 4.6 at $3 input and $15 output, and Claude Opus 4.7 at $5 input and $25 output. On the OpenAI side, GPT 4o is listed at $2.50 per million input tokens and $10 per million output tokens, and GPT 4o mini at $0.15 per million input and $0.60 per million output.

Read those numbers as a hierarchy, not a menu. Haiku 4.5 is roughly five times cheaper than Sonnet 4.6 on input tokens and twenty five times cheaper than Opus 4.7 on output tokens. GPT 4o mini is roughly seventeen times cheaper than GPT 4o on input. The right question is never "which is the best model". It is "what is the cheapest tier that meets the quality bar for this specific workload". Classification, extraction, routing, and templated summarisation usually run cleanly on the small tier. Reasoning, long-form drafting, and edge cases earn the mid tier. Opus and GPT 4o are for the hard 10 percent, not the easy 90.

There is also a quiet structural win on the high end. Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1 million token context window at standard per-token pricing, so a 900k token request is billed at the same rate as a 9k token request. If you have been splitting long documents into chunks to avoid a context-window surcharge that no longer exists, you are paying for plumbing you do not need.

Prompt caching and the Batch API: where the real discounts live

Once tiering is right, the next two levers are caching and batching. Both are quiet, both are large, and both are widely underused in South African deployments.

Anthropic prompt caching charges 1.25x base input for a 5 minute cache write, 2x for a 1 hour write, and 0.1x base input for a cache read. That last number is the headline: a 90 percent discount on cached tokens. For a system prompt or retrieval context that stays stable across thousands of requests in a session window, you pay the write once and then read at one tenth of the input price for the duration. The Anthropic Batch API delivers a further 50 percent discount on both input and output tokens for asynchronous workloads that can wait up to 24 hours, and the discount stacks with prompt caching multipliers. A high traffic SA support assistant can plausibly run at a fraction of its uncached synchronous cost simply by combining the two.

OpenAI applies automatic prompt caching on supported models for prompt prefixes of at least 1,024 tokens, billing cached input at a reduced rate without any code change. The implication for design is direct: put your stable content (instructions, schemas, retrieval context) at the top of the prompt and your variable content (the user message) at the bottom. Teams that rewrite or shuffle their system prompts per request are paying full price for tokens that could be cached for free. On the asynchronous side, the OpenAI Batch API processes requests within 24 hours at 50 percent of standard input and output pricing, with up to 50,000 requests per batch. Anything that does not need to be synchronous should not be.

Context window discipline, embeddings, and semantic caching

The cheapest token is the one you never send. Context window discipline (retrieving only what you need, summarising long histories, truncating noise) usually outperforms model swaps on the input side of the bill. The fact that the high-end Claude models now offer 1 million tokens at flat rates is not an invitation to stuff the context; it is an invitation to stop fearing the rare long document and tighten the common case.

Embeddings are rarely where the money goes, but they are the foundation for the lever that follows. OpenAI text embedding 3 small is priced at $0.02 per million input tokens and text embedding 3 large at $0.13 per million, with a further 50 percent discount via the Batch API. At that price, embedding your entire knowledge base, your support tickets, and your FAQ history is a rounding error on most SA bills.

What you do with those embeddings is the lever. A semantic cache sits in front of the LLM and serves answers from a vector indexed store when a new query is semantically similar to one already answered. Peer reviewed research on GPT Semantic Cache reports up to 68.8 percent reduction in API calls by serving semantically similar queries from a vector indexed cache. In a South African customer support context, where the same questions about delivery, refunds, account access, and product specs repeat constantly, a semantic cache can turn the bottom half of your traffic into near-zero marginal cost.

Sample monthly cost scenarios for a Johannesburg SME

Consider a Joburg SME running a customer-facing assistant. Every chat turn carries a system prompt and a retrieved context block at the top, then the user's message. If those prefixes are stable and long enough to qualify for caching, the bulk of the input tokens land at the discounted cached rate rather than the full rate. If the same SME also runs a nightly summarisation of the day's tickets, that workload is a clean fit for the Batch API at half price.

Now add tiering. Route the simple classification step (is this billing, technical, or sales) to Haiku 4.5 or GPT 4o mini. Route the customer-facing reply, where tone matters, to Sonnet 4.6. Reserve Opus 4.7 or GPT 4o for the small slice of escalations that genuinely need it. The same business value gets delivered, but the bill stops looking like every request paid the top tier.

The point is not the precise rand figure. The point is that three independent levers (tiering, caching, batching) combine multiplicatively. Halving each one of them is not a 50 percent saving; it is a cumulative compound that rewrites the unit economics.

POPIA, OWASP, and governing AI spend without breaking the law

Cost discipline is also a security and compliance control. The OWASP Top 10 for LLM Applications 2025 lists Unbounded Consumption as a distinct risk category, where attackers or misconfigured clients drive excessive model usage and cost amplification. Rate limits, per-tenant quotas, and circuit breakers are not just engineering hygiene. They are the difference between a normal month and a finance incident triggered by a single misbehaving client or a scripted abuse attempt.

POPIA pulls in the other direction at the same time. Section 72 of POPIA restricts transfers of personal information outside South Africa unless the recipient is bound by laws or agreements offering protection substantially similar to POPIA, which applies when SA businesses send personal data to overseas hosted LLM APIs. That has cost implications. If your cheapest path is a US-hosted endpoint, you still need the lawful basis and contractual posture to send the data there. Cost optimisation that skips compliance is not cheap. It is a deferred fine.

The NIST AI Risk Management Framework defines four functions, Govern, Map, Measure, and Manage, which provide a voluntary structure for cost aware AI governance including resource and consumption tracking. The "Measure" function in particular is where the cost story and the risk story meet: if you cannot measure your token consumption per workload, per tenant, and per model, you cannot govern it, and you cannot defend the choice to the board or the Regulator either.

Want a defensible AI cost baseline?

Our AI Audit engagement maps every workload to a tier, identifies caching and batching candidates, and gives you a documented baseline you can hand to finance, the board, or your Information Officer. POPIA-aligned, OWASP-aware, ZAR-honest.

See the AI Audit service

Key takeaways

  • The fastest way for a South African SME to cut an AI bill is not switching vendors, it is matching each workload to the cheapest tier that meets quality, since Haiku 4.5 is roughly five times cheaper than Sonnet 4.6 on input tokens and twenty five times cheaper than Opus 4.7 on output tokens.
  • Prompt caching on Anthropic gives a 90 percent discount on cached reads and stacks with the 50 percent Batch API discount, which means a high traffic SA support assistant can plausibly run at a fraction of its uncached synchronous cost.
  • OpenAI applies prompt caching automatically on supported models for prompts above 1024 tokens, so SA teams should design system prompts and retrieval context to maximise stable prefixes rather than rewriting them per request.
  • Embedding cost is rarely the issue, but a semantic cache layered on top of embeddings can deflect a large share of repeat queries, with published research showing up to around 68 percent fewer LLM calls in repetitive workloads.
  • Cost discipline is also a security and compliance control, since OWASP flags Unbounded Consumption as a real risk and POPIA Section 72 makes uncontrolled cross border calls to overseas LLMs a regulatory exposure as well as a budget one.

An AI bill that nobody can explain is an AI bill that nobody can defend, either to finance or to the Regulator. The good news is that the levers that actually move the number (tiering, caching, batching, semantic caching, governance) are all available to any South African business without a vendor change. They just have to be wired into the build on purpose, not bolted on after the invoice lands.

RelatedMore writing
AI Strategy12 min read

Why every fintech needs an AI audit before they buy a platform

A paid two-week audit costs less than the first month's licence and tells you whether the platform actually fits.

Read post →
AI Engineering11 min read

AI workflow automation for South African SMEs

Where automation actually pays back for SA SMEs, and where the spend quietly outruns the saving.

Read post →
Defensible AI

Cheaper, on purpose.

Tiered, cached, batched, governed. A defensible AI cost baseline in ZAR, modelled in USD, designed for POPIA.

Book a discovery call See Services