The March of Nines: Why Enterprise AI Needs More Than a 90% Demo

“When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy
The “March of Nines” captures a brutal reality in enterprise AI: reaching that impressive 90% demo success rate is only the beginning. Each additional nine in reliability—99%, 99.9%, 99.99%—demands engineering effort comparable to the first. For enterprise teams, the difference between “usually works” and “operates like dependable software” determines whether AI agents actually get adopted or gather dust in a demo folder.
The Compounding Math Behind the March of Nines
“Every single nine is the same amount of work.” — Andrej Karpathy
Agentic workflows compound failure. Consider a typical enterprise flow: intent parsing, context retrieval, planning, multiple tool calls, validation, formatting, and audit logging. With ten steps, each succeeding 90% of the time, end-to-end success drops to just 35%: roughly two-thirds of your workflows fail.
The math is brutal: if each of n independent steps succeeds with probability p, end-to-end success is p^n. A 10-step workflow at 90% per-step reliability fails 65% of the time. At enterprise scale, that means dozens of interruptions daily.
Here’s what the compounding looks like in practice:
| Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | Practical reality |
| --- | --- | --- | --- | --- |
| 90.00% | 34.87% | 65.13% | ~6.5 interruptions/day | Prototype territory. Most workflows fail. |
| 99.00% | 90.44% | 9.56% | ~1 per day | Demo-ready but still unreliable in production. |
| 99.90% | 99.00% | 1.00% | ~1 every 10 days | Still feels unreliable because misses remain common. |
| 99.99% | 99.90% | 0.10% | ~1 every 3.3 months | Starts to feel like dependable enterprise-grade software. |
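In code, the compounding is a one-liner; this sketch reproduces the table's numbers:

```python
def end_to_end_success(per_step: float, steps: int = 10) -> float:
    """Probability that every step in a linear workflow succeeds (independent steps)."""
    return per_step ** steps

# Reproduce the table: per-step reliability vs. 10-step end-to-end success.
for p in (0.90, 0.99, 0.999, 0.9999):
    ok = end_to_end_success(p)
    print(f"p={p:.2%}  success={ok:.2%}  failure={1 - ok:.2%}")
```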
Define Reliability as Measurable SLOs
“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — Andrej Karpathy
Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. Start with a small set of Service Level Indicators (SLIs) that describe both model behavior and the surrounding system:
- Workflow completion rate (success or explicit escalation)
- Tool-call success rate within timeouts, with strict schema validation on inputs and outputs
- Schema-valid output rate for every structured response (JSON/arguments)
- Policy compliance rate (PII, secrets, and security constraints)
- p95 end-to-end latency and cost per workflow
- Fallback rate (safer model, cached data, or human review)
Set Service Level Objectives (SLOs) per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.
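One lightweight way to make the error budget concrete (a sketch; the SLO name and numbers are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of workflows must complete

    def error_budget(self, total_events: int) -> float:
        """Failures the window can absorb before the SLO is breached."""
        return total_events * (1 - self.target)

    def budget_remaining(self, total_events: int, failures: int) -> float:
        return self.error_budget(total_events) - failures

# Hypothetical high-impact tier: 99.9% completion over 10,000 workflows.
slo = Slo("workflow-completion-high-tier", target=0.999)
budget = slo.error_budget(10_000)       # ~10 allowed failures in the window
left = slo.budget_remaining(10_000, 4)  # ~6 left; freeze risky experiments as this nears zero
```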
Nine Levers That Reliably Add Nines
1) Constrain Autonomy with an Explicit Workflow Graph
Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.
- Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate
- Persist state with idempotent keys so retries are safe and debuggable
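A minimal sketch of that shape, with illustrative node and runner names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    """One bounded step in the workflow graph (fields are illustrative)."""
    name: str
    max_attempts: int
    success: Callable[[dict], bool]  # success predicate over the node's output

def run_workflow(nodes, runners, state):
    """Run nodes in order with bounded retries and explicit terminal outcomes."""
    for node in nodes:
        for _ in range(node.max_attempts):
            out = runners[node.name](state)
            if node.success(out):
                state[node.name] = out
                break
        else:
            # deterministic terminal outcome instead of an open-ended loop
            state["terminal"] = f"failed:{node.name}"
            return state
    state["terminal"] = "completed"
    return state
```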
2) Enforce Contracts at Every Boundary
Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.
- Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes
- Use enums, canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI)
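As an illustration (production systems would typically use a JSON Schema or protobuf library; the tool and field names here are hypothetical), a server-side contract check might look like:

```python
from datetime import datetime

# Hypothetical contract for a "create_refund" tool call.
REFUND_SCHEMA = {
    "customer_id": str,
    "amount_cents": int,   # integer minor units avoid float drift
    "currency": str,       # restricted to the enum below
    "requested_at": str,   # ISO-8601 with timezone
}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}

def validate_refund(payload: dict) -> list:
    """Return a list of violations; empty means the contract holds."""
    errors = []
    for field_name, expected in REFUND_SCHEMA.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected):
            errors.append(f"wrong type for {field_name}")
    if not errors:
        if payload["currency"] not in ALLOWED_CURRENCIES:
            errors.append("currency not in allowed enum")
        try:
            ts = datetime.fromisoformat(payload["requested_at"])
            if ts.tzinfo is None:
                errors.append("requested_at missing timezone")
        except ValueError:
            errors.append("requested_at is not ISO-8601")
    return errors
```

Reject before the tool executes: an invented currency or a naive timestamp never reaches the downstream system.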
3) Layer Validators: Syntax, Semantics, Business Rules
Schema validation catches formatting. Semantic and business-rule checks prevent plausible answers that break systems.
- Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available
- Business rules: approvals for write actions, data residency constraints, and customer-tier constraints
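The three layers compose naturally; a sketch with hypothetical order-validation rules:

```python
def check_syntax(order):
    return [] if isinstance(order.get("quantity"), int) else ["quantity must be an integer"]

def check_semantics(order, known_skus):
    errors = []
    if order.get("sku") not in known_skus:
        errors.append("unknown SKU (referential integrity)")
    if not 0 < order.get("quantity", 0) <= 1_000:
        errors.append("quantity outside numeric bounds")
    return errors

def check_business_rules(order, customer_tier):
    # Hypothetical rule: free-tier customers cannot place bulk orders.
    if customer_tier == "free" and order.get("quantity", 0) > 10:
        return ["bulk orders require a paid tier"]
    return []

def validate(order, known_skus, customer_tier):
    """Run the layers lazily, in order; stop at the first layer that fails."""
    for layer in (
        lambda: check_syntax(order),
        lambda: check_semantics(order, known_skus),
        lambda: check_business_rules(order, customer_tier),
    ):
        errors = layer()
        if errors:
            return errors
    return []
```

Lazy evaluation matters: semantic checks assume the syntax layer passed, so they only run after it.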
4) Route by Risk Using Uncertainty Signals
High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.
- Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing
- Gate risky steps behind stronger models, additional verification, or human approval
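A sketch of such a routing decision (thresholds and path names are illustrative; the confidence signal is assumed to come from one of the sources above):

```python
def route(action_impact: str, confidence: float) -> str:
    """Map (impact, confidence) to an assurance path."""
    if action_impact == "high":
        if confidence < 0.95:
            return "human_approval"
        return "second_model_verify"  # high impact always gets a verifier
    if confidence < 0.80:
        return "stronger_model"
    return "auto"
```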
5) Engineer Tool Calls Like Distributed Systems
Connectors and dependencies often dominate failure rates in agentic systems.
- Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits
- Version tool schemas and validate tool responses to prevent silent breakage when APIs change
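Two of those controls, full-jitter backoff and a consecutive-failure circuit breaker, fit in a few lines (a sketch with illustrative thresholds):

```python
import random
import time

def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full-jitter exponential backoff: a random delay up to the exponential cap."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown_s`."""
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s  # half-open probe

    def record(self, ok: bool, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now
```

Wrap each connector with one breaker per tool, so a flapping dependency fails fast instead of stalling every workflow behind it.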
6) Make Retrieval Predictable and Observable
Retrieval quality determines how grounded your application will be. Treat it like a versioned data product with coverage metrics.
- Track empty-retrieval rate, document freshness, and hit rate on labeled queries
- Ship index changes behind canaries so regressions surface before full rollout
- Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk
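The first two metrics fall directly out of retrieval logs; a sketch assuming a labeled-query set:

```python
def retrieval_metrics(results_by_query: dict, labeled_relevant: dict) -> dict:
    """
    results_by_query: query -> list of retrieved doc IDs
    labeled_relevant: query -> set of doc IDs judged relevant (the labeled set)
    """
    total = len(results_by_query)
    empty = sum(1 for docs in results_by_query.values() if not docs)
    hits = sum(
        1 for q, docs in results_by_query.items()
        if labeled_relevant.get(q, set()) & set(docs)
    )
    return {
        "empty_retrieval_rate": empty / total,
        "hit_rate": hits / total,
    }
```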
7) Build a Production Evaluation Pipeline
The later nines depend on finding rare failures quickly and preventing regressions.
- Maintain an incident-driven golden set from production traffic and run it on every change
- Run shadow mode and A/B canaries with automatic rollback on SLI regressions
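A golden-set gate is a small loop; this sketch assumes each case carries its own check function (names are illustrative):

```python
def run_golden_set(cases, model_fn, required_pass_rate: float = 0.98):
    """
    cases: list of (input, check_fn) pairs harvested from incidents.
    model_fn: the candidate system under test.
    Returns (pass_rate, gate_passed, failures); block the deploy when the gate fails.
    """
    failures = []
    for inp, check in cases:
        try:
            out = model_fn(inp)
            if not check(out):
                failures.append((inp, out))
        except Exception as e:
            failures.append((inp, repr(e)))
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, pass_rate >= required_pass_rate, failures
```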
8) Invest in Observability and Operational Response
Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.
- Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy
- Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation
9) Ship an Autonomy Slider with Deterministic Fallbacks
Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. Treat autonomy as a knob, not a switch, and make the safe path the default.
- Default to read-only or reversible actions, require explicit confirmation (or approval workflows) for writes and irreversible operations
- Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low
- Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents
- Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key
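A sketch of the knob itself, with illustrative levels and action shapes:

```python
from enum import IntEnum

class Autonomy(IntEnum):
    READ_ONLY = 0        # safe default: no side effects
    REVERSIBLE = 1       # writes allowed only if they can be rolled back
    CONFIRMED_WRITE = 2  # irreversible writes need explicit approval
    FULL = 3             # unattended operation for proven workflows

def allowed(action: dict, level: Autonomy, approved: bool = False) -> bool:
    """Decide whether an action may run at the current autonomy level."""
    if action["kind"] == "read":
        return True
    if action["kind"] == "write" and action.get("reversible"):
        return level >= Autonomy.REVERSIBLE
    # irreversible write
    if level >= Autonomy.FULL:
        return True
    return level >= Autonomy.CONFIRMED_WRITE and approved
```

Dialing the level up per tenant, as a workflow earns trust, is the "knob, not a switch" in practice.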
Implementation Sketch: A Bounded Step Wrapper
A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.
```python
def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all retries under one span
    span = start_span(name)
    safer = False
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can't stall the workflow
            with deadline(timeout_s):
                out = attempt_fn(mode="safer") if safer else attempt_fn()
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: remaining attempts run in "safer" mode (lower temp / stricter prompt)
            span.log({"attempt": attempt, "err": str(e)})
            safer = True
    # fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
```
Why Enterprises Insist on the Later Nines
Reliability gaps translate into business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.
Closing Checklist
- Pick a top workflow, define its completion SLO, and instrument terminal status codes
- Add contracts + validators around every model output and tool input/output
- Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries)
- Route high-impact actions through higher assurance paths (verification or approval)
- Turn every incident into a regression test in your golden set
The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.
Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.
Tags
AI reliability, enterprise AI, March of Nines, system reliability, AI engineering, workflow automation, agentic systems, production AI, SLOs, error budgets, observability, validation, fallbacks, autonomy control, distributed systems, AI safety, enterprise software, workflow reliability, system architecture