Causal AI
Correlation tells you what happened. Causal AI tells you why, and whether a decision actually caused the outcome. This is the only rigorous way to know if an agent is improving rather than getting lucky, and the only way to attribute impact to the parts of a decision sequence that actually drove the result.
Autonomy's architecture rests on a single insight: the digital twin is the causal model. A well-calibrated twin is, by construction, a structural causal model in Pearl's sense, and counterfactual replay through it is causal inference, not statistical correlation.
"You cannot answer causal questions with statistical methods alone. To claim that an action caused an outcome, you need a causal model, not just data."
"The next frontier for AI is moving from pattern recognition to causal reasoning. Systems that understand cause and effect will be fundamentally more robust and trustworthy than those that merely learn correlations."
The attribution problem
An agent raises a purchase order. The customer's delivery arrives on time. Did the PO cause the on-time delivery, or would it have arrived on time from existing stock? Without answering this, you cannot know whether the agent made a good decision; you only know the outcome was good.
This is the fundamental problem with measuring decision quality by outcomes alone. Good outcomes can follow bad decisions (luck); bad outcomes can follow good decisions (variance). The only way to separate skill from luck is to ask: what would have happened if the agent had decided differently?
In supply chain that question is harder than it looks. Some outcomes are clean events with a single decision behind them: a PO arrives, a shipment lands on time, a line fills. Other outcomes emerge from the interaction of many decisions over many weeks: an inventory level on a given day, a resource's utilisation over a quarter, end-customer service across a region. The same attribution machinery cannot credibly answer both.
The twin is the causal model
A digital twin under PLAN_PRODUCTION mode encodes the physics of how supply chain decisions propagate to outcomes: lead times, capacity constraints, BOM explosions, allocation rules, inventory dynamics, all of it. When the twin is well-calibrated, it is precisely what Judea Pearl calls a structural causal model: a system that lets you answer do-questions (“what happens if I intervene to do X?”) instead of just observational questions (“what tends to happen when X is observed?”).
That means Autonomy already has the central piece of Causal AI infrastructure other stacks have to build separately. The twin doesn't need to be augmented with a causal engine; the twin is the causal engine. The question becomes architectural: when do you run the twin to attribute a decision, and when is the twin the wrong tool?
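The distinction between observational questions and do-questions can be made concrete with a toy model. The sketch below is purely illustrative: the two-variable world, the `on_time` structural equation, and the `observed_policy` rule are assumptions invented for this example, not part of the twin. It shows a case where logged data suggests expediting makes no difference, while intervening in the model shows that it does.

```python
# Toy structural model. All names here are hypothetical, not Autonomy's API.

def on_time(stock_low: bool, expedite: bool) -> bool:
    """Structural equation: a delivery is late only when stock is low and nothing is expedited."""
    return (not stock_low) or expedite


def observed_policy(stock_low: bool) -> bool:
    """The policy that produced the logged data: expedite exactly when stock is low."""
    return stock_low


WORLDS = [False, True]  # possible values of stock_low, assumed equally likely


def observational_on_time_rate(expedite: bool) -> float:
    """P(on_time | expedite observed): condition on what the policy happened to do."""
    matching = [w for w in WORLDS if observed_policy(w) == expedite]
    return sum(on_time(w, observed_policy(w)) for w in matching) / len(matching)


def interventional_on_time_rate(expedite: bool) -> float:
    """P(on_time | do(expedite)): force the action in every world, ignoring the policy."""
    return sum(on_time(w, expedite) for w in WORLDS) / len(WORLDS)


print(observational_on_time_rate(True), observational_on_time_rate(False))    # 1.0 1.0
print(interventional_on_time_rate(True), interventional_on_time_rate(False))  # 1.0 0.5
```

In the logged data every delivery is on time whether or not it was expedited, so a correlational reading finds no effect. Forcing `expedite=False` in the model halves the on-time rate, which is the answer the do-question is after.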
"A causal model is a tool for answering questions about interventions. You ask: what would happen if I forced this variable to take a different value? The structural model gives the answer."
Three tiers of impact assessment
Each outcome class goes to the tier that can credibly measure it. The three tiers fail in different ways, so they compose rather than compete.
Each tier in detail
Tier 1: Direct observation
Some outcomes are clean events with one decision behind them. The agent commits to a PO with an expected receipt date and quantity. The receipt event eventually fires with an actual date and quantity. Impact is the difference between promise and observation; the attribution is unambiguous because no other agent decision could have changed this specific event's outcome.
The right outcome metrics for Tier 1 are line-level: on-time delivery, line fill, ATP commitment honoured, individual PO accuracy. These are observable in the ERP change-data-capture stream the moment the underlying event happens. No twin replay, no statistical inference, no roadmap dependency: this tier is live today and drives the per-decision BSC reward signal that feeds into agent training.
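A minimal sketch of the promise-versus-observation comparison might look like the following. The `PoPromise` and `Receipt` records, their field names, and the three output metrics are assumptions chosen for illustration; the actual event schema in the change-data-capture stream is not described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Union


@dataclass(frozen=True)
class PoPromise:
    """What the agent committed to when it raised the PO (hypothetical schema)."""
    po_id: str
    expected_date: date
    expected_qty: int


@dataclass(frozen=True)
class Receipt:
    """The receipt event that eventually fires (hypothetical schema)."""
    po_id: str
    actual_date: date
    actual_qty: int


def tier1_outcome(promise: PoPromise, receipt: Receipt) -> Dict[str, Union[bool, int, float]]:
    """Compare one promise with its own receipt; no other decision is involved."""
    if promise.po_id != receipt.po_id:
        raise ValueError("receipt does not belong to this PO")
    if promise.expected_qty <= 0:
        raise ValueError("expected_qty must be positive")
    days_late = (receipt.actual_date - promise.expected_date).days
    return {
        "on_time": days_late <= 0,
        "days_late": max(days_late, 0),
        # Over-receipt is capped so line fill never exceeds 100%.
        "line_fill": min(receipt.actual_qty / promise.expected_qty, 1.0),
    }


promise = PoPromise("PO-1", date(2024, 3, 10), 100)
receipt = Receipt("PO-1", date(2024, 3, 12), 90)
print(tier1_outcome(promise, receipt))
# {'on_time': False, 'days_late': 2, 'line_fill': 0.9}
```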
Tier 2: Twin-based counterfactual replay
Other outcomes are aggregate state at a future point: inventory level at a site on day 30, capacity utilisation on a constrained resource over a shift, fill-rate across a customer segment over a quarter. These outcomes can't be measured against any single decision because many decisions interact to produce them. The agent's order quantity, the carrier's lane choice, the production planner's sequencing, the finance team's expedite approval, all contribute.
The right attribution method is twin-based counterfactual replay. Take the actual decision history. Swap the agent's decision under test for the heuristic baseline (or the planner's override). Replay the twin forward from the swap point to the same future time. Diff the aggregate state. The difference, by construction, is the causal impact of that decision: this is Pearl's do-operator executed on the structural causal model the twin represents.
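The swap, replay, and diff steps can be sketched on a deliberately tiny stand-in for the twin. Everything below is an assumption made for illustration: a single site, a fixed lead time, lost sales when stock runs out, and the function names `replay_twin` and `counterfactual_impact`. The real twin models far more than this.

```python
from typing import Dict, List, Tuple


def replay_twin(initial_on_hand: int, orders: List[int], demand: List[int],
                lead_time: int) -> Tuple[List[int], int]:
    """Deterministic toy twin: returns end-of-day on-hand per day and total units shipped."""
    on_hand = initial_on_hand
    on_hand_history = []
    shipped_total = 0
    for day, day_demand in enumerate(demand):
        placed_on = day - lead_time
        if placed_on >= 0:
            on_hand += orders[placed_on]      # order placed lead_time days ago arrives
        shipped = min(on_hand, day_demand)    # unmet demand is lost, not backordered
        on_hand -= shipped
        shipped_total += shipped
        on_hand_history.append(on_hand)
    return on_hand_history, shipped_total


def counterfactual_impact(initial_on_hand: int, actual_orders: List[int], demand: List[int],
                          lead_time: int, swap_day: int, baseline_qty: int) -> Dict[str, object]:
    """Swap one decision for the baseline, replay, and diff the aggregate state."""
    swapped = list(actual_orders)
    swapped[swap_day] = baseline_qty
    # The toy twin is deterministic, so replaying from day 0 gives the same
    # result as replaying from the swap point: earlier days are identical.
    actual_hist, actual_shipped = replay_twin(initial_on_hand, actual_orders, demand, lead_time)
    base_hist, base_shipped = replay_twin(initial_on_hand, swapped, demand, lead_time)
    return {
        "shipped_delta": actual_shipped - base_shipped,
        "on_hand_delta": [a - b for a, b in zip(actual_hist, base_hist)],
    }


impact = counterfactual_impact(
    initial_on_hand=20, actual_orders=[0, 30, 0, 0, 0, 0], demand=[10] * 6,
    lead_time=2, swap_day=1, baseline_qty=10,
)
print(impact)  # {'shipped_delta': 20, 'on_hand_delta': [0, 0, 0, 20, 10, 0]}
```

The agent's order of 30 on day 1 only shows up in the state from day 3 onward, and its effect on units shipped accumulates across the following days. That delayed, multi-day effect is what a per-event comparison cannot see.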
Today, the agent_counterfactual_reward channel computes a naive counterfactual for overridden decisions: the agent's recommended action is substituted into the actually-observed environment and re-scored. That gives a useful per-decision delta for override-effectiveness calibration, but it is not a structural counterfactual: the twin is not re-simulated, so downstream effects of the swap are not propagated.
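As a rough illustration of what "substituted into the observed environment and re-scored" means, consider the sketch below. The scoring rule, the cost fields, and the function names are hypothetical; only the overall shape (same observed environment, two candidate actions, one delta) is taken from the description above.

```python
from typing import Dict


def score(order_qty: int, observed: Dict[str, int]) -> int:
    """Hypothetical single-period score: penalise shortfall and leftover stock."""
    available = observed["on_hand"] + order_qty
    shortfall = max(observed["demand"] - available, 0)
    leftover = max(available - observed["demand"], 0)
    return -(observed["stockout_cost"] * shortfall + observed["holding_cost"] * leftover)


def naive_counterfactual_delta(agent_qty: int, override_qty: int, observed: Dict[str, int]) -> int:
    """Re-score the agent's action in the environment that was actually observed.

    The environment is held fixed: nothing is re-simulated, so any knock-on
    effect of the different order on later days is ignored.
    """
    return score(agent_qty, observed) - score(override_qty, observed)


observed_env = {"on_hand": 5, "demand": 40, "stockout_cost": 4, "holding_cost": 1}
print(naive_counterfactual_delta(agent_qty=30, override_qty=50, observed=observed_env))  # -5
```

A negative delta means the planner's override scored better than the agent's recommendation in that one observed period. Because later periods are never replayed, the number says nothing about what the extra stock did afterwards.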
A first structural counterfactual is live in narrow scope: inventory-buffer overrides in the SCP plane are now replayed forward through the digital twin with the agent's recommended buffer installed at the swap point. The naive path remains the default everywhere else. Extending twin replay to ATP promises, PO-timing decisions, and the TMS / DP planes is the immediate next step, after which the multi-decision and multi-step extensions (a coherent policy change attributed as a single unit rather than as the sum of its individual swaps) become tractable.
Tier 3: Statistical causal estimators on observational data
Some outcomes are out of reach for the twin. The twin doesn't natively model customer churn dynamics, the cash conversion cycle, working-capital tax timing, or supplier-portfolio risk shifts. For these outcomes the right tool is observational causal inference: Bayesian structural time-series methods in the style of Google's CausalImpact, conditional average treatment effect (CATE) estimators, X-learners, synthetic-control methods over comparable sites.
These methods don't need a twin to run, but they benefit enormously from one: the twin replay from Tier 2 acts as a validation oracle, letting you check that a longitudinal causal estimator agrees with the structural model on outcomes the twin does cover. Where the two disagree, the estimator's assumptions get scrutinised.
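To give a feel for this family of methods, here is a deliberately simplified sketch: a single comparable control site, an ordinary least-squares fit on the pre-intervention period, and the post-period gap between observed and predicted values as the effect estimate. This is a stand-in for illustration only. It is not CausalImpact (which fits a Bayesian structural time-series model) and not a full synthetic-control method. The function names and the tolerance-based oracle check are assumptions.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimated_effect(control_pre: List[float], treated_pre: List[float],
                     control_post: List[float], treated_post: List[float]) -> float:
    """Average post-period gap between the treated site and its predicted counterfactual."""
    slope, intercept = fit_line(control_pre, treated_pre)
    predicted = [intercept + slope * c for c in control_post]
    gaps = [t - p for t, p in zip(treated_post, predicted)]
    return sum(gaps) / len(gaps)


def agrees_with_twin(estimate: float, twin_delta: float, tolerance: float) -> bool:
    """Hypothetical oracle check: flag the estimator when it departs from twin replay."""
    return abs(estimate - twin_delta) <= tolerance


effect = estimated_effect(
    control_pre=[10, 12, 14, 16], treated_pre=[21, 25, 29, 33],
    control_post=[18, 20], treated_post=[42, 46],
)
print(round(effect, 6))                                          # 5.0
print(agrees_with_twin(effect, twin_delta=4.5, tolerance=1.0))   # True
```

When `agrees_with_twin` returns False on a metric the twin does cover, that is the signal to revisit the estimator's assumptions, such as whether the control site is really comparable.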
Tier 3 is on the near-term roadmap. The data model already captures the longitudinal inputs these estimators need (decision history, outcomes, contextual variables); the estimator implementations are the work that remains.
Why three tiers, not one
A single attribution mechanism cannot honestly serve all outcome classes. The three tiers compose because they fail in different ways, and the failure modes don't overlap:
- Tier 1 fails when the outcome isn't a clean event. There is no “the” observation to point at.
- Tier 2 fails when the twin's coverage is incomplete. The twin doesn't model what it doesn't model.
- Tier 3 fails when the observational data is too thin, too noisy, or too confounded by unobserved variables.
By routing each metric to the tier that can credibly measure it, the BSC reward function ends up consuming the most trustworthy number for every outcome. Skill is told from luck on every metric, not just on the ones the simplest method happens to handle.
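The routing rule implied by the three failure modes can be sketched as a small decision function. The `Metric` record, its two flags, and the example metrics' classifications are assumptions for illustration; how Autonomy actually tags metrics is not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DIRECT_OBSERVATION = 1
    TWIN_REPLAY = 2
    STATISTICAL_ESTIMATOR = 3


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric descriptor."""
    name: str
    clean_event: bool    # one decision, one observable event
    twin_covered: bool   # the twin models the dynamics behind this metric


def route(metric: Metric) -> Tier:
    """Send each metric to the first tier whose failure mode does not apply."""
    if metric.clean_event:
        return Tier.DIRECT_OBSERVATION
    if metric.twin_covered:
        return Tier.TWIN_REPLAY
    return Tier.STATISTICAL_ESTIMATOR


for m in (
    Metric("po_on_time_delivery", clean_event=True, twin_covered=True),
    Metric("site_inventory_day_30", clean_event=False, twin_covered=True),
    Metric("customer_churn", clean_event=False, twin_covered=False),
):
    print(m.name, "->", route(m).name)
# po_on_time_delivery -> DIRECT_OBSERVATION
# site_inventory_day_30 -> TWIN_REPLAY
# customer_churn -> STATISTICAL_ESTIMATOR
```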
"Causal inference is essential for any system that makes decisions. If you're optimising based on correlations, you're optimising based on luck, and luck runs out."
Today versus near-term roadmap
Causal AI is one of Autonomy's Four Pillars. The architecture is in place; the implementation is staged.
What's built
- Tier 1 direct attribution for every decision with a clean event-level outcome (PO, TO, shipment, ATP).
- The digital twin runs in PLAN_PRODUCTION (deterministic) and TRAINING (stochastic) modes with the structural-causal-model invariants enforced at runtime.
- Naive single-decision counterfactual for overridden decisions: the agent's action is substituted into the observed environment and re-scored (the agent_counterfactual_reward field). Twin replay is not yet wired into this path.
- Override-effectiveness model per user, fed by the naive counterfactual, used for governance dashboards and execution-agent training-weight calibration.
- Conformal P10 / P50 / P90 bands around plans so impact is measured against probabilistic ground truth, not point forecasts.
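The list above does not say which conformal procedure produces the bands. One common option is split conformal prediction, sketched below under that assumption: absolute residuals from a held-out calibration set give a half-width, and the band is the point forecast plus or minus that half-width. Treating the point forecast as P50 and the 80% band edges as P10 and P90 is a simplification made for this example, and the function names are hypothetical.

```python
from typing import List, Tuple


def conformal_half_width(abs_residuals: List[float], coverage_pct: int) -> float:
    """Split-conformal quantile: the ceil((n + 1) * coverage)-th smallest residual."""
    n = len(abs_residuals)
    rank = ((n + 1) * coverage_pct + 99) // 100   # integer ceiling, avoids float rounding
    if rank > n:
        return float("inf")   # too few calibration points for this coverage level
    return sorted(abs_residuals)[rank - 1]


def p10_p50_p90(forecast: float, abs_residuals: List[float]) -> Tuple[float, float, float]:
    """Symmetric 80% band around the point forecast (assumed to play the role of P50)."""
    half_width = conformal_half_width(abs_residuals, coverage_pct=80)
    return forecast - half_width, forecast, forecast + half_width


calibration_residuals = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # |actual - forecast| on held-out plans
print(p10_p50_p90(100, calibration_residuals))         # (92, 100, 108)
```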
What's coming
- Multi-decision block counterfactual replay (Tier 2 full): attribute a coherent policy shift, not just an individual decision.
- Multi-step horizon replay for aggregate state metrics (inventory levels, capacity utilisation, fill-rate).
- Longitudinal causal estimators (Tier 3): CausalImpact-style Bayesian structural time series, CATE estimators, X-learners.
- The twin used as the validation oracle for the Tier-3 estimators on the metrics it does cover.
"The ability to reason about interventions and counterfactuals is what separates genuine intelligence from pattern matching. A system that can ask 'what if I had done differently?' is fundamentally more capable than one that can only ask 'what happened?'"
Why this matters for autonomy
An autonomous system that learns from correlation will eventually reinforce the wrong behaviours. It will keep expediting when it shouldn't, keep building safety stock that isn't needed, keep making decisions that look good in hindsight but don't actually cause better outcomes. The system gets more confident in its habits while quietly drifting from skill into superstition.
Causal AI is what keeps autonomy honest. It is the mechanism by which we can claim that an agent is improving and mean it, that the agent is making decisions that cause better outcomes, not decisions that happen to coincide with them. Without it, you have automation that scales its own biases. With it, you have a system that can be trusted to take more responsibility over time.
This is why Causal AI is a foundational pillar of the shared world model, not an optional add-on. The full three-tier architecture is the difference between measuring outcomes and attributing them.
"Every autonomous system needs a way to tell skill from luck. Without causal reasoning, you end up with automation that reinforces its own biases."