Learning lifecycle
How Agents Learn
Autonomy agents arrive trained. The narrow per-decision execution agents are pre-trained on a synthetic generic corpus that the platform generates itself; no customer data is involved. The graph-based tier agents (Strategic L4, Tactical L3, Operational L2) are trained on your supply chain DAG before go-live, using a discrete event simulator that exposes them to many plausible realities of how your network behaves. The training data comes from deterministic engines modelled on ERP logic, acting as the Phase-1 teacher, but the agents are never given the heuristics' rules. They learn by watching the teacher play, the way a modern chess engine learns by watching games rather than by reading a rule book.
The result is agents that are competent on day one, calibrate against your live operations in the first weeks, and continue to improve under AI·IO·ML governance as your business evolves.
"This is not automation (same tasks, faster). It's inversion: the structural shift in who performs economic work."
"A smaller model with sufficient data outperforms a larger model with insufficient data on rule learning and generalization."
Two training tracks, one platform
Autonomy uses two different agent families, and they learn in different ways. Treating them as one undersells what's actually happening.
Execution agents, pre-trained on a generic corpus
The narrow per-decision models (Inventory Buffer, PO Creation, ATP, Forecast Baseline, Demand Sensing, Order Tracking, Rebalancing, and more) are trained once on a representative synthetic corpus that the platform generates itself. No customer data is involved in this pre-training. When a customer goes live, the right checkpoint is drawn from the registry and warm-started for that site.
From there each execution agent walks a three-phase curriculum that adapts to the customer's specifics. The generic corpus gives the model its decision shape; the local data teaches it the customer's context.
Tier agents, trained on your DAG
The graph agents (Strategic (L4) Policy Optimisation model, Tactical (L3) Domain-Model Reconciliation models, Operational Node Coordinator) cannot be warm-started generically. The whole point is that they learn the topology, lead-times, capacities, and substitution behaviour of your network. They are trained per-customer.
The training data is generated by a discrete event simulator that runs scenarios against the customer's actual supply chain DAG, with deterministic engines modelled on ERP logic acting as the teacher. No human supervision is required to produce this data.
The key technique
Learn by watching, not by being told the rules
The most important thing to understand about Autonomy's training is what the agents are not given. They are not given the formula for base-stock ordering. They are not given the EOQ equation. They are not given the safety-stock allocation rule. The deterministic heuristics modelled on ERP logic are the teacher, not the syllabus. The agents watch the teacher play through thousands of scenarios on the digital twin and learn the underlying decision shape from observation alone.
That distinction is everything. An agent that internalised the heuristic's rule could never beat the heuristic; at best, it would reproduce it. Because Autonomy's agents learn by watching outcomes across many scenarios, they discover interactions the rule cannot encode: between lead-time, demand shape, supplier reliability and seasonal regime. The teacher sets the floor. Phase 3 outcome optimisation lifts the agent above it.
"AlphaZero learned chess, shogi and Go to superhuman strength from random play, with no game-specific knowledge except the rules of the game."
The same training pattern in other complex domains
This is not a novel idea Autonomy invented. It is the dominant training pattern behind most of the last decade's machine-learning breakthroughs in domains too complex for hand-coded rules. A few of the cleanest examples:
AlphaZero · DeepMind, 2017
No opening books, no piece-value tables, no hand-coded position evaluators. Within hours of self-play, AlphaZero defeated the strongest classical chess engines that encoded decades of grandmaster knowledge as rules. Every modern top engine (Stockfish NNUE, Leela) now follows the same learn-from-games pattern.
Wayve, embodied AI for driving
Wayve's autonomous-driving foundation models are not given the rules of the road as policy. They learn end-to-end from camera input by watching human drivers and simulator rollouts, generalising across vehicle types and cities without rewriting the policy stack. The same learn-by-watching pattern underpinned Tesla's 2024 rewrite of FSD as an end-to-end network and Waymo's later-generation models.
DeepMind × EPFL Tokamak · 2022
A fusion-reactor controller for the TCV tokamak at EPFL Lausanne. It is not a hand-coded control law but an RL agent that watched a plasma simulator respond to control actions, then was deployed onto a real reactor. Published in Nature; the same pattern Autonomy uses for tactical agents on the SC twin.
JPMorgan LOXM · quant execution desks
RL execution agents like JPMorgan's LOXM are not given a "buy low, sell high" rule. They watch market simulators and historical order books, learning when to split, accelerate or pause a trade. Renaissance, Two Sigma and DE Shaw follow the same pattern: no hand-coded strategy, only learned policies trained on observed market behaviour.
Why supply chain finally fits
Supply chain decisions look nothing like a chess board, a tokamak, or a robot arm. The training pattern is the same: a teacher capable of competent play, a simulator that lets the student watch many games, a reward signal grounded in real outcomes, and a learned policy that surpasses the teacher by finding patterns the teacher's rules cannot encode. The digital twin is the simulator. The ERP-logic heuristics are the teacher. The BSC reward is the outcome signal. Everything Autonomy needs to apply the chess-engine pattern to your supply chain is in place.
The teacher reasons under uncertainty, not hindsight
One detail separates a policy that transfers from one that memorises. The teacher solves each scenario on the information a live agent will actually have at the moment of decision: the forecast of record and its calibrated confidence band, never the demand that turned out. A teacher handed the answer in hindsight teaches the student to reproduce that answer; a teacher that reasons under calibrated uncertainty teaches the student to weigh evidence, which is what carries to conditions it has never seen. Google Research published independent evidence for this in 2026: an AI taught to reason like a Bayesian, maintaining and updating its uncertainty, generalised better to entirely new domains than one taught from a perfect-hindsight oracle. It is the same principle Autonomy's calibrated conformal bands put at the centre of how every agent is trained.
The cognitive cycle
OODA, ORPA, and where Autonomy puts learning
A human operator running a facility moves through a loop on every consequential decision: observe the state, orient against context, decide, act, and start again. Boyd named this OODA. XMPro's MAGS architecture reformulates it for industrial AI agents as ORPA: Observe, Reflect, Plan, Act. The cycles are nearly the same: ORPA splits Boyd's Orient into a distinct Reflect step because in an agent architecture the sense-making computation deserves its own slot.
Autonomy's agents follow the same cognitive shape. Each agent observes canonical state, evaluates that state against its trained policy and the conformal calibration layer, plans an action within its envelope, and acts. The cycle is structurally identical to ORPA. The difference is not the shape of the cycle. It is where the learning happens.
| Step | OODA (Boyd) | ORPA (XMPro / MAGS) | Autonomy |
|---|---|---|---|
| Observe | Sensors, environment | DataStream inputs (engineering signals, calculations) | Canonical state read by the agent's observation hooks |
| Orient / Reflect | Update internal model | LLM-mediated reflection writes into the agent's memory stream | Trained policy forward pass + conformal interval lookup |
| Plan / Decide | Decide | Plan generated via LLM advice within parametric guardrails | Action selected from the trained policy |
| Act | Act | Configured Action Agents write to the physical system | Decision Trace row emitted, canonical state written |
| Learn | Embedded inside Orient | Embedded inside Reflect (LLM memory-stream loop) | Out of the loop. Parametric RL retraining + conformal recalibration + causal posterior updates + LLM-augmented hypothesis pipeline, each on its own cadence. |
The cycles are nearly identical. The learning placement is not.
OODA and ORPA both bake learning into the inner loop. Boyd's Orient step is where the observer's model gets updated; ORPA's Reflect step writes reflections into a memory stream that the next cycle reads. Each cycle does decision-making and learning in the same breath.
Autonomy lifts learning out of the cycle deliberately. The cognitive loop stays bounded, fast, and deterministic against the trained policy. Learning happens in a separate substrate: RL retraining runs on cadence against realised outcomes; conformal intervals recalibrate whenever measured coverage drifts; the causal layer updates override-effectiveness estimands as overrides accumulate; the LLM proposes hypothesis-axis CANDIDATEs that route to a human curator. None of those updates compete for the decision's compute budget; all of them are calibrated and outcome-supervised.
In-loop learning competes for compute with the decision itself and bakes the update into whatever the model's reasoning happened to be in that moment. Out-of-loop learning gets the cross-event time and aggregation it needs to be calibrated.
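The separation can be pictured as a decision function that reads a trained policy and a conformal lookup but never updates either. The following is a minimal sketch under stated assumptions, not Autonomy's implementation: `decide`, `DecisionTrace`, `band_lookup` and the scalar action are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Band = Tuple[float, float, float]  # (P10, P50, P90)


@dataclass(frozen=True)
class DecisionTrace:
    """Hypothetical stand-in for one Decision Trace row."""
    state: State
    action: float
    band: Band


def decide(
    state: State,
    policy: Callable[[State], float],
    band_lookup: Callable[[State], Band],
    envelope: Tuple[float, float],
    trace_log: List[DecisionTrace],
) -> DecisionTrace:
    """One bounded cycle: observe, evaluate, plan, act. No learning happens here."""
    proposed = policy(state)                 # trained-policy forward pass
    band = band_lookup(state)                # conformal interval lookup
    low, high = envelope
    action = min(max(proposed, low), high)   # keep the plan inside the envelope
    trace = DecisionTrace(dict(state), action, band)
    trace_log.append(trace)                  # act: emit the trace row
    return trace


# Usage: the policy is a frozen callable; nothing in decide() mutates it.
log: List[DecisionTrace] = []
result = decide(
    {"on_hand": 40.0},
    policy=lambda s: 120.0,
    band_lookup=lambda s: (90.0, 110.0, 130.0),
    envelope=(0.0, 100.0),
    trace_log=log,
)
```

In this picture, retraining and recalibration jobs would read `trace_log` on their own schedule and publish a new `policy` or `band_lookup`; the decision path itself stays read-only.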
AI·IO·ML sits around the cycle, not inside it
AI·IO·ML, Autonomy's operating model, is not a replacement for OODA or ORPA. The agent still runs its own cognitive cycle on every decision. AI·IO·ML is the contract that governs the interaction between the substrate and the operator across three couplets: the agent acts (Automate, Inform), the human engages (Inspect, Override), and the system improves (Measure, Learn). The agent's cycle is its private business; AI·IO·ML is what the operator sees and acts on, and what the system measures and learns from. The same contract also governs the learning loop itself: parametric updates Automate when calibrated, Inform when they cross meaningful thresholds, and LLM-proposed CANDIDATEs always Inspect until the confidence head is calibrated.
The learning lifecycle
Pre-deployment training, live-operations calibration, then continuous improvement.
Mapping to the deployment posture: Decision Support → Augmentation → Automation.
Stage 1: Pre-Deployment, Twin Training
Deployment posture: Decision Support
Before an agent makes a single live decision, the platform has already built a digital twin of your supply chain DAG and used it to train the tier agents. A discrete event simulator runs your network forward under many plausible realities; deterministic engines modelled on ERP logic act as the teacher; the resulting (state, action, outcome) trajectories are what the tier agents learn from.
The execution agents do not need this customer-specific phase. They arrive warm-started from a generic corpus and walk their three-phase curriculum once live data is available.
Multiple realities, not Monte Carlo
Autonomy does not plan by Monte Carlo. Uncertainty around a plan is quantified by conformal P10/P50/P90 bands at inference time, not by re-running the twin under noise. The twin's role is to be a training environment, and its scenario sampler is curriculum-driven, not enumerated. It draws from three scenario types:
- Baseline: real history replayed with small stochastic perturbations (demand, lead-time).
- Single-event: one disruption, such as a supplier outage, capacity shock, lane collapse, lead-time inflation, NPI ramp, EOL wind-down, carrier strike, or weather.
- Compound: two or three simultaneous adversarial events; tests cross-domain coordination.
The curriculum shifts the mix over training, weighting baseline more heavily early (so the agent learns the dominant dynamics first) and tilting toward single-event and compound disruption later (so it learns to recover). Scenarios are stratified across four seasonal regimes (peak, ramp-down, trough, ramp-up) so the agent sees every season your business actually goes through.
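One way to picture a curriculum-driven sampler is a scenario mix that is interpolated over training progress, with seasonal regimes visited in rotation. This is a sketch only: the early and late mix weights, the linear interpolation, and the round-robin stratification are all assumptions made for illustration, not values or mechanics taken from the platform.

```python
import random
from typing import Dict, Tuple

SCENARIO_TYPES = ("baseline", "single_event", "compound")
SEASONAL_REGIMES = ("peak", "ramp_down", "trough", "ramp_up")

# Hypothetical mixes: baseline-heavy early, disruption-heavy late.
EARLY_MIX = {"baseline": 0.7, "single_event": 0.2, "compound": 0.1}
LATE_MIX = {"baseline": 0.2, "single_event": 0.4, "compound": 0.4}


def scenario_mix(progress: float) -> Dict[str, float]:
    """Linearly interpolate the scenario mix for progress in [0, 1]."""
    p = min(max(progress, 0.0), 1.0)
    return {k: (1 - p) * EARLY_MIX[k] + p * LATE_MIX[k] for k in SCENARIO_TYPES}


def sample_scenario(step: int, total_steps: int, rng: random.Random) -> Tuple[str, str]:
    """Draw a scenario type from the current mix and pick a seasonal regime."""
    mix = scenario_mix(step / max(total_steps - 1, 1))
    kind = rng.choices(SCENARIO_TYPES, weights=[mix[k] for k in SCENARIO_TYPES])[0]
    # Round-robin over regimes so every season is covered equally often.
    regime = SEASONAL_REGIMES[step % len(SEASONAL_REGIMES)]
    return kind, regime


rng = random.Random(0)
schedule = [sample_scenario(step, 8, rng) for step in range(8)]
```

Stratifying by rotation rather than by random draw guarantees that no regime is starved in a short training run, which is the property the paragraph above asks for.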
Heuristics teach, they don't dictate
The deterministic engines that supervise Phase-1 training encode the same kind of rules your ERP already runs: base-stock-like ordering, lead-time-aware lot sizing, safety-stock allocation by percentage of on-hand, and supplier-reliability-weighted re-order. But the agent never sees the rule. It sees the rule's output (what the heuristic chose to do in each scenario) and learns the underlying decision shape by example. This is the chess-engine pattern applied to inventory and supply: the rule sets the floor; the learned policy rises above it.
"Like a chess program that recognises common openings but makes illegal moves in novel positions, an agent trained on too little data will fail on unfamiliar situations. The twin's job is to push agents past that threshold before they ever touch live decisions."
By the time go-live happens, the tier agents have observed hundreds of thousands of simulated decision points across thousands of scenarios, not on a generic dataset, not on Monte Carlo enumeration, but on simulated rollouts of your network under disruptions calibrated to your history. Day-one performance matches or exceeds the heuristic baseline that runs in your ERP today.
Stage 2: Calibration, On-the-Job Adaptation
Deployment posture: Decision Augmentation
Go-live is when the agents stop training in simulation and start refining against reality. The inversion begins here. Agents make decisions inside guardrails; planners shift from making every decision to Inspecting and Overriding agent outputs. Both signals, overrides and measured outcomes, route into continued learning.
The execution agent three-phase curriculum
Each execution agent walks the same three-phase progression at its own site, in order. Earlier phases never stop being usable; later phases add capability when their data threshold is reached.
- Phase 1: Engine imitation. Behavioural cloning on the deterministic ERP-logic engine. Always available; no live data required.
- Phase 2: Context learning. Supervised learning where planner overrides serve as expert labels. Activates when ~500+ expert decisions are available.
- Phase 3: Outcome optimisation. Conservative offline RL on observed business outcomes (BSC reward). Activates when ~1000+ outcome records are available.
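The activation logic of the three phases can be sketched as a simple gate on available data. The thresholds below are the approximate figures quoted above; the names `SiteData` and `active_phases`, and the reading that Phase 3 only activates once Phase 2 has, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

# Approximate thresholds from the text ("~500+" and "~1000+").
EXPERT_DECISION_THRESHOLD = 500
OUTCOME_RECORD_THRESHOLD = 1000


@dataclass(frozen=True)
class SiteData:
    """Hypothetical per-site counters for one execution agent."""
    expert_decisions: int   # planner overrides usable as labels
    outcome_records: int    # observed business outcomes


def active_phases(data: SiteData) -> List[str]:
    """Return the curriculum phases usable at this site, in order."""
    phases = ["engine_imitation"]  # always available, needs no live data
    if data.expert_decisions < EXPERT_DECISION_THRESHOLD:
        return phases
    phases.append("context_learning")
    # Assumption: the phases are walked in order, so Phase 3 follows Phase 2.
    if data.outcome_records >= OUTCOME_RECORD_THRESHOLD:
        phases.append("outcome_optimisation")
    return phases


day_one = active_phases(SiteData(expert_decisions=0, outcome_records=0))
mature = active_phases(SiteData(expert_decisions=800, outcome_records=2500))
```

Earlier phases stay in the returned list as later ones are added, mirroring the point that earlier phases never stop being usable.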
Tier agents retrain on a regular cadence
The graph agents that were warm-trained in the twin now also see real operational data. The tactical (L3) Domain-Model Reconciliation models retrain daily on the previous day's transactions; the strategic L4 Policy Optimisation model retrains weekly on the rolled-up consensus. The Node Coordinator, which modulates cross-agent urgency at each site, updates hourly.
Override classifier, not direct training
Planner overrides do not retrain the model directly. An override classifier routes each override to the right destination: a context-learning label for the execution agent, a feature signal for the Node Coordinator, or an escalation for the tactical (L3) Domain-Model Reconciliation model if the override pattern indicates a structural change. A separate override-effectiveness model tracks per-planner override effectiveness for governance dashboards; it does not feed the training loop.
Reward comes from the BSC, not from matching humans
The reward signal in Phase 3 is BSC utility on actual business outcomes (cost, service level, inventory efficiency, supply smoothness), minus penalties for capacity or SLA violations. Agents are not rewarded for matching what a human would have done; they are rewarded for outcomes the business cares about.
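As a rough illustration of that reward shape, the sketch below computes a weighted utility over the four named outcome dimensions and subtracts violation penalties. The equal weights, the penalty sizes, the 0-to-1 normalisation, and the names `Outcome` and `bsc_reward` are all hypothetical; in practice the weights would come from the tenant's own BSC configuration.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical weights over the four outcome dimensions named in the text.
WEIGHTS: Dict[str, float] = {
    "cost": 0.25,
    "service_level": 0.25,
    "inventory_efficiency": 0.25,
    "supply_smoothness": 0.25,
}
CAPACITY_PENALTY = 0.5  # assumed penalty per capacity violation
SLA_PENALTY = 0.5       # assumed penalty per SLA violation


@dataclass(frozen=True)
class Outcome:
    scores: Dict[str, float]   # each dimension normalised to 0..1, higher is better
    capacity_violations: int
    sla_violations: int


def bsc_reward(outcome: Outcome) -> float:
    """Weighted utility on realised outcomes, minus violation penalties."""
    utility = sum(WEIGHTS[k] * outcome.scores[k] for k in WEIGHTS)
    penalty = (CAPACITY_PENALTY * outcome.capacity_violations
               + SLA_PENALTY * outcome.sla_violations)
    return utility - penalty


perfect = Outcome({k: 1.0 for k in WEIGHTS}, capacity_violations=0, sla_violations=0)
breached = Outcome({k: 1.0 for k in WEIGHTS}, capacity_violations=0, sla_violations=1)
```

Note that nothing in the function refers to what a planner chose; only realised outcomes and violations enter the reward.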
Impact measurement, from direct to causal
For outcomes with clean event-level evidence (a PO arrives, a shipment is on-time, a line is filled), impact is measured directly against the agent's promise. For outcomes that emerge from many interacting decisions (network inventory levels, capacity utilisation, end-customer service), direct measurement is not enough, because a single decision's contribution is confounded with everything else happening on the network.
Today, Autonomy measures override effectiveness via a naive counterfactual: when a planner overrides an agent, the agent's recommended action is substituted into the observed environment and re-scored (the agent_counterfactual_reward channel).
A first structural counterfactual is live in narrow scope: inventory-buffer overrides in the SCP plane are replayed forward through the digital twin with the agent's recommended buffer installed at the swap point, and the resulting service-level / inventory trajectory is what the override is scored against. Extending this to all overrides (full Causal AI, with longitudinal causal estimators where the twin is too coarse) is on the near-term roadmap and is one of Autonomy's Four Pillars.
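The naive counterfactual can be illustrated on a toy single-period inventory decision: hold the observed demand fixed, swap the planner's quantity for the agent's, and re-score. Everything here is an assumption for illustration (the reward function, the cost parameters, and the function names); only the idea of substituting the agent's action into the observed environment comes from the text.

```python
def period_reward(
    on_hand: float,
    order_qty: float,
    demand: float,
    holding_cost: float = 1.0,
    stockout_cost: float = 5.0,
) -> float:
    """Toy reward: negative cost of leftover stock plus unmet demand."""
    available = on_hand + order_qty
    leftover = max(available - demand, 0.0)
    shortfall = max(demand - available, 0.0)
    return -(holding_cost * leftover + stockout_cost * shortfall)


def override_effect(
    on_hand: float,
    observed_demand: float,
    planner_qty: float,
    agent_qty: float,
) -> float:
    """Realised reward of the override minus the agent's counterfactual reward.

    The environment (observed demand) is held fixed, which is what makes this
    counterfactual naive: it ignores how a different action might have changed
    later states.
    """
    realised = period_reward(on_hand, planner_qty, observed_demand)
    agent_counterfactual_reward = period_reward(on_hand, agent_qty, observed_demand)
    return realised - agent_counterfactual_reward


# Planner ordered 40, agent had recommended 30, demand turned out to be 50.
effect = override_effect(on_hand=10.0, observed_demand=50.0, planner_qty=40.0, agent_qty=30.0)
```

A positive value means the override scored better than the agent's recommendation would have on the same observed demand; a structural counterfactual would instead replay the swap forward through the twin.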
The Operating Knowledge layer captures the behavioural patterns that emerge from this calibration: how your planners interpret exceptions, which supplier signals matter, which override patterns consistently improve outcomes. That layer is what makes the agents specific to your operation, not just generically good at supply chain. Together with the per-decision audit (Decision Trace) and the substrate's compiled operating policy (Learned Judgment), Operating Knowledge is one of three buckets that the How Agents Learn whitepaper walks through end to end.
"We know more than we can tell. GenAI provides the first technologically tractable mechanism to capture the experiential ontology, the behavioural knowledge that experienced planners carry, before it's lost."
Stage 3: Continuous, CDC-Driven Retraining
Deployment posture: Decision Automation
Operations change. Suppliers shift lead times, demand patterns evolve seasonally, new products arrive, old products retire. An agent trained on last quarter's dynamics will drift if it is not maintained. Autonomy detects drift and retrains automatically.
Baseline cadences
The platform retrains on a fixed schedule regardless of whether drift is detected. Conformal coverage drift detection runs independently of that schedule and can accelerate any of these cadences when it triggers.
These cadences are defaults. Every tenant has its own training-cadence configuration, so a distributor with high-velocity demand can tighten execution-agent and L3-model retraining to every six hours, while a pharma manufacturer with stable, regulated planning can push its S&OP cadence out further. The cadence knob is per-agent-type and per-tenant, not a platform constant.
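A per-agent-type, per-tenant cadence can be sketched as platform defaults overlaid with tenant overrides. The sketch below uses the tier-agent cadences quoted in Stage 2 (weekly, daily, hourly) as stand-in defaults; the key names and the `resolve_cadence` helper are hypothetical, and execution agents are left out because the text does not state their default.

```python
from datetime import timedelta
from typing import Dict, Mapping

# Stand-in defaults based on the cadences quoted earlier; key names are hypothetical.
DEFAULT_CADENCE: Dict[str, timedelta] = {
    "l4_policy_optimisation": timedelta(weeks=1),
    "l3_reconciliation": timedelta(days=1),
    "node_coordinator": timedelta(hours=1),
}


def resolve_cadence(tenant_overrides: Mapping[str, timedelta]) -> Dict[str, timedelta]:
    """Overlay one tenant's overrides on the platform defaults."""
    unknown = set(tenant_overrides) - set(DEFAULT_CADENCE)
    if unknown:
        raise KeyError(f"unknown agent types: {sorted(unknown)}")
    return {**DEFAULT_CADENCE, **tenant_overrides}


# A high-velocity distributor pulls L3 retraining in to every six hours.
distributor = resolve_cadence({"l3_reconciliation": timedelta(hours=6)})
```

Rejecting unknown agent types keeps a typo in a tenant configuration from silently leaving a default in force.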
Regression guard
Retraining is not deployed blindly. Every candidate checkpoint is evaluated against the current production model on held-out trajectories. If the new model regresses on BSC utility, it is discarded; only checkpoints that match or improve get promoted.
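The promotion rule reduces to a comparison of mean utility on the same held-out trajectories. A minimal sketch, assuming a caller-supplied `utility(model, trajectory)` scoring function; the function name `should_promote` and the use of a plain mean are illustrative choices, not the platform's evaluation protocol.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")
Trajectory = TypeVar("Trajectory")


def should_promote(
    candidate: Model,
    production: Model,
    held_out: Sequence[Trajectory],
    utility: Callable[[Model, Trajectory], float],
) -> bool:
    """Promote only if the candidate matches or beats production on held-out data."""
    candidate_score = mean([utility(candidate, t) for t in held_out])
    production_score = mean([utility(production, t) for t in held_out])
    return candidate_score >= production_score


# Toy usage: a "model" is a single number and utility is negative absolute error.
def toy_utility(model: float, target: float) -> float:
    return -abs(model - target)


held_out_targets = [1.0, 2.0, 3.0]
promote_better = should_promote(2.0, 5.0, held_out_targets, toy_utility)
promote_worse = should_promote(9.0, 5.0, held_out_targets, toy_utility)
```

Scoring both models on identical trajectories is what makes the comparison fair; a candidate that regresses is simply never returned as promotable.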
Conformal prediction, never Monte Carlo
Uncertainty around any plan is quantified by conformal P10/P50/P90 bands at inference time. The digital twin is never re-run under noise to "Monte Carlo a plan". This is enforced as an invariant in the platform: the twin's PLAN_PRODUCTION mode raises an error if any stochasticity knob is left on. Drift detection then watches the empirical coverage of those bands: if the predictor promised 90% coverage and the observed coverage falls materially below it, the bands are recalibrated and the upstream policy is retrained.
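The coverage check can be illustrated with textbook split conformal prediction. This sketch uses a symmetric band built from absolute residuals, which is a simplification of asymmetric P10/P50/P90 bands, and the 5-point tolerance for "materially below" is an assumption; none of the function names come from the platform.

```python
import math
from typing import Sequence, Tuple


def conformal_halfwidth(abs_residuals: Sequence[float], coverage: float) -> float:
    """Split-conformal band half-width with the finite-sample correction."""
    n = len(abs_residuals)
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        raise ValueError("not enough calibration points for this coverage level")
    return sorted(abs_residuals)[rank - 1]


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]],
    actuals: Sequence[float],
) -> float:
    """Fraction of realised values that fell inside their predicted interval."""
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, actuals))
    return hits / len(actuals)


def coverage_drifted(observed: float, promised: float, tolerance: float = 0.05) -> bool:
    """Flag drift when observed coverage falls materially below the promise."""
    return observed < promised - tolerance


# Calibrate on 19 held-out residuals, then monitor ten live outcomes.
halfwidth = conformal_halfwidth([float(r) for r in range(1, 20)], coverage=0.9)
live_intervals = [(0.0, 10.0)] * 10
live_actuals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 11.0, 12.0, 13.0]
observed = empirical_coverage(live_intervals, live_actuals)
needs_recalibration = coverage_drifted(observed, promised=0.9)
```

Because the check only compares promised and observed coverage, it needs no re-simulation: realised outcomes alone decide whether the bands are recalibrated.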
Escalation
If retraining at one tier cannot close the drift, for example after a structural change such as a supplier permanently doubling lead times, the failure escalates upward. The L1 execution agent cannot fix policy; the tactical (L3) Domain-Model Reconciliation model re-optimises buffer levels; the strategic (L4) Policy Optimisation model re-evaluates network design. The escalation chain runs automatically.
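The escalation chain can be sketched as an ordered walk over tiers, stopping at the first one whose retraining brings drift back within tolerance. The tier callables, the scalar drift measure, and the `escalate` function are all hypothetical stand-ins; the sketch only shows the control flow.

```python
from typing import Callable, List, Optional, Tuple

# Each tier is (name, retrain), where retrain takes the current drift and
# returns the drift that remains afterwards. Both are hypothetical stand-ins.
Tier = Tuple[str, Callable[[float], float]]


def escalate(drift: float, tiers: List[Tier], tolerance: float) -> Optional[str]:
    """Walk the tiers upward; return the first tier that closes the drift."""
    for name, retrain in tiers:
        drift = retrain(drift)
        if drift <= tolerance:
            return name
    return None  # no tier closed the drift


# Toy structural change: execution-level retraining cannot remove any of it.
tiers: List[Tier] = [
    ("L1 execution", lambda d: d),          # cannot fix policy
    ("L3 tactical", lambda d: d * 0.1),     # re-optimises buffer levels
    ("L4 strategic", lambda d: 0.0),        # re-evaluates network design
]
resolved_by = escalate(drift=0.5, tiers=tiers, tolerance=0.1)
```

Returning `None` when every tier fails makes the unresolved case explicit rather than silently attributing it to the top tier.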
Provisioning pipeline
What runs once, before a customer goes live.
1. Ingest SC DAG (sites, products, BOMs, lanes, capacities) from ERP
2. Build digital twin; configure ERP-logic heuristics as Phase-1 teachers
3. Calibrate stochastic distributions (demand CV, lead-time CV) to history
4. Train L4 Policy Optimisation model on aggregated weekly consensus
5. Train demand forecasting agent on history + seasonal regimes
6. Derive policy parameters (network θ, guardrails, KPI targets)
7. Train tactical (L3) Domain-Model Reconciliation models on DES rollouts (curriculum + seasonal regimes)
8. Train Node Coordinator for cross-agent urgency modulation
9. Warm-start execution agents from generic corpus registry; bind to sites
10. Generate initial supply plan (Path C: unconstrained reference, then constrained_live)
11. Rough-cut capacity check on the constrained plan
12. Conformal calibration: derive P10/P50/P90 bands from held-out trajectories
13. Seed the Decision Stream; bind governance pipeline and AI·IO·ML posture per site
14. Executive briefing: BSC weights, guardrails, escalation tiers
The compounding advantage
Two things compound as a customer runs longer on Autonomy. The first is the tier agents' familiarity with your DAG: every season they see, every disruption they recover from, every supplier behaviour they internalise makes the next decision better. The second is the Operating Knowledge layer built from your planners' overrides and the outcomes that followed.
A competitor adopting the same platform would arrive at Phase 1: their execution agents warm-started from the same generic corpus, their tier agents trained on their DAG and seasons. Their agents would be competent on day one, just as yours were. But they would not have your years of seasonal cycles in their L3 model weights, and they would not have your planners' decade of override patterns shaping their execution agents. The platform is the same; the trained-on-your-business specificity is not.
Deeper read
The full architecture, the OODA / ORPA / Autonomy comparison in depth, the role of the LLM in the learning loop, and the per-period Learning Digest are all covered in the How Agents Learn whitepaper.
See how agents learn your business
Walk through the twin, the curriculum, and the calibration loop in a live demo.