How Agents Learn
Autonomy agents don't ship with generic, pre-trained models. Each agent goes through a structured three-stage learning lifecycle tailored to your data, and each stage moves your organization further along the agentic inversion: from human-in-the-loop to human-on-the-loop to human-out-of-the-loop operation. The result is agents that start competent, improve continuously, and progressively take ownership of decisions while humans shift from execution to governance.
"This is not automation (same tasks, faster). It's inversion: the structural shift in who performs economic work."
"A smaller model with sufficient data outperforms a larger model with insufficient data on rule learning and generalization."
The Learning Lifecycle
Three stages: study, on-the-job learning, and continuous improvement
Stage 1: Study, Human in the Loop
Before an agent makes a single live decision, it studies your supply chain. At this stage agents are fully supervised, human in the loop. Planners retain full control while agents learn by watching, the same way a chess AI learns: by observing hundreds of thousands of expert games before playing its first tournament match.
Autonomy generates 450,000+ synthetic scenarios using Monte Carlo simulation (128 stochastic runs across 52 weeks). Every variable (demand, lead times, throughput, quality, transportation capacity) is randomized, drawn from distributions calibrated to your actual data. Against each stochastic scenario, four deterministic heuristic policies calculate the baseline response. The agents learn both the controls (what action to take) and the strategies (why that action works) by watching these expert heuristics respond to thousands of different conditions; a sketch of the four policies follows the list:
Order up to target inventory level. Simple, stable baseline.
4-period moving average. Smoothed ordering with low bullwhip effect.
Proportional-integral-derivative on inventory error. Responsive to change.
Economic order quantity with reorder point. Cost-optimized ordering.
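To make the four teacher policies concrete, here is a minimal single-echelon sketch of each, plus a toy stochastic rollout. The function names, parameter values, and the simplified inventory balance are illustrative assumptions, not Autonomy's production implementation.

```python
import numpy as np

def order_up_to(target, inventory_position):
    """Policy 1: order up to a target inventory level."""
    return max(0.0, target - inventory_position)

def moving_average_order(demand_history, inventory_position, safety_stock):
    """Policy 2: order the 4-period moving average of demand, plus top-up."""
    avg = float(np.mean(demand_history[-4:]))
    return max(0.0, avg + safety_stock - inventory_position)

def pid_order(error, error_sum, prev_error, kp=0.6, ki=0.05, kd=0.2):
    """Policy 3: PID controller on inventory error (target minus on-hand)."""
    return max(0.0, kp * error + ki * error_sum + kd * (error - prev_error))

def eoq_order(annual_demand, order_cost, holding_cost, reorder_point,
              inventory_position):
    """Policy 4: order the economic order quantity at the reorder point."""
    if inventory_position > reorder_point:
        return 0.0
    return float(np.sqrt(2.0 * annual_demand * order_cost / holding_cost))

# Toy stochastic scenario: 52 weeks of randomized demand, with the
# order-up-to teacher responding week by week.
rng = np.random.default_rng(42)
inventory = 120.0
for week in range(52):
    # A fuller simulator would also randomize lead times, throughput,
    # quality, and transportation capacity from calibrated distributions.
    demand = rng.normal(100.0, 20.0)
    order = order_up_to(target=200.0, inventory_position=inventory)
    inventory = inventory - demand + order  # simplified balance equation
    # Each (state, order) pair becomes a training example for the agents.
```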
Research has shown that the volume of scenarios an AI studies matters more than the complexity of its architecture. A smaller, efficient model that has observed hundreds of thousands of expert decisions will outperform a much larger model that has only seen a few thousand (Stöckl, RANLP 2021). This is why each Autonomy agent trains past the threshold where it memorizes patterns and into the regime where it internalizes the underlying decision rules.
"Like a chess program that recognizes common openings but makes illegal moves in novel positions, an agent trained on too little data will fail on unfamiliar situations. Our agents are trained past this threshold."
The warm-start produces agents that achieve 85-90% of optimal performance from day one — competent enough to handle routine decisions, but still bounded by what the heuristic teachers could demonstrate. Stage 2 takes them beyond this baseline.
Stage 2: On the Job, Human on the Loop
This is where the inversion begins. Agents start making decisions within guardrails, human on the loop. Planners shift from making every decision to inspecting and overriding agent decisions. Every decision becomes a data point for improvement.
Each decision type has its own feedback horizon: the reward for a decision arrives only once its business outcome can be observed.
The agent receives a reward signal based on actual business outcomes (cost reduction, service-level improvement, inventory efficiency) rather than just whether it matched what a human would have done. Over time, the agent discovers patterns that consistently earn better outcomes than the historical baseline. These patterns are reinforced; poor patterns are weakened.
The Context Engine enriches this learning with real-world signals: documents uploaded by planners, natural language directives from leadership, and email alerts from suppliers. Executive directives shape the agent's priorities: "optimize for service level this quarter" shifts the reward weights accordingly.
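A minimal sketch of how outcome-based rewards with directive-driven weights might be computed. The metric names, baseline values, and weightings are illustrative assumptions, not Autonomy's actual reward function.

```python
# Historical baseline and the directional sense of each metric
# (-1 means lower is better). All values are illustrative.
BASELINE = {"cost": 1_000_000.0, "service_level": 0.92, "inventory_turns": 8.0}
DIRECTION = {"cost": -1.0, "service_level": 1.0, "inventory_turns": 1.0}

def reward(outcome: dict, weights: dict) -> float:
    """Score a realized business outcome against the historical baseline."""
    total = 0.0
    for metric, weight in weights.items():
        # Relative improvement over baseline, signed by metric direction.
        delta = (outcome[metric] - BASELINE[metric]) / abs(BASELINE[metric])
        total += weight * DIRECTION[metric] * delta
    return total

# Default priorities, and the shift after a directive like
# "optimize for service level this quarter".
default_weights = {"cost": 0.4, "service_level": 0.4, "inventory_turns": 0.2}
service_focused = {"cost": 0.2, "service_level": 0.6, "inventory_turns": 0.2}

outcome = {"cost": 950_000.0, "service_level": 0.95, "inventory_turns": 8.5}
print(reward(outcome, default_weights))  # positive: beat the baseline
print(reward(outcome, service_focused))  # service gains now dominate the score
```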
Planner overrides are not wasted. Each override is recorded with full context, and if overrides from a particular planner consistently lead to better outcomes, that planner's judgment receives higher weight in the next training cycle. The system tracks override effectiveness using Bayesian posterior updates, ensuring the agents continuously align with human expertise.
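The Bayesian bookkeeping can be as simple as a Beta-Bernoulli posterior per planner. This sketch assumes a binary outcome per override (did it improve the result or not); the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class OverrideTracker:
    """Per-planner Beta-Bernoulli posterior over override effectiveness."""
    alpha: float = 1.0  # prior pseudo-count of overrides that improved outcomes
    beta: float = 1.0   # prior pseudo-count of overrides that did not

    def update(self, improved: bool) -> None:
        """Posterior update once the override's outcome is observed."""
        if improved:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def effectiveness(self) -> float:
        """Posterior mean probability that this planner's overrides help."""
        return self.alpha / (self.alpha + self.beta)

# Three of four overrides improved outcomes: this planner's judgment
# would earn more weight in the next training cycle.
tracker = OverrideTracker()
for improved in (True, True, False, True):
    tracker.update(improved)
print(round(tracker.effectiveness, 2))  # 0.67
```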
This is what Knut Alicke calls building the experiential ontology: the behavioral knowledge about how your operations actually work in practice. For thirty years, experienced planners have been the missing semantic layer: interpreting exceptions, understanding supplier behavior, making causal connections across domains. That knowledge lives in their heads, and when they retire, it leaves with them. Autonomy captures it systematically: every override, every coaching signal, and every directive builds the experiential layer that planning systems have always lacked.
"We know more than we can tell. GenAI provides the first technologically tractable mechanism to capture the experiential ontology, the behavioral knowledge that experienced planners carry, before it's lost."
Stage 3: Continuous Improvement, Human out of the Loop
The inversion completes. Agents operate autonomously, human out of the loop for routine decisions. Planners focus on governance, exception handling, and strategic judgment. The system monitors itself and retrains automatically when it detects that agent performance is drifting. No data science team required.
Operations change. Suppliers shift lead times, demand patterns evolve seasonally, new products are introduced. An agent trained on last year's data will gradually become less accurate. Autonomy detects this drift and adapts.
Data Drift Detection
The CDC monitor watches seven metrics in real time. When any threshold is breached, retraining is triggered.
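The seven monitored metrics are specific to the product, but one common drift check of this kind is the Population Stability Index (PSI) on a feature such as supplier lead time. A minimal sketch, assuming the conventional 0.2 alert threshold; the feature choice and threshold are illustrative.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live data."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    expected = np.clip(expected, edges[0], edges[-1])
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # floor empty bins to avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_lead_times = rng.gamma(4.0, 2.0, 5_000)  # distribution at training time
live_lead_times = rng.gamma(4.0, 3.0, 500)     # supplier lead times have drifted

if psi(train_lead_times, live_lead_times) > 0.2:  # common PSI alert level
    print("Lead-time drift detected: trigger retraining")
```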
Regression Guard
Retraining is not blind. Every new model checkpoint is compared against the current production model. If the new model regresses (performs worse on validation data), it is discarded automatically. Only improvements are deployed.
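In code terms the guard is a champion/challenger comparison. This sketch assumes an `evaluate` callable returning a scalar validation score where higher is better; all names are illustrative.

```python
def promote_if_better(candidate, production, validation_data, evaluate):
    """Deploy the candidate checkpoint only if it beats the production model."""
    candidate_score = evaluate(candidate, validation_data)
    production_score = evaluate(production, validation_data)
    if candidate_score <= production_score:
        return production  # regression detected: discard the candidate
    return candidate       # improvement: promote to production
```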
Escalation
If drift persists across three or more retraining cycles without improvement, the system escalates to a higher tier. An execution-level agent cannot fix a structural change — like a supplier permanently doubling lead times. That requires the tactical or strategic planning agents to re-optimize policy parameters. The escalation happens automatically.
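A minimal sketch of that escalation rule, assuming the monitor records whether each retraining cycle left the drift unresolved; the tier names are illustrative.

```python
def escalation_tier(drift_unresolved: list[bool], patience: int = 3) -> str:
    """Escalate when drift persists across `patience` retraining cycles."""
    recent = drift_unresolved[-patience:]
    if len(recent) == patience and all(recent):
        # Execution-level retraining can't fix a structural change;
        # hand off to tactical/strategic agents to re-optimize policy.
        return "tactical"
    return "execution"
```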
14-Step Provisioning Pipeline
Every AI tier is bootstrapped before receiving directives or making decisions
1. Monte Carlo simulation with deterministic heuristics
2. Network planning agent
3. Policy parameter optimization
4. Demand forecasting
5. Demand planning agent
6. Supply planning agent
7. Inventory optimization
8. Execution role agent training
9. Supply plan generation
10. Rough-cut capacity check
11. Decision Stream seeding
12. Site agent training
13. Uncertainty calibration
14. Executive briefing
The Compounding Advantage
Each stage builds on the previous one, and each stage deepens the agentic inversion. After warm-start, agents match historical performance and planners retain full control. After on-the-job learning, agents exceed human baseline and planners shift to governance. After continuous improvement, agents operate autonomously while staying current, capturing planner expertise as overrides and becoming increasingly specific to your supply chain.
The longer the system runs, the harder this advantage is to replicate. A competitor deploying the same platform would start at Stage 1. Your agents would already be at Stage 3, trained on years of your specific dynamics and your planners' expertise. This is what makes the learning flywheel a durable competitive advantage, not just a one-time efficiency gain.
See How Agents Learn Your Business
Watch agents train on your data in a live demo.