

How Agents Learn

Autonomy agents don't ship with generic, pre-trained models. Each agent goes through a structured three-stage learning lifecycle tailored to your data, and each stage moves your organization further along the agentic inversion: from human-in-the-loop to human-on-the-loop to fully autonomous, human-out-of-the-loop operation. The result is agents that start competent, improve continuously, and progressively take ownership of decisions while humans shift from execution to governance.

"This is not automation (same tasks, faster). It's inversion: the structural shift in who performs economic work."

Jordi Visser, "The Agentic Inversion" (2026)

"A smaller model with sufficient data outperforms a larger model with insufficient data on rule learning and generalization."

Niklas Stöckl, "Watching a Language Model Learning Chess" (RANLP 2021)

The Learning Lifecycle

Three stages: study, practice, and continuous improvement

[Diagram: the three-stage learning lifecycle. Before go-live (Study, human in the loop): warm-start via Monte Carlo simulation, 128 runs × 52 weeks, 450,000+ scenarios; stochastic demand, lead time, throughput, quality, and transport; deterministic heuristics BASE_STOCK, CONSERVATIVE, PID, and EOQ. First 3-6 months (On the Job, human on the loop): on-the-job learning with decisions documented, outcomes measured, and feedback horizons of ATP 4h, buffer 24h, PO 7d; exceeds the human baseline. Ongoing, autonomous (Continuous, human out of the loop): continuous CDC relearning with data drift detection, automatic retraining every 6h when drift exceeds threshold, and a regression guard; zero maintenance. A CDC feedback loop connects the Context Engine (docs, talk, email) and planner overrides (accept, adjust, override).]

Stage 1: Study, Human in the Loop

Before an agent makes a single live decision, it studies your supply chain. At this stage agents are fully supervised, human in the loop. Planners retain full control while agents learn by watching, the same way a chess AI learns: by observing hundreds of thousands of expert games before playing its first tournament match.

Autonomy generates 450,000+ synthetic scenarios using Monte Carlo simulation (128 stochastic runs across 52 weeks). Every variable is randomized: demand, lead times, throughput, quality, and transportation capacity are drawn from distributions calibrated to your actual data. Against each stochastic scenario, four deterministic heuristic policies calculate the baseline response. The agents learn both the controls (what action to take) and the strategies (why that action works) by watching these expert heuristics respond to thousands of different conditions (a minimal sketch of the four policies follows the list):

BASE_STOCK

Order up to target inventory level. Simple, stable baseline.

CONSERVATIVE

4-period moving average. Smoothed ordering with low bullwhip effect.

PID

Proportional-integral-derivative on inventory error. Responsive to change.

EOQ

Economic order quantity with reorder point. Cost-optimized ordering.
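To make the four teacher policies concrete, here is a minimal Python sketch of each ordering rule applied to one stochastic demand scenario. The function names, parameters, and gain values are illustrative assumptions, not Autonomy's actual implementation.

```python
import numpy as np

def base_stock(inventory: float, target: float) -> float:
    """BASE_STOCK: order up to the target inventory level."""
    return max(target - inventory, 0.0)

def conservative(demand_history: np.ndarray) -> float:
    """CONSERVATIVE: order the 4-period moving average of demand."""
    return float(np.mean(demand_history[-4:]))

def pid(error: float, integral: float, prev_error: float,
        kp: float = 0.6, ki: float = 0.1, kd: float = 0.2) -> float:
    """PID: proportional-integral-derivative control on inventory error."""
    return max(kp * error + ki * integral + kd * (error - prev_error), 0.0)

def eoq_quantity(annual_demand: float, order_cost: float,
                 holding_cost: float) -> float:
    """EOQ: economic order quantity, placed when stock hits the reorder point."""
    return (2.0 * annual_demand * order_cost / holding_cost) ** 0.5

# One stochastic scenario: 52 weeks of demand from a calibrated distribution.
rng = np.random.default_rng(seed=7)
demand = rng.normal(loc=100.0, scale=15.0, size=52).clip(min=0.0)

print(conservative(demand))                        # smoothed order quantity
print(base_stock(inventory=340.0, target=400.0))   # order up to target
```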

Research has shown that the volume of scenarios an AI studies matters more than the complexity of its architecture. A smaller, efficient model that has observed hundreds of thousands of expert decisions will outperform a much larger model that has only seen a few thousand (Stöckl, RANLP 2021). This is why each Autonomy agent trains past the threshold where it memorizes patterns and into the regime where it internalizes the underlying decision rules.

"Like a chess program that recognizes common openings but makes illegal moves in novel positions, an agent trained on too little data will fail on unfamiliar situations. Our agents are trained past this threshold."

The warm-start produces agents that achieve 85-90% of optimal performance from day one — competent enough to handle routine decisions, but still bounded by what the heuristic teachers could demonstrate. Stage 2 takes them beyond this baseline.

Stage 2: On the Job, Human on the Loop

This is where the inversion begins. Agents start making decisions within guardrails, human on the loop. Planners shift from making every decision to inspecting and overriding agent decisions. Every decision becomes a data point for improvement.

Each decision type has its own feedback horizon (represented below as a simple lookup):

ATP decisions: 4 hours
Inventory buffers: 24 hours
Purchase orders: 7 days
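A hedged sketch of how these horizons might be represented, assuming a lookup keyed by decision type (the keys and the `FEEDBACK_HORIZON` name are illustrative):

```python
from datetime import timedelta

# Illustrative mapping of decision type to feedback horizon: the delay before
# a decision's outcome is measured and fed back as a reward signal.
FEEDBACK_HORIZON = {
    "atp": timedelta(hours=4),                # available-to-promise decisions
    "inventory_buffer": timedelta(hours=24),
    "purchase_order": timedelta(days=7),
}
```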

The agent receives a reward signal based on actual business outcomes (cost reduction, service-level improvement, inventory efficiency) rather than just whether it matched what a human would have done. Over time, the agent discovers patterns that consistently earn better outcomes than the historical baseline. These patterns are reinforced; poor patterns are weakened.

The Context Engine enriches this learning with real-world signals: documents uploaded by planners, natural language directives from leadership, and email alerts from suppliers. Executive directives shape the agent's priorities: a directive like "optimize for service level this quarter" shifts the reward weights accordingly.
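One way to picture this reward shaping is as a weighted sum over normalized outcome deltas; the component names and weight values below are illustrative assumptions, not Autonomy's actual reward function.

```python
def reward(outcome: dict[str, float], weights: dict[str, float]) -> float:
    """Business-outcome reward: a weighted sum of normalized deltas
    versus the historical baseline. Higher is better."""
    return sum(weights[k] * outcome[k] for k in weights)

# Default priorities (illustrative).
weights = {"cost_reduction": 0.4, "service_level_gain": 0.3,
           "inventory_efficiency": 0.3}

# "Optimize for service level this quarter" shifts the weights,
# not the agent's code.
weights = {"cost_reduction": 0.25, "service_level_gain": 0.5,
           "inventory_efficiency": 0.25}
```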

Planner overrides are not wasted. Each override is recorded with full context, and if overrides from a particular planner consistently lead to better outcomes, that planner's judgment receives higher weight in the next training cycle. The system tracks override effectiveness using Bayesian posterior updates, ensuring the agents continuously align with human expertise.
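The Bayesian posterior update can be pictured as a Beta-Bernoulli model per planner. Below is a minimal sketch under the assumption that each override is scored as improving the outcome or not; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class OverrideTrack:
    """Beta-Bernoulli posterior over 'this planner's overrides improve outcomes'."""
    alpha: float = 1.0  # prior pseudo-count of helpful overrides
    beta: float = 1.0   # prior pseudo-count of unhelpful overrides

    def update(self, improved_outcome: bool) -> None:
        """Fold in one measured override outcome."""
        if improved_outcome:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def weight(self) -> float:
        """Posterior mean: weights this planner's judgment in the next cycle."""
        return self.alpha / (self.alpha + self.beta)

track = OverrideTrack()
track.update(improved_outcome=True)
print(round(track.weight, 2))  # 0.67 after one helpful override
```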

This is what Knut Alicke calls building the experiential ontology: the behavioral knowledge about how your operations actually work in practice. For thirty years, experienced planners have been the missing semantic layer, interpreting exceptions, understanding supplier behavior, and making causal connections across domains. That knowledge lives in their heads, and when they retire, it leaves with them. Autonomy captures it systematically: every override, every coaching signal, and every directive builds the experiential layer that planning systems have always lacked.

"We know more than we can tell. GenAI provides the first technologically tractable mechanism to capture the experiential ontology, the behavioral knowledge that experienced planners carry, before it's lost."

Stage 3: Continuous Improvement, Human out of the Loop

The inversion completes. Agents operate autonomously, human out of the loop for routine decisions. Planners focus on governance, exception handling, and strategic judgment. The system monitors itself and retrains automatically when it detects that agent performance is drifting. No data science team required.

Operations change. Suppliers shift lead times, demand patterns evolve seasonally, new products are introduced. An agent trained on last year's data will gradually become less accurate. Autonomy detects this drift and adapts.

Data Drift Detection

The CDC monitor watches seven metrics in real time. When any threshold is breached, retraining is triggered (a minimal check is sketched after the list):

Demand deviation: ±15% from forecast
Inventory low: <70% of safety stock
Inventory high: >150% of target
Service level drop: >5% below target
Lead time increase: +30% vs. baseline
Backlog growth: 2+ consecutive days
Supplier reliability: <80% on-time rate
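A minimal sketch of such threshold checks, assuming each metric arrives as a normalized ratio or delta; the metric keys and the `DRIFT_RULES` structure are illustrative, not the CDC monitor's real schema.

```python
# Each rule returns True when its drift threshold is breached.
DRIFT_RULES = {
    "demand_deviation":     lambda m: abs(m["demand_vs_forecast"]) > 0.15,
    "inventory_low":        lambda m: m["inventory_vs_safety_stock"] < 0.70,
    "inventory_high":       lambda m: m["inventory_vs_target"] > 1.50,
    "service_level_drop":   lambda m: m["service_vs_target"] < -0.05,
    "lead_time_increase":   lambda m: m["lead_time_vs_baseline"] > 0.30,
    "backlog_growth":       lambda m: m["backlog_growth_days"] >= 2,
    "supplier_reliability": lambda m: m["on_time_rate"] < 0.80,
}

def breached(metrics: dict[str, float]) -> list[str]:
    """Names of all breached rules; any breach triggers retraining."""
    return [name for name, rule in DRIFT_RULES.items() if rule(metrics)]
```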

Regression Guard

Retraining is not blind. Every new model checkpoint is compared against the current production model. If the new model regresses (performs worse on validation data), it is discarded automatically. Only improvements are deployed.
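The guard reduces to a single comparison on held-out data. Here is a sketch assuming an `evaluate` scoring function where higher is better; both names are placeholders, not the platform's API.

```python
from typing import Any, Callable

def promote_if_better(candidate: Any, production: Any, validation_set: Any,
                      evaluate: Callable[[Any, Any], float]) -> Any:
    """Keep whichever model scores better on held-out validation data.
    A regressing checkpoint is discarded; only improvements deploy."""
    if evaluate(candidate, validation_set) > evaluate(production, validation_set):
        return candidate  # deploy the retrained checkpoint
    return production     # discard the regression, keep serving production
```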

Escalation

If drift persists across three or more retraining cycles without improvement, the system escalates to a higher tier. An execution-level agent cannot fix a structural change such as a supplier permanently doubling lead times; that requires the tactical or strategic planning agents to re-optimize policy parameters. The escalation happens automatically.
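As a sketch, the escalation trigger is just a window over recent retraining outcomes; the function below is hypothetical.

```python
def should_escalate(cycle_improved: list[bool]) -> bool:
    """Escalate to the tactical/strategic tier when the last three or more
    retraining cycles all failed to resolve the drift."""
    return len(cycle_improved) >= 3 and not any(cycle_improved[-3:])

print(should_escalate([True, False, False, False]))  # True: structural change
```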

14-Step Provisioning Pipeline

Every AI tier is bootstrapped before receiving directives or making decisions

Tier 1, Foundation
  1. Monte Carlo simulation with deterministic heuristics
Tier 2, Strategic
  2. Network planning agent
  3. Policy parameter optimization
  4. Demand forecasting
Tier 3, Operational
  5. Demand planning agent
  6. Supply planning agent
  7. Inventory optimization
  8. Execution role agent training
  9. Supply plan generation
  10. Rough-cut capacity check
Tier 4, Activation
  11. Decision Stream seeding
  12. Site agent training
  13. Uncertainty calibration
  14. Executive briefing
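The strict ordering, with each tier fully bootstrapped before the next, can be pictured as a simple ordered configuration. The structure below is purely illustrative; the step names mirror the list above.

```python
PROVISIONING_PIPELINE = [
    ("Tier 1, Foundation",  ["Monte Carlo simulation with deterministic heuristics"]),
    ("Tier 2, Strategic",   ["Network planning agent",
                             "Policy parameter optimization",
                             "Demand forecasting"]),
    ("Tier 3, Operational", ["Demand planning agent",
                             "Supply planning agent",
                             "Inventory optimization",
                             "Execution role agent training",
                             "Supply plan generation",
                             "Rough-cut capacity check"]),
    ("Tier 4, Activation",  ["Decision Stream seeding",
                             "Site agent training",
                             "Uncertainty calibration",
                             "Executive briefing"]),
]

step = 0
for tier, steps in PROVISIONING_PIPELINE:
    for name in steps:  # a tier's steps complete before the next tier starts
        step += 1
        print(f"[{step:2d}/14] {tier}: {name}")
```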

The Compounding Advantage

Each stage builds on the previous one, and each stage deepens the agentic inversion. After warm-start, agents match historical performance and planners retain full control. After on-the-job learning, agents exceed human baseline and planners shift to governance. After continuous improvement, agents operate autonomously while staying current, capturing planner expertise as overrides and becoming increasingly specific to your supply chain.

The longer the system runs, the harder this advantage is to replicate. A competitor deploying the same platform would start at Stage 1. Your agents would already be at Stage 3, trained on years of your specific dynamics and your planners' expertise. This is what makes the learning flywheel a durable competitive advantage, not just a one-time efficiency gain.

Study (human in the loop) → On the Job (human on the loop) → Continuous (human out of the loop)

See How Agents Learn Your Business

Watch agents train on your data in a live demo.