FRAMEWORM AGENT¶
Autonomous training monitor. Watches your loss curves and gradient norms in real time, detects anomalies, and intervenes automatically — without you watching.
Quick Start¶
# Start the agent alongside your training run
python -m frameworm agent start \
--config configs/models/vae/vae.yaml \
--model VAE \
--total-steps 50000
What It Detects¶
| Anomaly | Detection Method | Agent Action |
|---|---|---|
| Gradient explosion | Grad norm > threshold or loss > 500 | Cut LR by 95%, reinit model if dead |
| Loss divergence | Rolling window: recent mean > 1.25x earlier mean | Reduce LR by 85% |
| Training plateau | Near-zero variance at elevated loss | Full model reinit + reset to healthy LR |
| Oscillation | Coefficient of variation > threshold | Reduce LR by 90% |
| Loss spike | Statistical outlier in window | Watch mode |
Benchmark Results¶
10 experiments across 4 failure conditions, each paired with a shadow run (no agent) as counterfactual comparison. Run with a VAE on synthetic data, CPU only.
# Experiment Condition Agent Loss Shadow Loss D Loss Detected Intervened Resolved
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Baseline-1 baseline 70.39 70.37 -0.02 0 - -
2 Baseline-2 baseline 72.61 70.35 -2.27 0 - -
3 Baseline-3 baseline 72.48 70.39 -2.09 0 - -
4 HighLR-1 lr=8e-02 75.71 9999.00 +9923.29 3 4 YES
5 HighLR-2 lr=5e-02 69.99 70.37 +0.37 3 3 YES
6 GradExplosion-1 grad_explosion 75.93 70.35 -5.58 4 5 YES
7 GradExplosion-2 grad_explosion 76.07 70.35 -5.72 5 6 YES
8 LowLR-Plateau-1 lr=1e-06 94.65 283.47 +188.82 16 16 NO*
9 LowLR-Plateau-2 lr=5e-07 95.31 285.57 +190.26 16 16 NO*
10 AgentIntervenes lr=1e-01 75.84 9999.00 +9923.16 4 5 YES
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Baseline avg loss: 71.83
Runs with anomalies detected: 7 / 10
Anomalies resolved by agent: 5 / 7
Resolution rate: 71.4%
Mean loss delta (agent-shadow): +2021.02
False positives on clean runs: 0 / 3
What these numbers prove:
- 0 false positives on clean baselines — the agent stays completely silent on healthy training
- Experiments 4 and 10: agent loss ~75 vs shadow loss 9999 — runs that would have fully diverged were saved entirely by the agent
- Experiments 6 and 7: gradient explosion injected at epoch 4, detected and recovered, final loss within 8% of healthy baseline
- Experiments 8 and 9 (NO*): plateau detected 16 times each, agent reduced loss from ~285 to ~95 (+190 improvement). Full convergence within the epoch budget was not achieved — this is a known limitation. Persistent plateaus may require full reinitialisation rather than LR adjustment alone, and is a direction for future work.
Architecture¶
Training Loop FRAMEWORM AGENT
───────────────── ─────────────────────────────
trainer._train_step() ──────────> SignalWindow (rolling observer)
trainer._agent_plugin <────────── │
classify() → anomaly type
│
suggest_action()
│
┌────┴────────────────┐
│ ADJUST_LR │ REINIT
│ scale optimizer │ reinit weights
│ param groups │ reset optimizer
└────────────────────-┘
│
ExperienceBuffer → SQLite DB
Configuration¶
All thresholds in configs/agent_thresholds.yaml:
agent:
grad_explosion_threshold: 10.0
loss_spike_z_score: 3.0
plateau_min_steps: 100
cooldown_steps: 200
healthy_loss_ceiling: 95.0
plateau_loss_floor: 80.0
Checking Agent Status¶
Running the Benchmark¶
# Full 10-experiment benchmark
python run_10_experiments.py
# Results saved to:
# experiments/agent_benchmark/results_table.txt
# experiments/agent_benchmark/experiment_log.jsonl
# experiments/agent_benchmark/benchmark.db
The 4-Line Training Hook¶
In training/trainer.py, at the end of your _train_step():
# FRAMEWORM AGENT hook
if hasattr(self, '_agent_plugin') and self._agent_plugin is not None:
if self.global_step % 50 == 0:
self._agent_plugin.check_commands()
That is the only change to existing training code.
Test Suite¶
pytest tests/agent/ -v
pytest tests/agent/test_classifier.py -v
pytest tests/agent/test_counterfactual.py -v
pytest tests/agent/test_policy.py -v
Research Usage¶
For counterfactual evaluation (shadow run comparison):
from agent.counterfactual.delta_tracker import DeltaTracker
from agent.counterfactual.eval_report import EvalReportGenerator
tracker = DeltaTracker()
gen = EvalReportGenerator(tracker)
report = gen.generate()
print(report.to_markdown_table())
For the POMDP formalisation: