# Autoresearch for LWIR Super-Resolution
Karpathy's autoresearch is an autonomous experiment loop where an LLM agent iteratively modifies training code, runs experiments against a fixed time budget, and keeps only improvements. This article covers how to adapt the pattern for LWIR super-resolution — something no one has published before.
## The 3-File Architecture
The system enforces a strict separation of concerns across three files:
- `prepare.py` — Immutable. Data loading, evaluation metric computation, constants. The agent never touches this file. This is the anti-cheating foundation: the agent cannot game the metric because it cannot modify how the metric is computed.
- `train.py` — The single mutable file. Model definition, optimizer, training loop, loss function. Everything is fair game for the agent to modify.
- `program.md` — Agent instructions. Rules, metric definition, what the agent can and cannot do, hardware constraints (VRAM budget).
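A minimal sketch of what the immutable `prepare.py` might contain, assuming hypothetical function names (`psnr`, `load_patches`); the point is that the metric is defined here, outside the agent's reach:

```python
# Hypothetical sketch of the immutable prepare.py. The agent may only edit
# train.py, so it cannot redefine how this metric is computed.
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray) -> float:
    """PSNR on [0,1] images; clamping here closes one cheating avenue."""
    pred = np.clip(pred, 0.0, 1.0)
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(1.0 / mse)

def load_patches(split: str):
    """Placeholder: real code would load fixed LWIR training/validation tiles."""
    raise NotImplementedError
```

Because `train.py` imports the metric from here, every experiment is scored by byte-identical evaluation code.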
## The Experiment Loop
1. Inspect git state (see what the current code looks like).
2. Modify `train.py` with an experimental idea, then `git commit`.
3. Run training with a fixed time budget.
4. Read results (grep for the metric value).
5. If improved: keep the commit. If worse: `git reset`.
6. Loop forever.
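The loop above can be sketched as a small harness; `extract_metric` and the `val_psnr=` log format are assumptions for illustration, not the actual autoresearch code:

```python
# Minimal sketch of one ratchet iteration. Helper names and the log format
# are illustrative assumptions; real harness details differ.
import re
import subprocess

def run(cmd: str) -> str:
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout

def extract_metric(log: str) -> float:
    """Grep the training log for a line like 'val_psnr=31.42'."""
    match = re.search(r"val_psnr=([0-9.]+)", log)
    return float(match.group(1)) if match else float("-inf")

def ratchet_step(best_psnr: float, time_budget_s: int = 600) -> float:
    run("git commit -am 'experiment'")                     # snapshot the agent's edit
    log = run(f"timeout {time_budget_s} python train.py")  # fixed time budget
    score = extract_metric(log)
    if score > best_psnr:                                  # higher PSNR wins here
        return score                                       # keep the commit
    run("git reset --hard HEAD~1")                         # revert the failure
    return best_psnr
```

The `git reset --hard HEAD~1` on failure is what makes the branch move only forward.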
Git ratcheting: The branch can only move forward. Each successful experiment is a permanent commit. Each failure is reverted. The git log becomes a linear record of validated improvements. This is a powerful property — you can read the git history to understand what worked.
## The 10-Minute Budget
The original autoresearch uses a 5-minute budget. SR training is slower than language model training, so we need more time per experiment.
Recommendation: 10 minutes (`TIME_BUDGET=600`).
With 10K patches and batch size 16, this gives ~625 steps/epoch and 5-10 epochs per experiment — enough signal for relative comparison between architectures. Throughput: ~5.5 experiments/hour, ~44 overnight on an RTX 5080.
If 10 minutes is insufficient for meaningful signal, extend to 15 minutes (~32 experiments overnight). Fewer experiments but each is more reliable.
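The throughput figures above can be sanity-checked with simple arithmetic; the ~55 s per-experiment overhead (agent turnaround, evaluation, git operations) is an assumed value chosen to make the numbers line up, not a measured one:

```python
# Back-of-envelope check of the quoted throughput numbers.
def steps_per_epoch(n_patches: int, batch_size: int) -> int:
    return n_patches // batch_size

def experiments_per_night(budget_s: int, overhead_s: int = 55,
                          hours: float = 8.0) -> float:
    """Experiments fitting in one night, assuming ~55 s overhead per experiment."""
    return hours * 3600 / (budget_s + overhead_s)
```

With 10K patches and batch size 16, `steps_per_epoch` gives 625; with a 600 s budget, `experiments_per_night` lands near 44.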
## Metric Choice
Primary gate: val_psnr (higher is better). Universal, easy to compute, sensitive to small improvements. The keep/discard logic is inverted from the original autoresearch (which minimized val_bpb): keep if higher, not lower.
Secondary (logged but not gated): val_ssim. If PSNR goes up but SSIM drops, the agent may be finding a blurring trick that exploits PSNR. Investigate but do not automatically reject.
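The inverted gate and the SSIM sanity check can be expressed in a few lines; these function names are illustrative, not taken from the autoresearch codebase:

```python
# Sketch of the inverted keep/discard gate: the original autoresearch
# minimized val_bpb, here a HIGHER val_psnr wins. Names are illustrative.
def should_keep(val_psnr: float, best_psnr: float) -> bool:
    return val_psnr > best_psnr  # inverted: keep if higher, not lower

def flag_suspicious(delta_psnr: float, delta_ssim: float) -> bool:
    """Log-only check: PSNR up while SSIM down hints at a blurring exploit."""
    return delta_psnr > 0 and delta_ssim < 0
```

A flagged experiment is still kept by the gate; the flag is a prompt for human review, matching the "investigate but do not automatically reject" policy.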
Note: for full evaluation of final models, see Evaluation Strategy for the complete metric stack including LR-consistency checks and sharpness gain. The autoresearch metric is deliberately simple for fast iteration.
## Anti-Cheating Measures
Autoresearch agents are surprisingly creative at gaming metrics. Five safeguards:
1. Evaluation lives in `prepare.py` — the immutable file. The agent literally cannot modify how the metric is computed.
2. Bicubic baseline is established — the model must beat bicubic interpolation or the experiment is a failure.
3. Output clamped to [0,1] in evaluation — prevents the agent from exploiting unbounded outputs.
4. Validation patches come from different scenes than training — no data leakage possible.
5. Random seed pinned for evaluation only — reproducible comparisons between experiments while allowing training stochasticity.
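Safeguards 3 and 5 are the easiest to get subtly wrong, so here is one way to implement them; `EVAL_SEED` and the helper names are assumptions for this sketch:

```python
# Sketch of safeguards 3 and 5: clamp before scoring, and pin a local RNG
# for evaluation only, leaving training stochasticity untouched.
import numpy as np

EVAL_SEED = 1234  # assumed constant living in prepare.py

def eval_batch_indices(n_val: int, batch: int) -> np.ndarray:
    """Same validation subset every experiment, so scores are comparable."""
    rng = np.random.default_rng(EVAL_SEED)  # local generator, not the global seed
    return rng.choice(n_val, size=batch, replace=False)

def score(pred: np.ndarray, target: np.ndarray) -> float:
    pred = np.clip(pred, 0.0, 1.0)  # unbounded outputs cannot inflate PSNR
    mse = float(np.mean((pred - target) ** 2))
    return 10.0 * np.log10(1.0 / mse) if mse > 0 else float("inf")
```

Using a local `Generator` rather than seeding the global RNG keeps the training loop's randomness independent of evaluation.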
See also: Cerebras: How to stop your autoresearch loop from cheating for general anti-cheating principles.
## What the Agent Will Likely Try
Based on published autoresearch results and SR domain knowledge:
Early wins (experiments 1-20): Loss function changes (L1 to Charbonnier to mixed), optimizer tuning (learning rate, scheduler), basic architecture changes (more/fewer residual blocks, wider channels), batch size adjustments.
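Of the early loss-function moves, Charbonnier is the canonical one; a NumPy sketch follows (a real `train.py` would use the torch equivalent):

```python
# Charbonnier loss, a smooth L1 variant agents commonly reach early.
# NumPy sketch; epsilon value is a typical choice, not prescribed.
import numpy as np

def charbonnier(pred: np.ndarray, target: np.ndarray,
                eps: float = 1e-3) -> float:
    """sqrt(diff^2 + eps^2): behaves like L2 near zero, like L1 for large errors."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps * eps)))
```

Unlike plain L1, the loss is differentiable at zero error, which tends to stabilize SR training.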
Mid-run (20-50): Attention mechanisms, different upsampling strategies (PixelShuffle vs transposed conv vs interpolation), progressive upsampling, mixed precision training.
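PixelShuffle (depth-to-space) is worth making concrete, since it is the upsampling choice most SR backbones settle on; this NumPy sketch mirrors the `torch.nn.PixelShuffle` channel layout:

```python
# Depth-to-space ("PixelShuffle") upsampling: trades r*r channels for an
# r-times larger spatial grid, with no learned transposed convolution.
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """x: (C*r*r, H, W) -> (C, H*r, W*r), matching torch.nn.PixelShuffle layout."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split the two scale factors out of channels
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

Because it is a pure rearrangement, it avoids the checkerboard artifacts transposed convolutions can introduce.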
Later (50+): Transformer blocks, RRDB modules, multi-scale feature extraction, frequency-domain loss branches.
## The Phased Approach
The scope of MFSR (alignment + fusion + reconstruction) is too large for a single `train.py`. The recommended approach is phased:
Phase 1: Single-frame SR autoresearch on existing tile data.
- Proves the autoresearch pattern works on LWIR
- Discovers the optimal backbone architecture (encoder-decoder, residual blocks, attention type)
- Uses existing training tiles — no new data collection needed
- ~44 experiments overnight
Phase 2: Manual MFSR integration.
- Take the winning single-frame architecture from Phase 1
- Manually integrate it into the multi-frame pipeline with alignment and temporal fusion
- This is too complex for the agent to discover from scratch in a single `train.py`
Phase 3: Autoresearch on fusion/loss components.
- Once the MFSR pipeline is working, freeze the architecture
- Run autoresearch again to optimize fusion strategy and loss function weights
- Narrower scope = more productive agent iterations
## Critical Gotchas
- **Dataset size matters.** Fewer than 1K patches produces noisy metrics. Use 5K+ training patches and 500+ validation patches.
- **Normalize consistently.** The 16-bit to float32 [0,1] conversion must happen in `prepare.py` only.
- **Watch VRAM.** The agent may try a 50M-parameter model that OOMs. State the VRAM budget explicitly in `program.md`.
- **Results are not portable.** Winning architectures on your specific GPU, dataset, and time budget may differ at production scale. Use autoresearch to discover promising directions, then validate with full-length training.
- **Do not freeze components across rounds.** Co-optimize in a single run rather than sequentially freezing pieces, which risks converging to local optima.
## Results from Others
| Project | Experiments | Kept | Result |
|---|---|---|---|
| Karpathy (GPT-2) | 276 | 29 | 11% training speedup on already-optimized code |
| Shopify (Tobi Lütke) | 37 | — | 0.8B model scored 19% higher than manually-tuned 1.6B |
| Shopify Liquid | ~120 | 93 | 53% faster parse+render, 61% fewer allocations |
| AutoKernel | — | — | 18 TFLOPS to 187 TFLOPS |
| Vesuvius Challenge | continuous | — | Cross-scroll generalization nearly doubled |
No existing SR fork of autoresearch has been published. We would be the first.
## LLM Backend
Use Claude Sonnet via the API. At roughly $0.10 per experiment, a 100-experiment overnight run costs about $10. Use Opus for more creative architectural changes if budget allows. The GPU should be 100% dedicated to training — do not run a local LLM simultaneously.
## Comparison: Autoresearch vs AI Scientist v2
| | Autoresearch | AI Scientist v2 |
|---|---|---|
| Scope | Single metric optimization | Full research pipeline to paper |
| Output | Optimized code + results | Scientific manuscript |
| Cost | $5-20 for 100 experiments | $25-35 per pipeline |
| Our use case | Optimize a specific model | Not needed |
For our purposes, autoresearch is the right tool. We want to optimize, not write papers.