# Autoresearch for LWIR Super-Resolution
Karpathy's autoresearch is an autonomous experiment loop where an LLM agent iteratively modifies training code, runs experiments against a fixed time budget, and keeps only improvements. This article covers how to adapt the pattern for LWIR super-resolution — something no one has published before.
## The 3-File Architecture
The system enforces a strict separation of concerns across three files:
- `prepare.py` — Immutable. Data loading, evaluation metric computation, constants. The agent never touches this file. This is the anti-cheating foundation: the agent cannot game the metric because it cannot modify how the metric is computed.
- `train.py` — The single mutable file. Model definition, optimizer, training loop, loss function. Everything is fair game for the agent to modify.
- `program.md` — Agent instructions. Rules, metric definition, what the agent can and cannot do, hardware constraints (VRAM budget).
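A minimal sketch of what the immutable `prepare.py` might contain, assuming hypothetical function names (`psnr`, `load_patches`); the point is that the metric is defined here, outside the agent's reach:

```python
# Hypothetical sketch of the immutable prepare.py. The agent may only edit
# train.py, so it cannot redefine how this metric is computed.
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray) -> float:
    """PSNR on [0,1] images; clamping here closes one cheating avenue."""
    pred = np.clip(pred, 0.0, 1.0)
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(1.0 / mse)

def load_patches(split: str):
    """Placeholder: real code would load fixed LWIR training/validation tiles."""
    raise NotImplementedError
```

Because `train.py` imports the metric from here, every experiment is scored by byte-identical evaluation code.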
## The Experiment Loop
1. Inspect git state (see what the current code looks like).
2. Modify `train.py` with an experimental idea, then `git commit`.
3. Run training with a fixed time budget.
4. Read results (grep for the metric value).
5. If improved: keep the commit. If worse: `git reset`.
6. Loop forever.
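The loop above can be sketched as a small harness; `extract_metric` and the `val_psnr=` log format are assumptions for illustration, not the actual autoresearch code:

```python
# Minimal sketch of one ratchet iteration. Helper names and the log format
# are illustrative assumptions; real harness details differ.
import re
import subprocess

def run(cmd: str) -> str:
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout

def extract_metric(log: str) -> float:
    """Grep the training log for a line like 'val_psnr=31.42'."""
    match = re.search(r"val_psnr=([0-9.]+)", log)
    return float(match.group(1)) if match else float("-inf")

def ratchet_step(best_psnr: float, time_budget_s: int = 600) -> float:
    run("git commit -am 'experiment'")                     # snapshot the agent's edit
    log = run(f"timeout {time_budget_s} python train.py")  # fixed time budget
    score = extract_metric(log)
    if score > best_psnr:                                  # higher PSNR wins here
        return score                                       # keep the commit
    run("git reset --hard HEAD~1")                         # revert the failure
    return best_psnr
```

The `git reset --hard HEAD~1` on failure is what makes the branch move only forward.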
Git ratcheting: The branch can only move forward. Each successful experiment is a permanent commit. Each failure is reverted. The git log becomes a linear record of validated improvements. This is a powerful property — you can read the git history to understand what worked.
## The 10-Minute Budget
The original autoresearch uses a 5-minute budget. SR training is slower than language model training, so we need more time per experiment.
Recommendation: 10 minutes (`TIME_BUDGET=600`).
With 10K patches and batch size 16, this gives ~625 steps/epoch and 5-10 epochs per experiment — enough signal for relative comparison between architectures. Throughput: ~5.5 experiments/hour, ~44 overnight on an RTX 5080.
If 10 minutes is insufficient for meaningful signal, extend to 15 minutes (~32 experiments overnight). Fewer experiments but each is more reliable.
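The throughput figures above can be sanity-checked with simple arithmetic; the ~55 s per-experiment overhead (agent turnaround, evaluation, git operations) is an assumed value chosen to make the numbers line up, not a measured one:

```python
# Back-of-envelope check of the quoted throughput numbers.
def steps_per_epoch(n_patches: int, batch_size: int) -> int:
    return n_patches // batch_size

def experiments_per_night(budget_s: int, overhead_s: int = 55,
                          hours: float = 8.0) -> float:
    """Experiments fitting in one night, assuming ~55 s overhead per experiment."""
    return hours * 3600 / (budget_s + overhead_s)
```

With 10K patches and batch size 16, `steps_per_epoch` gives 625; with a 600 s budget, `experiments_per_night` lands near 44.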
## Metric Choice
Primary gate: val_psnr (higher is better). Universal, easy to compute, sensitive to small improvements. The keep/discard logic is inverted from the original autoresearch (which minimized val_bpb): keep if higher, not lower.
Secondary (logged but not gated): val_ssim. If PSNR goes up but SSIM drops, the agent may be finding a blurring trick that exploits PSNR. Investigate but do not automatically reject.
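The inverted gate and the SSIM sanity check can be expressed in a few lines; these function names are illustrative, not taken from the autoresearch codebase:

```python
# Sketch of the inverted keep/discard gate: the original autoresearch
# minimized val_bpb, here a HIGHER val_psnr wins. Names are illustrative.
def should_keep(val_psnr: float, best_psnr: float) -> bool:
    return val_psnr > best_psnr  # inverted: keep if higher, not lower

def flag_suspicious(delta_psnr: float, delta_ssim: float) -> bool:
    """Log-only check: PSNR up while SSIM down hints at a blurring exploit."""
    return delta_psnr > 0 and delta_ssim < 0
```

A flagged experiment is still kept by the gate; the flag is a prompt for human review, matching the "investigate but do not automatically reject" policy.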
Note: for full evaluation of final models, see Evaluation Strategy for the complete metric stack including LR-consistency checks and sharpness gain. The autoresearch metric is deliberately simple for fast iteration.
## Anti-Cheating Measures
Autoresearch agents are surprisingly creative at gaming metrics. Five safeguards:
1. Evaluation lives in `prepare.py` — the immutable file. The agent literally cannot modify how the metric is computed.
2. Bicubic baseline is established — the model must beat bicubic interpolation or the experiment is a failure.
3. Output clamped to [0,1] in evaluation — prevents the agent from exploiting unbounded outputs.
4. Validation patches come from different scenes than training — no data leakage possible.
5. Random seed pinned for evaluation only — reproducible comparisons between experiments while allowing training stochasticity.
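Safeguards 3 and 5 are the easiest to get subtly wrong, so here is one way to implement them; `EVAL_SEED` and the helper names are assumptions for this sketch:

```python
# Sketch of safeguards 3 and 5: clamp before scoring, and pin a local RNG
# for evaluation only, leaving training stochasticity untouched.
import numpy as np

EVAL_SEED = 1234  # assumed constant living in prepare.py

def eval_batch_indices(n_val: int, batch: int) -> np.ndarray:
    """Same validation subset every experiment, so scores are comparable."""
    rng = np.random.default_rng(EVAL_SEED)  # local generator, not the global seed
    return rng.choice(n_val, size=batch, replace=False)

def score(pred: np.ndarray, target: np.ndarray) -> float:
    pred = np.clip(pred, 0.0, 1.0)  # unbounded outputs cannot inflate PSNR
    mse = float(np.mean((pred - target) ** 2))
    return 10.0 * np.log10(1.0 / mse) if mse > 0 else float("inf")
```

Using a local `Generator` rather than seeding the global RNG keeps the training loop's randomness independent of evaluation.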
See also: Cerebras: How to stop your autoresearch loop from cheating for general anti-cheating principles.
## What the Agent Will Likely Try
Based on published autoresearch results and SR domain knowledge:
Early wins (experiments 1-20): Loss function changes (L1 to Charbonnier to mixed), optimizer tuning (learning rate, scheduler), basic architecture changes (more/fewer residual blocks, wider channels), batch size adjustments.
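Of the early loss-function moves, Charbonnier is the canonical one; a NumPy sketch follows (a real `train.py` would use the torch equivalent):

```python
# Charbonnier loss, a smooth L1 variant agents commonly reach early.
# NumPy sketch; epsilon value is a typical choice, not prescribed.
import numpy as np

def charbonnier(pred: np.ndarray, target: np.ndarray,
                eps: float = 1e-3) -> float:
    """sqrt(diff^2 + eps^2): behaves like L2 near zero, like L1 for large errors."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps * eps)))
```

Unlike plain L1, the loss is differentiable at zero error, which tends to stabilize SR training.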
Mid-run (20-50): Attention mechanisms, different upsampling strategies (PixelShuffle vs transposed conv vs interpolation), progressive upsampling, mixed precision training.
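PixelShuffle (depth-to-space) is worth making concrete, since it is the upsampling choice most SR backbones settle on; this NumPy sketch mirrors the `torch.nn.PixelShuffle` channel layout:

```python
# Depth-to-space ("PixelShuffle") upsampling: trades r*r channels for an
# r-times larger spatial grid, with no learned transposed convolution.
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """x: (C*r*r, H, W) -> (C, H*r, W*r), matching torch.nn.PixelShuffle layout."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split the two scale factors out of channels
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

Because it is a pure rearrangement, it avoids the checkerboard artifacts transposed convolutions can introduce.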
Later (50+): Transformer blocks, RRDB modules, multi-scale feature extraction, frequency-domain loss branches.
## The Phased Approach
The scope of MFSR (alignment + fusion + reconstruction) is too large for a single `train.py`. The recommended approach is phased:
Phase 1: Single-frame SR autoresearch on existing tile data.
- Proves the autoresearch pattern works on LWIR
- Discovers the optimal backbone architecture (encoder-decoder, residual blocks, attention type)
- Uses existing training tiles — no new data collection needed
- ~44 experiments overnight
Phase 2: Manual MFSR integration.
- Take the winning single-frame architecture from Phase 1
- Manually integrate it into the multi-frame pipeline with alignment and temporal fusion
- This is too complex for the agent to discover from scratch in a single `train.py`
Phase 3: Autoresearch on fusion/loss components.
- Once the MFSR pipeline is working, freeze the architecture
- Run autoresearch again to optimize fusion strategy and loss function weights
- Narrower scope = more productive agent iterations
## Critical Gotchas
- **Dataset size matters.** Fewer than 1K patches produces noisy metrics. Use 5K+ training patches and 500+ validation patches.
- **Normalize consistently.** The 16-bit to float32 [0,1] conversion must happen in `prepare.py` only.
- **Watch VRAM.** The agent may try a 50M-parameter model that OOMs. State the VRAM budget explicitly in `program.md`.
- **Results are not portable.** Winning architectures on your specific GPU, dataset, and time budget may differ at production scale. Use autoresearch to discover promising directions, then validate with full-length training.
- **Do not freeze components across rounds.** Co-optimize in a single run rather than sequentially freezing pieces, which risks converging to local optima.
## Results from Others
| Project | Experiments | Kept | Result |
|---|---|---|---|
| Karpathy (GPT-2) | 276 | 29 | 11% training speedup on already-optimized code |
| Shopify (Tobi Lütke) | 37 | — | 0.8B model scored 19% higher than manually-tuned 1.6B |
| Shopify Liquid | ~120 | 93 | 53% faster parse+render, 61% fewer allocations |
| AutoKernel | — | — | 18 TFLOPS to 187 TFLOPS |
| Vesuvius Challenge | continuous | — | Cross-scroll generalization nearly doubled |
No existing SR fork of autoresearch has been published. We would be the first.
## LLM Backend
Use Claude Sonnet via the API. At roughly $0.10 per experiment, a 100-experiment overnight run costs about $10. Use Opus for more creative architectural changes if budget allows. The GPU should be 100% dedicated to training — do not run a local LLM simultaneously.
## Comparison: Autoresearch vs AI Scientist v2
| | Autoresearch | AI Scientist v2 |
|---|---|---|
| Scope | Single metric optimization | Full research pipeline to paper |
| Output | Optimized code + results | Scientific manuscript |
| Cost | $5-20 for 100 experiments | $25-35 per pipeline |
| Our use case | Optimize a specific model | Not needed |
For our purposes, autoresearch is the right tool. We want to optimize, not write papers.