Struggle: PIUnet Could Not Beat Bicubic Upsampling¶
Context¶
At the November 3, 2025 boss meeting, PIUnet inference was run on 3 test sequences from flight 21052 and compared against bicubic upsampling of the same inputs. The results were unambiguous (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 79-101):
| Sequence | PIUnet PSNR | Bicubic PSNR | Delta |
|---|---|---|---|
| seq000 | 21.64 dB | 22.64 dB | -0.99 dB |
| seq001 | 19.40 dB | 19.75 dB | -0.35 dB |
| seq002 | 19.86 dB | 20.27 dB | -0.41 dB |
PIUnet was worse than bicubic on every sequence, by every metric (PSNR and SSIM). A deep learning MFSR network with 8 input frames, trained on hundreds of tiles, was producing objectively worse output than a parameter-free interpolation method.
The Hypothesis¶
A deep learning multi-frame super-resolution network should be able to outperform trivial single-image upsampling. PIUnet has access to 8 temporally distinct observations of the same ground scene, each captured from a slightly different viewpoint. In principle, these sub-pixel offsets contain information that bicubic cannot exploit. The hypothesis was that PIUnet would learn to fuse this information and produce sharper, more accurate super-resolution output.
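The hypothesis is easy to demonstrate on toy data. The sketch below (hypothetical 1-D signal, not the flight 21052 imagery or the PIUnet pipeline) shows that LR observations sampled at different sub-pixel phases jointly contain information no single-frame upsampler can recover:

```python
import numpy as np

# Toy 1-D sketch (assumed data): four LR observations of one HR signal,
# each decimated at a different sub-pixel phase.
rng = np.random.default_rng(0)
hr = rng.standard_normal(400)                      # "ground truth" signal
scale = 4
lr_frames = [hr[o::scale] for o in range(scale)]   # phase-offset decimations

# Shift-and-add fusion: interleave the phase-offset samples back onto the
# HR grid -- exact here because the four phases tile the grid perfectly.
fused = np.empty_like(hr)
for o, frame in enumerate(lr_frames):
    fused[o::scale] = frame

# Single-frame "upsampling": nearest-neighbour repeat of one LR frame.
single = np.repeat(lr_frames[0], scale)

mse_fused = float(np.mean((fused - hr) ** 2))      # 0: perfect recovery
mse_single = float(np.mean((single - hr) ** 2))    # large residual
```

In this idealized case the offsets are known and exact, so fusion is lossless; the rest of this page is about what happens when they are not.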
The Failure Mode¶
PIUnet scored approximately 0.35-0.99 dB below bicubic on all 3 test sequences. Both PSNR and SSIM were consistently worse. The model wasn't just failing to improve — it was actively degrading the output relative to the trivial baseline.
Visual inspection confirmed the numbers: PIUnet outputs showed subtle artifacts and slightly softer edges than bicubic, despite having 8 frames of information to work with.
The Root Cause¶
No single root cause — this was a convergence of multiple failures:
1. Spatially-invariant alignment (TERN)¶
PIUnet's alignment module, TERN, uses a single 5x5 spatially-invariant kernel to align each supporting frame to the reference frame (source: raw/research/brainstorming_notes.md, lines 227-229). This means the same alignment is applied everywhere in the image, regardless of local content. Real misalignment between aerial thermal frames includes:
- Parallax from tall objects — trees and buildings shift differently depending on camera position. A single global transform cannot correct this (source: raw/research/brainstorming_notes.md, lines 33-44).
- Non-rigid distortions — atmospheric turbulence, lens distortion residuals, and rolling shutter effects create spatially varying misalignment.
- Sub-pixel registration error — the homographies used to register frames to the mosaic introduce their own error.
When alignment is wrong, the fusion module receives misaligned features and produces blurred or artifact-laden output. The network may actually be learning to suppress high-frequency content to minimize loss on misaligned pixels — producing output that is systematically softer than bicubic.
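A minimal numpy sketch of the failure (toy image and shifts, not the real TERN weights): a spatially-invariant kernel amounts to one displacement applied everywhere, so correcting one region's misalignment worsens another's.

```python
import numpy as np

rng = np.random.default_rng(1)
ref = rng.standard_normal((32, 64))    # toy reference frame

# Supporting frame: left half shifted +1 column, right half shifted -1
# column -- a crude stand-in for parallax between ground and structures.
support = np.empty_like(ref)
support[:, :32] = np.roll(ref[:, :32], 1, axis=1)
support[:, 32:] = np.roll(ref[:, 32:], -1, axis=1)

# "Alignment" by a single global -1-column shift, which is all a single
# spatially-invariant shift kernel can express: it fixes the left half
# but doubles the displacement error on the right half.
aligned = np.roll(support, -1, axis=1)
err_left = float(np.mean((aligned[:, 2:30] - ref[:, 2:30]) ** 2))
err_right = float(np.mean((aligned[:, 34:62] - ref[:, 34:62]) ** 2))
```

Here `err_left` is exactly zero while `err_right` is worse than doing nothing — a spatially-varying (per-pixel or per-region) displacement field is needed instead.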
2. PSNR penalizes registration error, not SR quality¶
Plain PSNR computes a pixel-wise difference between the SR output and the ground truth mosaic patch. If there is even a 1-pixel registration error between the SR output and the mosaic, high-frequency regions (edges, boundaries) get heavily penalized. This means:
- A sharp but slightly misaligned SR output scores worse than a blurry but registration-tolerant bicubic output.
- The training loss (L1 pixel loss) has the same problem — it rewards blur over sharpness when registration is imperfect.
- The evaluation metric and the training objective are both corrupted by the same registration noise.
This is documented further in Inference Evaluation.
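The effect is easy to reproduce. In this sketch (toy edge-heavy image, hypothetical values), a pixel-perfect output carrying a 1-pixel registration error scores lower PSNR than a perfectly registered but blurred one:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak**2 / mse)

# Toy edge-heavy "ground truth" (random binary image, not real data).
rng = np.random.default_rng(2)
gt = (rng.random((64, 64)) > 0.5).astype(float)

# Candidate A: pixel-perfect content with a 1-px registration error.
sharp_shifted = np.roll(gt, 1, axis=1)

# Candidate B: perfectly registered but blurred (3x3 box filter).
pad = np.pad(gt, 1, mode="edge")
blurred = sum(pad[i:i + 64, j:j + 64] for i in range(3) for j in range(3)) / 9.0

psnr_shifted = psnr(sharp_shifted, gt)   # sharp but misaligned: low score
psnr_blurred = psnr(blurred, gt)         # blurry but aligned: higher score
```

Under this metric, blur is the winning strategy — exactly the behavior the trained network appears to have converged to.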
3. Training data limitations¶
The training set was 673-675 tiles from a single site (Carinalli Ranch), with only 8-9 frames per tile, extracted from the few regions where enough frame overlap existed (source: raw/research/brainstorming_notes.md, lines 11-13). The model had:
- No diversity — one crop type, one thermal profile, one set of parallax conditions.
- Small effective dataset — 675 tiles is modest for a deep SR network.
- Training/inference mismatch — tiles came from dense-overlap pockets, but inference ran on arbitrary regions (source: raw/research/brainstorming_notes.md, lines 26-30).
4. End-to-end training instability¶
As documented in Training Instability, the end-to-end training of alignment + fusion + SR created gradient competition that may have prevented the model from converging to a useful solution at all.
The Anti-Pattern¶
Do not evaluate MFSR with plain PSNR when the ground truth has registration noise.
Concrete rules:
- Use shift-tolerant metrics for any evaluation involving homography-warped ground truth. ProbaV-style shift-agnostic scoring (search over sub-pixel translations, report the best PSNR) directly addresses this. Contextual Loss and E-LPIPS are alternatives (source: raw/research/brainstorming_notes.md, lines 86-94).
- Use LR-consistency as a complementary metric. Downsample the SR output and compare it against the original LR input — this requires no ground-truth alignment at all and catches hallucination (source: raw/research/brainstorming_notes.md, lines 269-275).
- Always report bicubic as a baseline. If your MFSR network cannot beat bicubic, something is fundamentally wrong — do not ship results or draw conclusions until this is resolved.
- Address alignment before reconstruction. If the alignment module is broken (spatially-invariant where it needs to be spatially-varying), no amount of fusion or reconstruction improvement will help. Fix alignment first.
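The first two rules can be sketched in a few lines. This is a whole-pixel simplification: the ProbaV protocol also searches sub-pixel shifts and compensates a brightness bias, and block-averaging is only a stand-in for the real degradation model.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak**2 / mse)

def shift_agnostic_psnr(sr, gt, max_shift=2):
    """Best PSNR over integer translations within +/- max_shift pixels,
    cropping borders so wrapped-around pixels are ignored."""
    m = max_shift
    best = -np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            cand = np.roll(sr, (dy, dx), axis=(0, 1))
            best = max(best, psnr(cand[m:-m, m:-m], gt[m:-m, m:-m]))
    return best

def lr_consistency_psnr(sr, lr, scale=2):
    """Block-average the SR output back to LR resolution and compare to
    the LR input; needs no registered ground truth at all."""
    h, w = lr.shape
    down = sr[: h * scale, : w * scale].reshape(h, scale, w, scale).mean(axis=(1, 3))
    return psnr(down, lr)
```

With `shift_agnostic_psnr`, a sharp output carrying a 1-pixel registration error recovers the score it would have earned if aligned, instead of being penalized below a blurry baseline.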
Related Pages¶
- Training Instability — the training problems that prevented convergence
- Inference Evaluation — the evaluation pipeline that obscured the problem
- Experiment Tracking — the tracking failures that made diagnosis impossible