Struggle: Full-Frame Inference Evaluation Was Painful and Unreliable¶
Context¶
After training PIUnet on tiles, the next step was evaluating it on full-frame inference: run the model on an 8-frame MFSR stack from flight 21052 (HA, ~1218m ASL), produce a ~2200x3000 pixel SR output at 3x, and compare it against the corresponding region of the flight 21051 mosaic (LA, ~800m ASL). This comparison required warping the SR output into mosaic coordinate space using homographies, extracting matching mosaic patches, and computing pixel-level metrics.
The inference pipeline (piunet/inference/run_piunet_inference_and_compare.py) orchestrated this full chain: load model, run inference, register to mosaic, extract ground truth patches, compute PSNR/SSIM, and generate visualizations (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 182-189).
The Hypothesis¶
We could evaluate SR quality by warping the SR output to mosaic space and computing pixel-level metrics against the LA mosaic ground truth. The mosaic is our best available reference for what the ground looks like at high resolution, so comparing against it should give a meaningful quality signal.
The Failure Mode¶
The evaluation was unreliable at multiple levels:
Registration error contaminated all metrics¶
Every step in the comparison chain added noise:
- HA frames to mosaic registration. Each input frame was registered to the LA mosaic using homographies stored in refined_to_mosaic/ (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 70-72). These homographies carry sub-pixel to multi-pixel error, especially near frame edges and around tall objects with parallax.
- SR output to mosaic warping. The SR output lives in the reference frame's coordinate space. Mapping it to the mosaic requires composing homographies, each with its own error budget.
- Mosaic construction artifacts. The LA mosaic itself is a computational construct: a stitched composite of many LA frames. Seam lines, blending artifacts, and the mosaic's own registration error all live in the "ground truth."
The net result was a 1-2 pixel registration-error floor between the SR output and the mosaic, even with perfect super-resolution. This floor alone could account for PSNR penalties of several dB in high-frequency regions (edges, thermal boundaries).
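The scale of this floor is easy to underestimate. A minimal NumPy sketch (illustrative only, not part of the pipeline) shows how a single-pixel misalignment dominates PSNR on high-frequency content, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
# Worst-case synthetic scene: pure high-frequency texture.
img = rng.random((256, 256))

# Case 1: small radiometric noise, perfect alignment.
noisy_psnr = psnr(img, img + rng.normal(0.0, 0.02, img.shape))

# Case 2: the *same* image, misaligned by a single pixel -- zero SR error.
shift_psnr = psnr(img, np.roll(img, 1, axis=1))

print(f"aligned + noise: {noisy_psnr:.1f} dB, 1-px shift: {shift_psnr:.1f} dB")
```

On real imagery the penalty is smaller in smooth regions, but it concentrates exactly where SR quality matters: edges and thermal boundaries.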
Only a few regions had enough overlap for MFSR stacks¶
The existing flight data (21052) had sparse frame overlap — only a handful of regions had 8-9+ overlapping HA frames sufficient for MFSR inference (source: raw/research/brainstorming_notes.md, lines 11-13). This meant:
- Only 3 test sequences were evaluated at the boss meeting (out of 39 available in group2, but most lacked sufficient overlap or had quality issues).
- Statistical power was near zero — you cannot make meaningful claims from 3 samples.
- The few testable regions might not be representative of the broader scene.
Results were hard to present convincingly¶
The boss meeting preparation revealed a practical problem: it was very hard to show a full-frame result and say "this is clearly better" (source: raw/research/brainstorming_notes.md, lines 107-109). The numbers said PIUnet was worse than bicubic, the visual comparisons were ambiguous (differences were subtle at full-frame scale), and the uncertainty about whether the metrics themselves were trustworthy undermined any conclusion.
An earlier processing run (PIUnet_Inference_Results_FIXED) had reported only 8 dB PSNR due to a normalization bug: a ~4560 DN intensity offset from incorrect scaling (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 27-30). That a normalization error this severe could slip through highlighted how fragile the evaluation pipeline was.
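A cheap guard against this class of bug is a radiometric sanity check run before any metric is computed. A hedged sketch (the function name and threshold are illustrative, not from the repository):

```python
import numpy as np

def radiometric_sanity_check(sr, lr, max_offset_dn=50.0):
    """Flag gross normalization/scaling errors before computing any metric.

    Compares global intensity statistics of the SR output against the LR
    input; a large mean offset (like the ~4560 DN scaling bug) fails fast
    here instead of silently tanking PSNR downstream.
    """
    offset = float(np.mean(sr) - np.mean(lr))
    if abs(offset) > max_offset_dn:
        raise ValueError(f"SR output offset by {offset:.0f} DN vs input -- "
                         "check normalization/denormalization")
    return offset
```

The threshold should be loose enough to tolerate legitimate radiometric differences between frames but tight enough to catch a scaling bug by orders of magnitude.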
The Root Cause¶
The evaluation required running the full pipeline in reverse — from SR output space back to mosaic space — and every step added noise. The fundamental problem is that there is no clean, aligned ground truth for full-frame MFSR inference on this data. The LA mosaic is the closest approximation, but the path from SR output to mosaic comparison is too long and too noisy to produce trustworthy pixel-level metrics.
Specific issues:
- Homography-based registration is insufficient for pixel-accurate comparison between frames taken at different altitudes and times. Parallax, atmospheric effects, and lens model residuals all contribute error that homographies cannot model.
- PSNR/SSIM are shift-sensitive metrics being applied to shift-contaminated data. These metrics assume perfect pixel alignment between prediction and ground truth. When that assumption is violated, they report registration error, not SR quality.
- The evaluation set was too small to average out these errors or to establish statistical significance.
The Anti-Pattern¶
Do not rely solely on GT-referenced metrics when ground truth alignment is imperfect. Use LR-consistency and self-consistency checks as primary evaluation signals.
Concrete rules:
- LR-consistency check (no ground truth needed). Downsample the SR output back to LR resolution and compare it against the original LR input. Any deviation is hallucination or radiometric error. This completely sidesteps the mosaic alignment problem (source: raw/research/brainstorming_notes.md, lines 269-275).
- Self-consistency checks. Leave-one-out prediction (run MFSR with N-1 frames and predict the held-out frame) and forward-backward cycle consistency provide quality signals without any external ground truth.
- Shift-tolerant metrics for GT comparison. When you must compare against a registered ground truth, use ProbaV-style shift-agnostic scoring (search over sub-pixel translations and report the best PSNR), or shift-tolerant perceptual metrics such as Contextual Loss (source: raw/research/brainstorming_notes.md, lines 86-94).
- Sharpness gain verification. Compute the high-pass energy ratio between the SR output and the bicubic-upsampled input. This confirms the network is actually adding high-frequency content, independent of ground truth alignment (source: raw/research/brainstorming_notes.md, lines 282-284).
- Patch-level statistics over full-frame. Decompose full-frame results into many local patches, compute metrics per patch, and report distributions with confidence intervals. This separates regions with good registration from regions with poor registration (source: raw/research/brainstorming_notes.md, lines 97-104).
- Never present results from fewer than 10 evaluation samples. 3 sequences is not enough to draw any conclusion. Use bootstrap resampling and effect size (Cohen's d) to determine practical significance.
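The LR-consistency, shift-tolerant, and sharpness-gain rules can be sketched in a few NumPy functions. These are illustrative implementations under simplifying assumptions (area-average downsampling as the sensor model, integer-only shift search; the full ProbaV protocol also handles sub-pixel shifts and a global brightness bias), not the project's actual code:

```python
import numpy as np

def downsample(img, scale):
    """Area-average downsampling -- a simple stand-in for the sensor model."""
    h = img.shape[0] // scale * scale
    w = img.shape[1] // scale * scale
    return img[:h, :w].reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def lr_consistency_rmse(sr, lr, scale):
    """Downsample SR back to the LR grid and compare with the LR input.
    Any deviation is hallucinated content or radiometric error; no mosaic
    registration is involved at all."""
    return float(np.sqrt(np.mean((downsample(sr, scale) - lr) ** 2)))

def shift_agnostic_psnr(sr, gt, max_shift=3, peak=1.0):
    """Best PSNR over all integer translations within +/-max_shift, with
    borders cropped so every candidate covers the same region."""
    m = max_shift
    ref = gt[m:-m, m:-m]
    best = -np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            cand = sr[m + dy:sr.shape[0] - m + dy, m + dx:sr.shape[1] - m + dx]
            mse = max(np.mean((cand - ref) ** 2), 1e-12)
            best = max(best, 10 * np.log10(peak ** 2 / mse))
    return best

def _laplacian(img):
    """3x3 Laplacian high-pass via shifts (periodic boundary)."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)

def sharpness_gain(sr, bicubic_up):
    """High-pass energy ratio; > 1 means SR added detail beyond bicubic."""
    return float(np.sum(_laplacian(sr) ** 2) /
                 max(np.sum(_laplacian(bicubic_up) ** 2), 1e-12))
```

A faithful SR output scores near-zero LR-consistency RMSE and a sharpness gain above 1; the shift search deliberately forgives the 1-2 pixel registration floor instead of penalizing it.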
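For the patch-level statistics and sample-size rules, a minimal sketch of bootstrap confidence intervals and Cohen's d over per-patch metric samples (function names are illustrative):

```python
import numpy as np

def bootstrap_ci(samples, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, float)
    boot = np.array([stat(rng.choice(samples, samples.size, replace=True))
                     for _ in range(n_boot)])
    return float(np.quantile(boot, alpha / 2)), float(np.quantile(boot, 1 - alpha / 2))

def cohens_d(a, b):
    """Effect size between two metric samples (e.g. per-patch PSNRs of
    PIUnet vs bicubic); pooled-standard-deviation form."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1))
                     / (a.size + b.size - 2))
    return float((a.mean() - b.mean()) / pooled)
```

Applied to per-patch PSNRs, the interval width makes the n=3 problem visible immediately: with three samples the bootstrap CI is typically far too wide to support any claim either way.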
Related Pages¶
- Bicubic Gap — the result that this evaluation pipeline revealed (and possibly distorted)
- Training Instability — the training problems that produced the model being evaluated
- Experiment Tracking — tracking failures that made it hard to link evaluation results to training configs