Evaluation Strategy for LWIR MFSR¶
Evaluating super-resolution quality for LWIR thermal imagery is harder than for visible-light natural images. Standard metrics either penalize registration error rather than SR quality, or rely on perceptual models trained on RGB data that do not transfer to single-channel thermal. This article documents the revised evaluation approach developed after the limitations of the flight 21051/21052 evaluation became clear. See Existing Data Limitations for the data problems that compound evaluation difficulty.
Why PSNR Is Problematic for Our Case¶
PSNR remains the standard SR metric, and we will continue to report it. But it has specific failure modes in our setting:
- Registration error dominance. Comparing SR output to the LA mosaic ground truth requires warping through homographies. A single-pixel shift in a high-frequency region degrades PSNR far more than genuine SR quality differences do. Our comparison pathway has an estimated 1-2 pixel registration error floor, so PSNR measures alignment quality as much as SR quality.
- Mosaic is a computational construct. The LA mosaic itself is assembled from overlapping frames. It is a valid ground truth, but the comparison pathway (SR output -> warp -> compare to mosaic) adds noise at every step.
- Tiny evaluation set. With the existing data, only a handful of regions have sufficient frame overlap for MFSR inference. Statistical claims from this small sample are unreliable. See Existing Data Limitations for why.
PSNR is kept with caveats: report with confidence intervals, mask edge regions where homography error concentrates, and acknowledge the registration error floor.
Why LPIPS Does Not Work for Thermal¶
LPIPS and other perceptual metrics use feature extractors (typically VGG or AlexNet) trained on RGB natural images. These networks have learned perceptual features specific to three-channel visible-light imagery: color edges, texture patterns, semantic structures.
Single-channel LWIR thermal imagery has fundamentally different statistics. Thermal images have sparse gradients, narrow dynamic range, and no color information. Features that are perceptually important in thermal (subtle temperature gradients, radiometric fidelity) are not what RGB-trained networks detect. BRISQUE and NIQE have the same problem — their natural scene statistics are calibrated to visible-light photography.
Anti-pattern: Do not use RGB-pretrained perceptual metrics as quality indicators for single-channel thermal data without retraining or validation.
The LR-Consistency Check (Most Impactful Change)¶
This is the single most valuable evaluation method identified, because it requires no ground truth.
Principle: Downsample the SR output back to LR resolution and compare against the original LR input. Any systematic deviation represents either hallucination or radiometric error. If the network is doing its job correctly, the low-frequency content of the SR output must match the LR input; the network should only be adding high-frequency detail.
This is the SEN2SR / OpenSR-Test low-frequency constraint, applied as an evaluation metric.
Why it matters:
- Completely sidesteps the noisy mosaic alignment problem
- Can be computed for every SR output, not just the few regions with ground truth
- Serves as both a training constraint (loss term) and an evaluation metric
- A per-pixel map of the LR-consistency residual flags where the network is inventing content
Hallucination detection: The per-pixel LR-consistency residual map directly identifies hallucinated content without needing ground truth. OpenSR-Test uses exactly this approach for satellite imagery SR.
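A minimal sketch of the check, assuming a simple box-average downsampling operator (the sensor's true degradation model may differ; function names here are illustrative, not from our codebase):

```python
import numpy as np

def box_downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Box-average downsampling; stand-in for the true sensor degradation model."""
    h, w = img.shape
    return (img[: h - h % factor, : w - w % factor]
            .reshape(h // factor, factor, w // factor, factor)
            .mean(axis=(1, 3)))

def lr_consistency(sr: np.ndarray, lr: np.ndarray, factor: int):
    """Return the per-pixel residual map (on the LR grid) and its RMS.

    Large residuals flag regions where the SR output's low-frequency
    content deviates from the input, i.e. likely hallucination.
    """
    residual = box_downsample(sr, factor) - lr
    return residual, float(np.sqrt(np.mean(residual ** 2)))
```

The residual map lives on the LR grid, so it can be overlaid on the input frame to localize suspect regions without any ground truth.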
Sharpness Gain Verification¶
Confirms the network is actually adding high-frequency content rather than just smoothly interpolating.
Method: Compute the high-pass energy ratio between the SR output and bicubic-upsampled input. If the ratio is near 1.0, the network is not adding meaningful detail. If significantly above 1.0, the network is sharpening.
This metric is more trustworthy than PSNR for thermal imagery because it directly measures what we care about (added detail) without being confounded by registration error.
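A sketch of the ratio, using a 4-neighbour Laplacian as the high-pass filter (one reasonable choice among several; the bicubic reference image is assumed to be precomputed elsewhere):

```python
import numpy as np

def highpass_energy(img: np.ndarray) -> float:
    """Energy of a 4-neighbour Laplacian response: a simple high-pass proxy."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return float(np.sum(lap[1:-1, 1:-1] ** 2))  # drop wrap-around border

def sharpness_gain(sr: np.ndarray, bicubic: np.ndarray, eps: float = 1e-12) -> float:
    """High-pass energy ratio: near 1.0 means no added detail, >1 means sharpening."""
    return highpass_energy(sr) / (highpass_energy(bicubic) + eps)
```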
Self-Consistency Checks¶
These methods evaluate SR quality without any external ground truth:
- Leave-one-out prediction: Remove one frame from the MFSR stack, run SR on the remaining frames, then check whether the removed frame's content is consistent with the SR output. Disagreement indicates the SR is not faithfully representing the input data.
- Forward-backward cycle consistency: SR output downsampled back to LR should match the original (this is the LR-consistency check above). Going further: re-running SR on the downsampled output should produce a result consistent with the original SR output.
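The leave-one-out loop can be sketched as follows. This is schematic: `run_sr` is a placeholder for the actual MFSR model, and the projection back to the held-out frame is a plain box downsample, whereas real code would warp through the frame's registration first:

```python
import numpy as np

def box_down(img: np.ndarray, f: int) -> np.ndarray:
    h, w = img.shape
    return (img[: h - h % f, : w - w % f]
            .reshape(h // f, f, w // f, f).mean(axis=(1, 3)))

def leave_one_out_residuals(frames, run_sr, factor):
    """For each held-out frame, run SR on the rest and measure how well the
    SR output (projected back to LR) explains the held-out frame."""
    residuals = []
    for i in range(len(frames)):
        rest = [f for j, f in enumerate(frames) if j != i]
        sr = run_sr(rest)
        pred = box_down(sr, factor)  # stand-in; real code warps to the held-out frame
        residuals.append(float(np.sqrt(np.mean((pred - frames[i]) ** 2))))
    return residuals
```

High residual on one frame suggests either that frame is an outlier (bad registration, motion) or the SR output is not faithful to the stack.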
Shift-Tolerant and Shift-Agnostic Metrics¶
For cases where ground truth comparison is needed despite registration noise:
- ProbaV-style shift-agnostic scoring: Search over sub-pixel translations between SR output and ground truth, report the best PSNR found. This explicitly accounts for registration error.
- E-LPIPS / LPIPS-ST: Shift-tolerant variants of LPIPS. Less useful for thermal (see LPIPS limitations above) but worth noting for completeness.
- Contextual Loss: Matches features regardless of spatial alignment.
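A brute-force sketch of the shift-agnostic scoring, restricted to integer shifts (the actual ProbaV cPSNR also searches sub-pixel shifts; the mean brightness-bias correction it uses is included here):

```python
import numpy as np

def shift_agnostic_psnr(sr: np.ndarray, gt: np.ndarray,
                        max_shift: int = 3, data_range: float = 1.0) -> float:
    """Search integer translations up to +/-max_shift and report the best PSNR,
    so registration error within the search window is not penalized."""
    m = max_shift
    best = -np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            a = sr[m + dy: sr.shape[0] - m + dy, m + dx: sr.shape[1] - m + dx]
            b = gt[m: gt.shape[0] - m, m: gt.shape[1] - m]
            diff = a - b
            diff = diff - diff.mean()  # brightness-bias correction
            mse = float(np.mean(diff ** 2))
            best = max(best, 10 * np.log10(data_range ** 2 / max(mse, 1e-12)))
    return best
```

Report the shift window alongside the score: a larger window can only raise the number, so comparisons are only fair at a fixed `max_shift`.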
Frequency-Domain Evaluation¶
- MTF via slanted edge: Measures the modulation transfer function — how well the SR output preserves contrast at different spatial frequencies. Requires natural edges in the scene.
- sFRC (spectral Fourier Ring Correlation): Compares frequency content between SR output and reference. Can detect hallucination as frequency content that appears in the SR but not in the input data.
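A basic Fourier Ring Correlation sketch, to make the mechanism concrete (ring binning choices are illustrative):

```python
import numpy as np

def fourier_ring_correlation(a: np.ndarray, b: np.ndarray, n_rings: int = 16):
    """Correlation of Fourier coefficients of two same-size images per radial
    frequency ring. Values near 1 mean shared content at that frequency;
    low correlation at high frequencies between SR and reference flags
    detail present in one image but not the other (e.g. hallucination)."""
    A = np.fft.fftshift(np.fft.fft2(a))
    B = np.fft.fftshift(np.fft.fft2(b))
    h, w = a.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0.0, r.max(), n_rings + 1)
    frc = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r < hi)
        num = np.abs(np.sum(A[mask] * np.conj(B[mask])))
        den = np.sqrt(np.sum(np.abs(A[mask]) ** 2) * np.sum(np.abs(B[mask]) ** 2))
        frc.append(float(num / den) if den > 0 else 0.0)
    return np.array(frc)
```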
Recommended Evaluation Stack¶
| Metric | Requires GT? | Measures | Priority |
|---|---|---|---|
| LR-consistency residual | No | Hallucination, radiometric fidelity | Highest |
| Sharpness gain (HP energy ratio) | No | Added detail vs bicubic | High |
| PSNR/SSIM vs mosaic (shift-agnostic) | Yes | Overall reconstruction quality | Medium |
| Leave-one-out consistency | No | Multi-frame agreement | Medium |
| MTF / slanted edge | No | Spatial frequency response | Medium |
| sFRC | Partial | Hallucination at specific frequencies | Low |
| LPIPS / BRISQUE / NIQE | Yes/No | RGB perceptual quality (invalid for LWIR) | Do not use |
Statistical Rigor¶
Even with better metrics, statistical rigor matters:
- Patch gridding: Decompose the few full-frame results into many local patches for larger sample sizes
- Bootstrap resampling: Compute confidence intervals on all metrics
- Effect size (Cohen's d): Determine if improvements are practically meaningful, not just statistically significant
- New data benefit: Data Collection v2 with uniform overlap means evaluation can happen everywhere, not just in cherry-picked dense-overlap spots
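The bootstrap and effect-size steps above are standard; a minimal sketch over per-patch metric values:

```python
import numpy as np

def bootstrap_ci(values, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for the mean of per-patch scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = rng.choice(values, size=(n_boot, len(values)), replace=True).mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

def cohens_d(a, b) -> float:
    """Effect size between two sets of per-patch scores (e.g. SR vs bicubic)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return float((a.mean() - b.mean()) / pooled)
```

Note that patches from the same frame are correlated, so patch gridding inflates the effective sample size; a block bootstrap (resampling whole frames) is the conservative variant.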
Tooling¶
The pyiqa library provides comprehensive metric computation and should be the primary evaluation toolkit. Custom implementations are needed for the LR-consistency check and shift-agnostic PSNR scoring.