HighRes-net Paper

"HighRes-net: Recursive Fusion for Multi-Frame Super-Resolution of Satellite Imagery" by Deudon, Lin, Kalaitzis, Cornebise (ElementAI) and Goytom, Sankaran, Arefin, Kahou, Michalski (Mila). Published at ICLR 2020. The paper describes the architecture that won ESA's Kelvin competition on PROBA-V satellite multi-frame super-resolution.

Provenance: raw/papers/highres-net/highres-net.md

The ESA Kelvin Competition

The Kelvin competition was organized by ESA's Advanced Concepts Team to benchmark multi-frame super-resolution on the PROBA-V satellite dataset. The dataset contains 1450 scenes (RED and NIR bands) from 74 hand-selected Earth regions. Each scene has one 384x384 HR image (100m/pixel) and 9+ LR views at 128x128 (300m/pixel) -- a 3x scale factor. The metric was cPSNR (clear Peak Signal-to-Noise Ratio), which corrects for brightness bias and cloud masks. Scores are normalized so that <1.0 means "better than the ESA baseline."
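The metric can be sketched as follows. This is a simplified reading of the competition's scoring (it omits the brute-force search over small sub-pixel shifts that the official evaluation also performs), with images assumed normalized to [0, 1] and `clear_mask` marking cloud-free pixels:

```python
import numpy as np

def cpsnr(sr, hr, clear_mask):
    """Sketch of clear PSNR (cPSNR): correct the mean brightness bias
    over clear pixels, then compute PSNR over those pixels only."""
    sr = sr.astype(np.float64)
    hr = hr.astype(np.float64)
    m = clear_mask.astype(bool)
    bias = (hr[m] - sr[m]).mean()           # brightness bias b
    cmse = ((hr[m] - sr[m] - bias) ** 2).mean()
    return -10.0 * np.log10(cmse + 1e-12)   # images assumed in [0, 1]
```

Because the bias term is subtracted before the MSE, a reconstruction that is uniformly brighter or darker than the ground truth is not penalized, which matters for a satellite product whose absolute radiometry drifts between acquisitions.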

This is one of the few MFSR datasets with naturally occurring LR/HR pairs from different physical cameras, rather than synthetic downsampling. The paper makes this point explicitly: models trained on synthetically downsampled data learn to undo a simplistic low-pass filter, which does not generalize to real-world degradation.

Architecture: Encode, Fuse, Decode

HighRes-net is an encoder-decoder with three stages, totaling under 600K parameters.

Encode

Each LR view is concatenated channel-wise with a reference frame (the pixel-wise median of all views), forming a 2-channel input. This pair passes through:

  • Conv2d(2 -> 64, 3x3)
  • PReLU
  • 2 residual blocks (Conv-PReLU-Conv-PReLU with skip connection, 64 channels)
  • Conv2d(64 -> 64, 3x3)

The encoding is shared across all views (weight sharing). The reference-frame channel is the entire implicit co-registration mechanism -- no explicit shift estimation, optical flow, or learned alignment kernels. The paper calls this "implicit co-registration" and argues the network learns to extract per-view differences relative to the anchor.
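A minimal PyTorch sketch of this encoder, following the layer list above. Padding choices and module names are assumptions (the competition weights and exact code layout were not released):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Conv-PReLU-Conv-PReLU with a skip connection, 64 channels
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
        )

    def forward(self, x):
        return x + self.body(x)

class Encoder(nn.Module):
    # Shared across all LR views; the input is [view, reference] stacked
    # channel-wise, which is the paper's implicit co-registration.
    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(2, ch, 3, padding=1), nn.PReLU())
        self.blocks = nn.Sequential(ResidualBlock(ch), ResidualBlock(ch))
        self.tail = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, view, reference):
        x = torch.cat([view, reference], dim=1)  # (B, 2, H, W)
        return self.tail(self.blocks(self.head(x)))
```

The same `Encoder` instance is applied to every view, so each hidden state encodes that view's differences relative to the shared anchor.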

The ablation study confirms this matters: training without any reference frame performs worse than the ESA baseline (score 1.0131 vs. 1.0000). The median reference outperforms the mean (test score 0.9532 vs. 0.9690), presumably because the median is more robust to cloud-contaminated outlier pixels.

Recursive Fusion

Views are padded to the next power of 2 with zero-valued dummy frames (tracked by alpha masks). Fusion proceeds pairwise over log2(K) steps:

  1. Pair hidden states (first half with reversed second half)
  2. Concatenate the pair (128 channels)
  3. Pass through a shared residual block g_theta ("co-registration" block) that operates on the concatenated pair
  4. A fusion block f_theta (Conv2d + PReLU) squashes 128 -> 64 channels
  5. Alpha-residual skip connections zero out contributions from padded dummy views

The same (g_theta, f_theta) blocks are shared across all pairs and all depths. This weight sharing is what keeps the parameter count low and allows handling variable numbers of input views.

Key property: the recursive structure means information propagates globally -- after T = log2(K) steps, every view has influenced the final fused representation. But it is not permutation-invariant: the pairing order matters. The paper mitigates this by randomly shuffling inputs at training time, with a clearance-biased sampling strategy (softmax with temperature beta=50 over clearance scores).
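The pairing-and-fusion recursion above can be sketched as follows. The shared blocks match the paper's description; the exact residual wiring and alpha bookkeeping are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class FuseStep(nn.Module):
    # Shared g_theta (residual "co-registration" block on the
    # concatenated pair) and f_theta (Conv + PReLU, 2C -> C).
    def __init__(self, ch=64):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.PReLU(),
        )
        self.f = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.PReLU())

    def forward(self, pair):
        return self.f(pair + self.g(pair))  # residual skip around g_theta

def recursive_fuse(states, alphas, step):
    # states: (B, K, C, H, W) with K a power of 2 (zero-padded views);
    # alphas: (B, K, 1, 1, 1), zero for padded dummy views.
    while states.shape[1] > 1:
        k = states.shape[1] // 2
        alice, a_alpha = states[:, :k], alphas[:, :k]
        bob, b_alpha = states[:, k:].flip(1), alphas[:, k:].flip(1)
        pair = torch.cat([alice, bob], dim=2)            # (B, k, 2C, H, W)
        B, K2, C2, H, W = pair.shape
        fused = step(pair.reshape(B * K2, C2, H, W)).reshape(B, K2, -1, H, W)
        # alpha-residual skip: a zero-alpha (dummy) partner leaves
        # alice's state untouched
        states = alice + b_alpha * fused
        alphas = torch.clamp(a_alpha + b_alpha, max=1.0)
    return states[:, 0]
```

The same `FuseStep` is reused at every pairing and every depth, which is where the weight sharing and the tolerance for variable view counts come from.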

Decode (Upsample)

The fused 64-channel feature map at LR resolution is upsampled via:

  • ConvTranspose2d(64 -> 64, 3x3, stride=3) -- 3x spatial upsampling
  • PReLU
  • Conv2d(64 -> 1, 1x1) -- project to single output channel

No output activation. Values are unbounded and only clipped at evaluation time. This is a deliberate design choice -- the paper does not use residual learning (predicting a correction to a bicubic upsampled reference). The decoder directly predicts the SR image.

The optional residual path (bicubic upsample of the reference + learned residual) appears in the architecture table but was not the primary configuration used in the competition.
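A sketch of the decoder under the same assumptions as the encoder sketch. Note that `ConvTranspose2d` with kernel 3 and stride 3 (no padding) triples the spatial size exactly, since the output size is (H - 1) * 3 + 3 = 3H:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Upsamples the fused 64-channel LR feature map by 3x and projects
    # to one band; no output activation (values clipped only at eval).
    def __init__(self, ch=64):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, kernel_size=3, stride=3),  # 3x upsample
            nn.PReLU(),
        )
        self.proj = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, x):
        return self.proj(self.up(x))
```

With a 128x128 LR feature map, (128 - 1) * 3 + 3 = 384, matching the 384x384 HR target.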

ShiftNet: Registration-at-the-Loss

The paper's second major contribution is ShiftNet-Lanczos, which addresses a fundamental problem: the SR output and the HR ground truth are not pixel-aligned. Without correction, the loss function penalizes even a perfect reconstruction that happens to be shifted by a fraction of a pixel, forcing the network to learn blurry outputs as a compromise.

ShiftNet is a simplified HomographyNet that predicts two parameters (delta_x, delta_y) defining a global translation between the SR output and the HR target. The predicted shift is applied via separable 1D Lanczos convolution kernels, which handle both integer and sub-pixel shifts with minimal ringing artifacts.

Key design decisions:

  • Training only. ShiftNet is used only during training to provide better gradient signal. At test time, only HighRes-net runs -- ShiftNet (34M parameters, 99% in a single FC layer) is discarded.
  • Cooperative, not adversarial. Both networks minimize the same loss: L_theta,delta = loss(shifted_SR, HR) + lambda * ||delta||_2. The L2 regularization on the shift magnitude prevents ShiftNet from learning arbitrary large translations.
  • Differentiable end-to-end. Because Lanczos interpolation is differentiable, gradients flow from the registered loss back through ShiftNet into HighRes-net, improving fusion quality.
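The Lanczos shift can be illustrated in 1D: a windowed-sinc kernel evaluated at fractional offsets handles integer and sub-pixel shifts in one mechanism, and because it is just a convolution it stays differentiable. The kernel support (2a taps) and normalization here are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def lanczos_kernel(shift, a=3):
    """1D Lanczos-windowed sinc kernel realizing a (possibly sub-pixel)
    shift; np.sinc is the normalized sinc sin(pi x)/(pi x)."""
    taps = np.arange(-a + 1, a + 1, dtype=np.float64)  # 2a taps
    x = taps - shift
    k = np.sinc(x) * np.sinc(x / a)                    # Lanczos window
    k[np.abs(x) >= a] = 0.0
    return k / k.sum()                                 # preserve DC level

def shift_1d(signal, shift, a=3):
    # Apply the shift along one axis; a 2D shift uses two separable
    # 1D passes (rows then columns).
    return np.convolve(signal, lanczos_kernel(shift, a), mode="same")
```

For an integer shift the kernel collapses to a shifted impulse; for fractional shifts the mass is spread over neighboring taps with minimal ringing, which is exactly why it works as a differentiable registration layer inside the loss.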

The ablation shows the registered loss drops the test score from 0.9671 to 0.9532 -- a meaningful improvement that produces visibly sharper outputs.

Competition Results

Method                    Final Score
SRResNet (SISR)           1.0084
ESA Baseline              1.0000
SRResNet + ShiftNet       0.9995
ACT (ESA team)            0.9879
SRResNet-6 + ShiftNet     0.9794
HighRes-net               0.9488
HighRes-net+ (ensemble)   0.9477
DeepSUM                   0.9474

HighRes-net and DeepSUM were the top two approaches, within 0.0014 of each other. HighRes-net+ (ensemble of K=16 and K=32 models) marginally beat the single model but was still slightly behind DeepSUM on the final leaderboard.

Notable comparisons:

  • SISR (SRResNet) barely beats the baseline. This confirms MFSR genuinely extracts information from multiple views -- it is not just denoising.
  • ShiftNet matters. SRResNet alone scores 1.0084; with ShiftNet it drops to 0.9995. The registration-at-the-loss mechanism benefits even single-image approaches.
  • DeepSUM upsamples first, then fuses. This costs 9x more memory (3x3 spatial expansion before fusion) and takes days to train vs. 9 hours for HighRes-net. The two approaches achieve nearly identical quality, making HighRes-net far more efficient.

Key Design Assumptions

What worked well (on PROBA-V)

  1. Translation-only misalignment. PROBA-V views are acquired days apart from ~800km altitude. Object parallax is negligible (<0.1m for 50m-tall objects at 600m baseline). Misalignment is almost entirely sub-pixel global translation. This makes implicit co-registration via a reference channel sufficient, and ShiftNet's 2-parameter global shift model appropriate.

  2. Median reference as robust anchor. With cloud-contaminated views, the pixel-wise median naturally rejects outlier pixels without explicit cloud masking. The paper found that ignoring the provided quality masks entirely and relying on the median + clearance-biased sampling worked better than explicit mask usage.

  3. Fuse-then-upsample. Processing everything at LR resolution until the final decode step keeps memory and compute low. This is only viable when misalignment is small (sub-pixel), because large misalignments would be lost in the low-resolution feature space.

  4. Greyscale single-band. Each spectral band (RED, NIR) is processed independently as single-channel images. No cross-band information is exploited.
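The median anchor and the clearance-biased sampling are both simple to sketch. `beta=50` follows the paper; the exact form of the weighting is an assumption:

```python
import numpy as np

def median_reference(views):
    # views: (K, H, W) stack of LR frames; the pixel-wise median is
    # robust to cloud-contaminated outlier pixels in individual views.
    return np.median(views, axis=0)

def clearance_weights(clearances, beta=50.0):
    # Softmax with temperature beta over per-view clearance scores,
    # used to bias which views are sampled at training time.
    z = beta * np.asarray(clearances, dtype=np.float64)
    z -= z.max()                   # numerical stability
    w = np.exp(z)
    return w / w.sum()
```

With beta as high as 50, the sampling is sharply peaked: mostly-clear views dominate, while cloudy views are rarely drawn, which is how the model avoids explicit use of the quality masks.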

What did NOT transfer to our LWIR use case

The HighRes-net project page documents the full history of our adaptation attempts. The critical mismatches are:

  1. Non-translational misalignment. Our LWIR frames come from an aircraft at ~800-1200m AGL, not a satellite at 800km. At these altitudes, parallax is significant -- a 10m tall tree creates measurable perspective differences between frames captured seconds apart as the aircraft moves. The implicit co-registration (reference channel concatenation) cannot handle rotation, scale changes, or parallax-induced local deformations. This is the single biggest assumption mismatch.

  2. Narrow dynamic range. PROBA-V imagery spans a wide reflectance range (0-65535 uint16 across varied land cover). LWIR thermal frames occupy a narrow band (~29000-34000 in raw counts, corresponding to ~280-310K surface temperature). The signal-to-noise ratio of the super-resolution detail relative to the background is much lower. The original HRNet's unbounded decoder output worked fine for PROBA-V's wide range but produced instability when combined with normalization/denormalization schemes needed for LWIR's narrow range (see the residual learning explosion documented in HighRes-net).

  3. No quality masks. PROBA-V has cloud masks that enable clearance-biased sampling. Our LWIR frames have no equivalent -- there are no clouds at 800m AGL, but there are motion blur, thermal drift, and varying overlap regions. The clearance-based sampling strategy that prevented overfitting on PROBA-V has no direct analog.

  4. Different scale of sub-pixel shifts. PROBA-V shifts are sub-pixel and almost never exceed 2 pixels. Our LWIR frames can have multi-pixel shifts (5-20+ pixels at the LR scale) due to aircraft motion, GPS/INS error, and vibration. ShiftNet's global 2-parameter translation model is insufficient even as a loss-registration mechanism.

  5. No pretrained weights available. The competition-winning weights were never released. Training from scratch on LWIR was required regardless, but the architecture's simplicity (no explicit alignment, no uncertainty estimation) made it a poor starting point for a domain where alignment is the primary challenge.

The Transition to PIUnet

The paper's limitations for LWIR motivated the switch to PIUnet Architecture. The key architectural upgrades PIUnet offered:

Problem              HighRes-net                         PIUnet
Alignment            Implicit (reference channel)        TERN module (learned 5x5 registration kernels)
Fusion ordering      Order-dependent recursive pairing   Permutation-invariant mean pooling
Output calibration   None                                Uncertainty map (sigma_sr)
Residual learning    Optional, not primary               Core design (output = residual + bicubic reference)
Feature extraction   Simple residual blocks              3D convolutions + temporal self-attention (TEFA)

However, PIUnet's TERN alignment is still spatially-invariant (one kernel per frame, applied globally), which proved insufficient for the spatially-varying misalignment in our LWIR data. Both architectures ultimately struggled with the same fundamental problem: our frames need spatially-varying, content-aware alignment that neither implicit co-registration nor global learned kernels can provide. This is documented in Bicubic Gap.

Lessons for Future Work

The paper's emphasis on "real degradation, not synthetic" is directly relevant to our project. PROBA-V's natural LR/HR pairs from different cameras avoid the synthetic downsampling bias. Our LWIR dataset similarly uses naturally different resolution captures (different flight altitudes), which means we face the same advantage (no synthetic bias) but also the same challenge (unknown and spatially-varying degradation kernel).

The registration-at-the-loss idea remains sound even if ShiftNet's specific implementation (global 2-parameter translation) is too simple for our case. A more expressive registration-at-the-loss mechanism -- perhaps predicting a deformation field rather than a global shift -- could improve training for any MFSR architecture on our data.

The fuse-then-upsample efficiency argument also holds, but only if frames are well-registered before fusion. When registration is poor, upsampling first (as DeepSUM does) allows the network to work with more spatial information for alignment, at the cost of 9x memory.