Struggle: Training Instability in PIUnet End-to-End Training

Context

Between October and November 2025, multiple PIUnet training runs exhibited wildly inconsistent behavior. Some runs appeared to be learning — loss curves trended downward, tile-level PSNR improved — but then a seemingly minor change (learning rate tweak, different batch size, activation swap) would produce a run that showed no learning at all. The final model presented at the November 3, 2025 boss meeting scored 0.35-0.99 dB below bicubic on all three test sequences (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 80-100).

The frustration was compounded by the inability to isolate what actually changed between runs. Was it the hyperparameter? The random seed? A subtle data loading difference? The answer was usually "we don't know," which made every training run feel like a coin flip.

The Hypothesis

The working hypothesis: architecture improvements and hyperparameter tuning would produce consistent gains. If we tweaked the right knobs — learning rate schedule, loss function weights, activation functions — the model would converge to something clearly better than bicubic. The PIUnet architecture itself (TERN alignment + TEFA fusion + residual SR head) should be capable of learning useful multi-frame features.

The Failure Mode

Training runs were not reproducible. Specific symptoms:

  • Run-to-run variance exceeded the signal. Two runs with nominally identical configs could differ by 1+ dB on the same validation set, drowning any actual improvement from a deliberate change.
  • Apparent improvements vanished. A run that scored well on tiles would degrade at full-frame inference — the training/inference gap (source: raw/research/brainstorming_notes.md, lines 26-30) meant tile-level metrics were misleading.
  • No clear learning trajectory. Loss curves would plateau early or oscillate, making it impossible to distinguish "needs more epochs" from "fundamentally broken config."
  • Tweaking one component broke another. Improving the alignment module (TERN) could destabilize the fusion module (TEFA), and vice versa.

The Root Cause

Two interacting problems:

1. End-to-end gradient competition

PIUnet trains the alignment module (TERN), temporal fusion module (TEFA), and super-resolution reconstruction head simultaneously. Gradients from the final L1 pixel loss must propagate through all three stages, creating competing optimization pressures:

  • The alignment module wants to minimize registration error between frames.
  • The fusion module wants to learn temporal attention weights.
  • The SR head wants to reconstruct high-frequency detail.

These objectives are not naturally aligned in a single loss. The alignment gradients can interfere with the fusion gradients, causing training to oscillate rather than converge. The RASD architecture (NTIRE 2025 winner) addresses this explicitly with a two-stage strategy: train alignment first, freeze it, then train fusion and reconstruction (source: raw/research/brainstorming_notes.md, lines 215-220).
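The gradient path described above can be made concrete with a minimal PyTorch sketch. The three modules here are toy single-conv stand-ins, not the real TERN / TEFA / SR-head code — the point is only that a single pixel loss at the end sends gradients into all three stages at once:

```python
# Minimal sketch of the single-loss gradient path. The three modules
# are toy stand-ins, NOT the real TERN / TEFA / SR-head implementations.
import torch
import torch.nn as nn

align = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for TERN alignment
fuse  = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for TEFA fusion
sr    = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the SR head

frames = torch.randn(1, 3, 16, 16)
target = torch.randn(1, 3, 16, 16)

# One L1 pixel loss at the very end of the pipeline...
out = sr(fuse(align(frames)))
loss = nn.functional.l1_loss(out, target)
loss.backward()

# ...and its gradient reaches ALL three stages, so alignment weights get
# pushed around by reconstruction error rather than registration error.
for name, module in [("align", align), ("fuse", fuse), ("sr", sr)]:
    print(name, module.weight.grad is not None)
```

Every stage ends up optimized against the same reconstruction signal, which is exactly the coupling that staged training breaks.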

2. Lack of experiment tracking discipline

Weights & Biases was logging scalar metrics, but nothing linked the code version, config file, dataset version, and checkpoint together in a single auditable record. When a run produced good numbers, there was no way to know exactly what produced them. When a run failed, there was no way to diff it against a successful run. This is documented separately in Experiment Tracking.
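The missing piece was not more metrics but a single auditable record per run. A hypothetical sketch of what that record could look like, using only the standard library — the field names are illustrative, not an actual W&B schema:

```python
# Hypothetical per-run record tying code version, config, and dataset
# version together in one auditable blob. Field names are illustrative.
import hashlib
import json
import subprocess

def run_record(config: dict, dataset_version: str) -> dict:
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not inside a git repo, or git unavailable
    # Hash the canonicalized config so two runs can be diffed at a glance.
    cfg_bytes = json.dumps(config, sort_keys=True).encode()
    return {
        "git_commit": commit,
        "config": config,
        "config_hash": hashlib.sha256(cfg_bytes).hexdigest()[:12],
        "dataset_version": dataset_version,
    }

record = run_record({"lr": 1e-4, "batch_size": 8}, "v2025-10-frames")
print(json.dumps(record, indent=2))
```

With a record like this attached to every checkpoint, "what exactly produced these numbers?" becomes a lookup instead of a guess, and two runs can be diffed field by field.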

The combination was lethal: gradient competition made results inherently noisy, and poor tracking made it impossible to distinguish noise from signal.

The Anti-Pattern

Never train alignment and restoration end-to-end without a two-stage strategy.

Concrete rules:

  1. Stage 1 — Alignment only. Train the alignment module (TERN or its replacement) with a direct alignment loss (e.g., L1 between warped frames and reference). Freeze weights when converged.
  2. Stage 2 — Fusion + SR. Train the temporal fusion and super-resolution modules with the alignment module frozen. The reconstruction loss now has a clean gradient path.
  3. Optional Stage 3 — Fine-tuning. Unfreeze all modules with a very low learning rate for joint fine-tuning after both stages have converged independently.
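The freezing mechanics of stages 2 and 3 can be sketched in PyTorch. Again the modules are toy stand-ins for TERN, TEFA, and the SR head, and the learning rates are placeholders, not tuned values:

```python
# Sketch of the two-stage recipe above. Modules are toy stand-ins for
# TERN / TEFA / the SR head; learning rates are placeholders.
import torch
import torch.nn as nn

align = nn.Conv2d(3, 3, 3, padding=1)   # Stage 1 would train this alone
fuse  = nn.Conv2d(3, 3, 3, padding=1)
sr    = nn.Conv2d(3, 3, 3, padding=1)

# --- Stage 2: freeze alignment, train fusion + SR only ---
align.requires_grad_(False)
opt = torch.optim.Adam(
    list(fuse.parameters()) + list(sr.parameters()), lr=1e-4)

frames = torch.randn(1, 3, 16, 16)
target = torch.randn(1, 3, 16, 16)

before = align.weight.clone()
loss = nn.functional.l1_loss(sr(fuse(align(frames))), target)
opt.zero_grad()
loss.backward()
opt.step()

# Alignment weights are untouched: the reconstruction gradient now has
# a clean path that only updates fusion and SR parameters.
print(torch.equal(before, align.weight))

# --- Optional Stage 3: unfreeze everything at a much lower LR ---
align.requires_grad_(True)
opt = torch.optim.Adam(
    [p for m in (align, fuse, sr) for p in m.parameters()], lr=1e-6)
```

The key detail is that freezing happens at the parameter level (`requires_grad_(False)`) while the frozen module still runs in the forward pass, so the fusion and SR modules train against stable alignment outputs.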

This is not speculative — RASD (43.22 dB, NTIRE 2025) and DeepTrans (NTIRE 2025 runner-up) both use staged training for exactly this reason (source: raw/research/brainstorming_notes.md, lines 215-229).

Additionally: never draw conclusions from a single training run. Any claimed improvement must be validated across at least 3 runs with different seeds to separate signal from noise.
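A minimal illustration of that rule, with made-up PSNR numbers: treat per-seed scores as samples and only claim a win when the mean gap exceeds the run-to-run spread. This is a crude rule of thumb, not a formal significance test:

```python
# Illustrative sketch of the >=3-seeds rule. The PSNR values are made up.
import statistics

baseline  = [31.10, 31.45, 30.90]   # dB, three seeds of the old config
candidate = [31.60, 31.95, 31.40]   # dB, three seeds of the change

gap   = statistics.mean(candidate) - statistics.mean(baseline)
noise = max(statistics.stdev(baseline), statistics.stdev(candidate))

# Crude heuristic: believe the improvement only if the mean gap exceeds
# the larger per-config spread across seeds.
print(f"gap={gap:.2f} dB, noise={noise:.2f} dB, credible={gap > noise}")
```

Given the 1+ dB run-to-run variance observed above, a single-run "improvement" of a few tenths of a dB is indistinguishable from seed noise.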