Struggle: Training Instability in PIUnet End-to-End Training

Context

Between October and November 2025, multiple PIUnet training runs exhibited wildly inconsistent behavior. Some runs appeared to be learning — loss curves trended downward, tile-level PSNR improved — but then a seemingly minor change (learning rate tweak, different batch size, activation swap) would produce a run that showed no learning at all. The final model presented at the November 3, 2025 boss meeting scored 0.35-0.99 dB below bicubic on all three test sequences (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 80-100).

The frustration was compounded by the inability to isolate what actually changed between runs. Was it the hyperparameter? The random seed? A subtle data loading difference? The answer was usually "we don't know," which made every training run feel like a coin flip.

The Hypothesis

The working hypothesis: architecture improvements and hyperparameter tuning would produce consistent gains. If we tweaked the right knobs — learning rate schedule, loss function weights, activation functions — the model would converge to something clearly better than bicubic. The PIUnet architecture itself (TERN alignment + TEFA fusion + residual SR head) should be capable of learning useful multi-frame features.

The Failure Mode

Training runs were not reproducible. Specific symptoms:

  • Run-to-run variance exceeded the signal. Two runs with nominally identical configs could differ by 1+ dB on the same validation set, drowning any actual improvement from a deliberate change.
  • Apparent improvements vanished. A run that scored well on tiles would degrade at full-frame inference — the training/inference gap (source: raw/research/brainstorming_notes.md, lines 26-30) meant tile-level metrics were misleading.
  • No clear learning trajectory. Loss curves would plateau early or oscillate, making it impossible to distinguish "needs more epochs" from "fundamentally broken config."
  • Tweaking one component broke another. Improving the alignment module (TERN) could destabilize the fusion module (TEFA), and vice versa.

The Root Cause

Two interacting problems:

1. End-to-end gradient competition

PIUnet trains the alignment module (TERN), temporal fusion module (TEFA), and super-resolution reconstruction head simultaneously. Gradients from the final L1 pixel loss must propagate through all three stages, creating competing optimization pressures:

  • The alignment module wants to minimize registration error between frames.
  • The fusion module wants to learn temporal attention weights.
  • The SR head wants to reconstruct high-frequency detail.

These objectives are not naturally aligned in a single loss. The alignment gradients can interfere with the fusion gradients, causing training to oscillate rather than converge. The RASD architecture (NTIRE 2025 winner) addresses this explicitly with a two-stage strategy: train alignment first, freeze it, then train fusion and reconstruction (source: raw/research/brainstorming_notes.md, lines 215-220).
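The gradient path described above can be made concrete with a minimal PyTorch sketch. The three modules here are toy single-conv stand-ins, not the real TERN / TEFA / SR-head code — the point is only that a single pixel loss at the end sends gradients into all three stages at once:

```python
# Minimal sketch of the single-loss gradient path. The three modules
# are toy stand-ins, NOT the real TERN / TEFA / SR-head implementations.
import torch
import torch.nn as nn

align = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for TERN alignment
fuse  = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for TEFA fusion
sr    = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the SR head

frames = torch.randn(1, 3, 16, 16)
target = torch.randn(1, 3, 16, 16)

# One L1 pixel loss at the very end of the pipeline...
out = sr(fuse(align(frames)))
loss = nn.functional.l1_loss(out, target)
loss.backward()

# ...and its gradient reaches ALL three stages, so alignment weights get
# pushed around by reconstruction error rather than registration error.
for name, module in [("align", align), ("fuse", fuse), ("sr", sr)]:
    print(name, module.weight.grad is not None)
```

Every stage ends up optimized against the same reconstruction signal, which is exactly the coupling that staged training breaks.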

2. Lack of experiment tracking discipline

Weights & Biases was logging scalar metrics, but nothing linked the code version, config file, dataset version, and checkpoint together in a single auditable record. When a run produced good numbers, there was no way to know exactly what produced them. When a run failed, there was no way to diff it against a successful run. This is documented separately in Experiment Tracking.
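The missing piece was not more metrics but a single auditable record per run. A hypothetical sketch of what that record could look like, using only the standard library — the field names are illustrative, not an actual W&B schema:

```python
# Hypothetical per-run record tying code version, config, and dataset
# version together in one auditable blob. Field names are illustrative.
import hashlib
import json
import subprocess

def run_record(config: dict, dataset_version: str) -> dict:
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not inside a git repo, or git unavailable
    # Hash the canonicalized config so two runs can be diffed at a glance.
    cfg_bytes = json.dumps(config, sort_keys=True).encode()
    return {
        "git_commit": commit,
        "config": config,
        "config_hash": hashlib.sha256(cfg_bytes).hexdigest()[:12],
        "dataset_version": dataset_version,
    }

record = run_record({"lr": 1e-4, "batch_size": 8}, "v2025-10-frames")
print(json.dumps(record, indent=2))
```

With a record like this attached to every checkpoint, "what exactly produced these numbers?" becomes a lookup instead of a guess, and two runs can be diffed field by field.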

The combination was lethal: gradient competition made results inherently noisy, and poor tracking made it impossible to distinguish noise from signal.

The Anti-Pattern

Never train alignment and restoration end-to-end without a two-stage strategy.

Concrete rules:

  1. Stage 1 — Alignment only. Train the alignment module (TERN or its replacement) with a direct alignment loss (e.g., L1 between warped frames and reference). Freeze weights when converged.
  2. Stage 2 — Fusion + SR. Train the temporal fusion and super-resolution modules with the alignment module frozen. The reconstruction loss now has a clean gradient path.
  3. Optional Stage 3 — Fine-tuning. Unfreeze all modules with a very low learning rate for joint fine-tuning after both stages have converged independently.
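The freezing mechanics of stages 2 and 3 can be sketched in PyTorch. Again the modules are toy stand-ins for TERN, TEFA, and the SR head, and the learning rates are placeholders, not tuned values:

```python
# Sketch of the two-stage recipe above. Modules are toy stand-ins for
# TERN / TEFA / the SR head; learning rates are placeholders.
import torch
import torch.nn as nn

align = nn.Conv2d(3, 3, 3, padding=1)   # Stage 1 would train this alone
fuse  = nn.Conv2d(3, 3, 3, padding=1)
sr    = nn.Conv2d(3, 3, 3, padding=1)

# --- Stage 2: freeze alignment, train fusion + SR only ---
align.requires_grad_(False)
opt = torch.optim.Adam(
    list(fuse.parameters()) + list(sr.parameters()), lr=1e-4)

frames = torch.randn(1, 3, 16, 16)
target = torch.randn(1, 3, 16, 16)

before = align.weight.clone()
loss = nn.functional.l1_loss(sr(fuse(align(frames))), target)
opt.zero_grad()
loss.backward()
opt.step()

# Alignment weights are untouched: the reconstruction gradient now has
# a clean path that only updates fusion and SR parameters.
print(torch.equal(before, align.weight))

# --- Optional Stage 3: unfreeze everything at a much lower LR ---
align.requires_grad_(True)
opt = torch.optim.Adam(
    [p for m in (align, fuse, sr) for p in m.parameters()], lr=1e-6)
```

The key detail is that freezing happens at the parameter level (`requires_grad_(False)`) while the frozen module still runs in the forward pass, so the fusion and SR modules train against stable alignment outputs.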

This is not speculative — RASD (43.22 dB, NTIRE 2025) and DeepTrans (NTIRE 2025 runner-up) both use staged training for exactly this reason (source: raw/research/brainstorming_notes.md, lines 215-229).

Additionally: never draw conclusions from a single training run. Any claimed improvement must be validated across at least 3 runs with different seeds to separate signal from noise.
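A minimal illustration of that rule, with made-up PSNR numbers: treat per-seed scores as samples and only claim a win when the mean gap exceeds the run-to-run spread. This is a crude rule of thumb, not a formal significance test:

```python
# Illustrative sketch of the >=3-seeds rule. The PSNR values are made up.
import statistics

baseline  = [31.10, 31.45, 30.90]   # dB, three seeds of the old config
candidate = [31.60, 31.95, 31.40]   # dB, three seeds of the change

gap   = statistics.mean(candidate) - statistics.mean(baseline)
noise = max(statistics.stdev(baseline), statistics.stdev(candidate))

# Crude heuristic: believe the improvement only if the mean gap exceeds
# the larger per-config spread across seeds.
print(f"gap={gap:.2f} dB, noise={noise:.2f} dB, credible={gap > noise}")
```

Given the 1+ dB run-to-run variance observed above, a single-run "improvement" of a few tenths of a dB is indistinguishable from seed noise.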