Struggle: Training Instability in PIUnet End-to-End Training¶
Context¶
Between October and November 2025, multiple PIUnet training runs exhibited wildly inconsistent behavior. Some runs appeared to be learning — loss curves trended downward, tile-level PSNR improved — but then a seemingly minor change (learning rate tweak, different batch size, activation swap) would produce a run that showed no learning at all. The final model presented at the November 3, 2025 boss meeting scored 0.35-0.99 dB below bicubic on all three test sequences (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 80-100).
The frustration was compounded by the inability to isolate what actually changed between runs. Was it the hyperparameter? The random seed? A subtle data loading difference? The answer was usually "we don't know," which made every training run feel like a coin flip.
The Hypothesis¶
Architecture improvements and hyperparameter tuning would produce consistent gains. If we tweaked the right knobs — learning rate schedule, loss function weights, activation functions — the model would converge to something clearly better than bicubic. The PIUnet architecture itself (TERN alignment + TEFA fusion + residual SR head) should be capable of learning useful multi-frame features.
The Failure Mode¶
Training runs were not reproducible. Specific symptoms:
- Run-to-run variance exceeded the signal. Two runs with nominally identical configs could differ by 1+ dB on the same validation set, drowning any actual improvement from a deliberate change.
- Apparent improvements vanished. A run that scored well on tiles would degrade at full-frame inference — the training/inference gap (source: raw/research/brainstorming_notes.md, lines 26-30) meant tile-level metrics were misleading.
- No clear learning trajectory. Loss curves would plateau early or oscillate, making it impossible to distinguish "needs more epochs" from "fundamentally broken config."
- Tweaking one component broke another. Improving the alignment module (TERN) could destabilize the fusion module (TEFA), and vice versa.
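When variance of this magnitude appears between "identical" runs, the first diagnostic step is to pin every source of randomness. A generic PyTorch sketch of seed pinning (this is a standard pattern, not the project's actual training script):

```python
import os
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Pin the common sources of randomness for a training run.

    Full GPU determinism also requires deterministic cuDNN kernels,
    which can slow training; enable them only while debugging variance.
    """
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Trade speed for reproducibility while diagnosing run-to-run noise.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

With seeds pinned, two runs that still diverge point at a real nondeterminism source (data loader ordering, unseeded augmentation, non-deterministic kernels) rather than an intended config change.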
The Root Cause¶
Two interacting problems:
1. End-to-end gradient competition¶
PIUnet trains the alignment module (TERN), temporal fusion module (TEFA), and super-resolution reconstruction head simultaneously. Gradients from the final L1 pixel loss must propagate through all three stages, creating competing optimization pressures:
- The alignment module wants to minimize registration error between frames.
- The fusion module wants to learn temporal attention weights.
- The SR head wants to reconstruct high-frequency detail.
These objectives are not naturally aligned in a single loss. The alignment gradients can interfere with the fusion gradients, causing training to oscillate rather than converge. The RASD architecture (NTIRE 2025 winner) addresses this explicitly with a two-stage strategy: train alignment first, freeze it, then train fusion and reconstruction (source: raw/research/brainstorming_notes.md, lines 215-220).
2. Lack of experiment tracking discipline¶
Weights & Biases was logging scalar metrics, but nothing linked the code version, config file, dataset version, and checkpoint together in a single auditable record. When a run produced good numbers, there was no way to know exactly what produced them. When a run failed, there was no way to diff it against a successful run. This is documented separately in Experiment Tracking.
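One lightweight way to close that gap is a per-run manifest that snapshots the config, code version, and dataset identity before training starts. A hypothetical sketch (the function and field names are illustrative, not the project's actual tooling):

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def write_run_manifest(run_dir: str, config: dict, dataset_version: str) -> Path:
    """Record everything needed to reproduce (or diff) this training run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # e.g. running outside a git checkout

    config_blob = json.dumps(config, sort_keys=True)
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,
        "config": config,
        # A short hash lets two runs be compared at a glance
        # even when the full configs are large.
        "config_hash": hashlib.sha256(config_blob.encode()).hexdigest()[:12],
        "dataset_version": dataset_version,
    }
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

Two runs with the same `config_hash` but different metrics point at non-config causes (seed, data loading, code drift); different hashes identify the changed knob immediately.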
The combination was lethal: gradient competition made results inherently noisy, and poor tracking made it impossible to distinguish noise from signal.
The Anti-Pattern¶
Never train alignment and restoration end-to-end without a two-stage strategy.
Concrete rules:
- Stage 1 — Alignment only. Train the alignment module (TERN or its replacement) with a direct alignment loss (e.g., L1 between warped frames and reference). Freeze weights when converged.
- Stage 2 — Fusion + SR. Train the temporal fusion and super-resolution modules with the alignment module frozen. The reconstruction loss now has a clean gradient path.
- Optional Stage 3 — Fine-tuning. Unfreeze all modules with a very low learning rate for joint fine-tuning after both stages have converged independently.
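Mechanically, Stage 2 reduces to freezing the alignment parameters and handing the optimizer only what remains. A minimal PyTorch sketch (the module names mirror the PIUnet components, but the tiny conv layers here are stand-in placeholders):

```python
import torch
import torch.nn as nn

# Stand-in modules; the real TERN/TEFA/SR head are far larger.
model = nn.ModuleDict({
    "tern": nn.Conv2d(3, 3, 3, padding=1),    # alignment (trained in Stage 1)
    "tefa": nn.Conv2d(3, 3, 3, padding=1),    # temporal fusion
    "sr_head": nn.Conv2d(3, 3, 3, padding=1), # reconstruction
})

# Stage 2: freeze alignment so reconstruction gradients cannot perturb it.
for p in model["tern"].parameters():
    p.requires_grad_(False)
model["tern"].eval()  # also freezes any batch-norm style running statistics

# The optimizer sees only the trainable (fusion + SR) parameters.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# Optional Stage 3: unfreeze everything at a much lower learning rate.
# for p in model.parameters():
#     p.requires_grad_(True)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
```

Because frozen parameters never reach the optimizer, the alignment module is protected from the competing reconstruction gradients described above.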
This is not speculative — RASD (43.22 dB, NTIRE 2025) and DeepTrans (NTIRE 2025 runner-up) both use staged training for exactly this reason (source: raw/research/brainstorming_notes.md, lines 215-229).
Additionally: never draw conclusions from a single training run. Any claimed improvement must be validated across at least 3 runs with different seeds to separate signal from noise.
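The multi-run rule can be made mechanical: accept a change only when its mean gain clears the run-to-run noise. A plain-Python illustration (the PSNR figures are made-up placeholders, and a three-run sample gives only a coarse noise estimate):

```python
from statistics import mean, stdev


def improvement_is_credible(baseline_psnr: list[float],
                            candidate_psnr: list[float]) -> bool:
    """Require the mean gain to exceed the larger per-config spread.

    A crude screen, not a substitute for a proper significance test,
    but it blocks conclusions drawn from a single lucky run.
    """
    gain = mean(candidate_psnr) - mean(baseline_psnr)
    noise = max(stdev(baseline_psnr), stdev(candidate_psnr))
    return gain > noise


# Hypothetical validation PSNRs (dB) from 3 seeds each:
baseline = [31.2, 30.1, 31.8]   # spread ~0.9 dB
candidate = [31.5, 31.0, 31.9]  # mean gain ~0.4 dB: within the noise
print(improvement_is_credible(baseline, candidate))  # False
```

Here a 0.4 dB "improvement" is rejected because the baseline alone varies by roughly 0.9 dB across seeds — exactly the 1+ dB run-to-run variance that made single-run comparisons meaningless.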
Related Pages¶
- Experiment Tracking — the tracking failures that made this worse
- Bicubic Gap — the downstream consequence: PIUnet couldn't beat bicubic
- Inference Evaluation — the evaluation challenges that obscured the problem