Struggle: Experiment Tracking Did Not Track What Mattered¶
Context¶
During the PIUnet training campaigns of October-November 2025, Weights & Biases (W&B) had been set up and was logging scalar metrics (loss curves, PSNR, SSIM) from training runs. On the surface, experiment tracking was in place. In practice, it was nearly useless for diagnosing the training instability documented in Training Instability.
The brainstorming notes from April 2026 explicitly call this out as "experiment tracking chaos" and identify it as a process problem distinct from the architecture and data problems (source: raw/research/brainstorming_notes.md, lines 117-128).
The Hypothesis¶
W&B would track experiments and enable systematic comparison between training runs. By logging metrics, we would be able to identify which changes improved performance and which didn't, converging on an optimal configuration through informed iteration.
The Failure Mode¶
Metrics were logged, but the things that actually varied between runs were not:
- Code version was not recorded. There was no git commit hash attached to each run. When a training script was modified between runs (e.g., changing the loss function, adjusting data augmentation), the W&B entry had no record of what code produced it. Two runs labeled "L1 loss" might have used different implementations of L1 loss if the code was edited in between.
- Config was not fully captured. Some hyperparameters were logged, but the full configuration — every flag, every path, every preprocessing choice — was not saved as a single reproducible artifact. The PIUnet TODO document lists at least 6 independent axes of variation: activation function, normalization, loss function, deformable convolutions, pretraining strategy, and altitude conditioning (source: piunet/TODO_NEXT_IMPROVEMENTS.md, lines 152-297). Any of these could change between runs.
- Dataset version was not pinned. The training dataset evolved over time — from 673 tiles with 8 frames to 675 tiles with 9 frames (source: raw/meeting-notes/BOSS_MEETING_SUMMARY.md, lines 146-148). Preprocessing also changed (normalization, augmentation). Runs logged against different dataset versions were compared as if they used the same data.
- Checkpoints were not linked. When a training run produced a checkpoint, it was saved to disk but not associated with its W&B run in a way that enabled later retrieval. The model evaluated at the boss meeting (Results_Residual/model_best_20251031_0058.pt) was a file on disk with a timestamp — not a W&B artifact with full provenance.
The result: when training runs showed wildly different results (see Training Instability), there was no systematic way to determine why. Was it the learning rate change? The code edit? The dataset swap? The random seed? Every debugging session devolved into archaeology — grepping through terminal history, checking file modification times, trying to reconstruct what had been different.
The Root Cause¶
No systematic linking of code version + config + dataset + weights in the same tracking entry. W&B was used as a metric dashboard, not as a reproducibility system. The tools existed (W&B supports artifact logging, code saving, and config capture), but the workflow never enforced their use.
This is a common failure mode in research projects: tracking is set up for the outputs of experiments (metrics, curves) but not for the inputs (code, config, data). Without input tracking, you have a collection of numbers with no way to reproduce or explain any of them.
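Closing that gap is mostly a matter of wiring together features the tooling already has. A minimal sketch of capturing inputs at launch — plain Python, with the `collect_run_metadata` helper and the dataset identifier format being illustrative, not taken from the project:

```python
import platform
import subprocess
import sys

def collect_run_metadata(config: dict, dataset_id: str, seed: int) -> dict:
    """Gather the inputs that must accompany every run: code, config, data, seed, env."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        # Any uncommitted changes make the run irreproducible from the hash alone.
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"], text=True).strip())
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit, dirty = "unknown", True  # not a git checkout: flag as untracked

    return {
        **config,                      # every hyperparameter and path
        "git_commit": commit,
        "git_dirty": dirty,
        "dataset_id": dataset_id,      # e.g. "tiles-675_frames-9_norm-v2" (illustrative)
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

# At launch, the metadata dict becomes the W&B config (project name illustrative):
#   import wandb
#   run = wandb.init(
#       project="piunet",
#       config=collect_run_metadata(cfg, "tiles-675_frames-9", seed=42),
#       save_code=True,  # wandb.init can also upload the training script itself
#   )
```

Because the helper returns one flat dict, a single `wandb.init(config=...)` call records inputs and outputs in the same tracking entry, which is exactly the linkage this section says was missing.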
Contributing factors:
- Solo researcher. With one person managing the entire pipeline, there was no code review or peer pressure to maintain tracking discipline. It was always faster to "just run it" than to properly tag a run.
- Rapid iteration. The TODO document lists improvements expected to take 5 minutes to 2-3 hours each (source: piunet/TODO_NEXT_IMPROVEMENTS.md, lines 277-295). When changes are small and quick, the overhead of proper tracking feels disproportionate — until you can't figure out what changed.
- No automation. Tracking was manual. The training script didn't automatically capture git hash, full config, or dataset metadata. Anything not automated was eventually forgotten.
The Anti-Pattern¶
Never launch a training run without recording the git commit hash, full config, and dataset version in the same tracking entry.
Concrete rules:
- Every training run must log an immutable record containing:
    - Git commit hash (and dirty diff, if any uncommitted changes exist)
    - Full configuration file or argument dump (every hyperparameter, every path)
    - Dataset identifier (which tiles, which frame count, which preprocessing version)
    - Random seed
    - Hardware/environment info (GPU, CUDA version, PyTorch version)
- Checkpoints must be W&B artifacts (or equivalent), linked to their parent run. A checkpoint file on disk with only a timestamp is not traceable.
- Automate all of the above. The training script should capture this metadata automatically at launch. If it requires manual steps, it will eventually be skipped. A pre-training hook or wrapper script that refuses to start without a clean git state and complete config is ideal.
- Tag runs with semantic labels. Beyond automated metadata, every run should have a human-readable note explaining what is being tested — e.g., "testing GeLU vs LeakyReLU" or "frequency loss at 0.3 weight." This makes the W&B dashboard navigable months later.
- Compare runs by diffing their configs. Before drawing any conclusion from two runs, diff their full configs and code versions. If more than one thing changed, the comparison is confounded and the conclusion is invalid.
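The config-diff rule is mechanical enough to enforce in code. A sketch, assuming configs are flat dicts as captured at launch (the function names and sample run configs are illustrative):

```python
def config_diff(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key that differs between two run configs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

def comparable(a: dict, b: dict) -> bool:
    """Two runs support a causal conclusion only if exactly one tracked input changed."""
    return len(config_diff(a, b)) == 1

# Illustrative configs: both the loss AND the dataset changed between these runs,
# so any metric difference between them is confounded.
run_a = {"loss": "l1", "lr": 1e-4, "git_commit": "abc123",
         "dataset_id": "tiles-673_frames-8"}
run_b = {"loss": "l1_freq", "lr": 1e-4, "git_commit": "abc123",
         "dataset_id": "tiles-675_frames-9"}

print(config_diff(run_a, run_b))  # two keys differ: loss and dataset_id
print(comparable(run_a, run_b))   # False: comparison is confounded
```

Running this check before reading any loss curve turns "the comparison is confounded" from a judgment call into an automated refusal.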
This is a prerequisite for all future training work, not an optional nice-to-have. The brainstorming notes explicitly state: "This is a prerequisite before any more training runs" (source: raw/research/brainstorming_notes.md, lines 127-128).
Related Pages¶
- Training Instability — the training instability that this tracking failure made undiagnosable
- Bicubic Gap — the poor results that might have been diagnosed sooner with proper tracking
- Inference Evaluation — evaluation results that couldn't be linked back to training configs