Open Road Risk

Rank Stability Harness

Noise floor established; top-1% Jaccard 0.904 and pseudo-R² 0.323 ± 0.003 across five seeds.

How stable are Stage 2 XGBoost rankings across random-seed variation, and what is the noise floor for evaluating future feature additions?

Decision register entry: 2026-04-25 — 5-seed XGBoost rank stability — noise floor established

Question

How stable are Stage 2 XGBoost rankings across random-seed variation, and what is the noise floor for distinguishing a genuine feature-addition gain from seed-level churn?

Method

The harness retrained the Stage 2 XGBoost model across five random seeds (42–46), using the same grouped-by-link cross-validation design. Seed 42 is the production realisation, so the diagnostic also tested whether the production seed was representative of the seed set.
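The split design described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `grouped_folds`, the fold count, the `fold_seed` parameter, and the toy link IDs are all assumptions. It shows the one property the text relies on: every row belonging to a link lands in the same fold, and the fold assignment stays fixed while only the model seed changes.

```python
import random


def grouped_folds(link_ids, n_folds=5, fold_seed=0):
    """Assign each row to a fold so that all rows of a link share one fold.

    n_folds and fold_seed are illustrative assumptions, not project settings.
    """
    unique_links = sorted(set(link_ids))
    rng = random.Random(fold_seed)
    rng.shuffle(unique_links)
    # Deal the shuffled links round-robin into folds.
    fold_of_link = {link: i % n_folds for i, link in enumerate(unique_links)}
    return [fold_of_link[link] for link in link_ids]


# Hypothetical rows: each link ID can appear several times (e.g. one row per year).
rows = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
folds = grouped_folds(rows, n_folds=3)

# The fold assignment is computed once and reused for every model seed.
MODEL_SEEDS = [42, 43, 44, 45, 46]
runs = {seed: folds for seed in MODEL_SEEDS}
```

Keeping the folds fixed means any difference between runs can be attributed to the model seed rather than to a different train/test partition.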

For each seed, the run recorded:

  • held-out pseudo-R²;
  • top-k ranking overlap using pairwise Jaccard similarity;
  • full-rank Spearman correlation;
  • observed per-decile calibration stability.
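The two ranking metrics in the list above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: predictions are held as a hypothetical `{link_id: predicted_risk}` dict per seed, the helper names are invented here, and the toy values are made up. Spearman correlation is computed as the Pearson correlation of tie-averaged ranks.

```python
from itertools import combinations


def top_k_set(preds, k):
    """Return the IDs of the k links with the highest predicted risk."""
    ranked = sorted(preds, key=preds.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    return len(a & b) / len(a | b)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def spearman(p, q):
    """Spearman correlation between two prediction dicts over the same links."""
    ids = sorted(p)
    x = average_ranks([p[i] for i in ids])
    y = average_ranks([q[i] for i in ids])
    n = len(ids)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def pairwise_summary(preds_by_seed, metric):
    """Mean and minimum of a metric over all unordered seed pairs."""
    scores = [metric(p, q) for p, q in combinations(preds_by_seed.values(), 2)]
    return sum(scores) / len(scores), min(scores)


# Toy predictions for three seeds (hypothetical values).
preds_by_seed = {
    42: {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.20, "e": 0.10},
    43: {"a": 0.85, "b": 0.75, "c": 0.30, "d": 0.72, "e": 0.10},
    44: {"a": 0.88, "b": 0.79, "c": 0.71, "d": 0.25, "e": 0.12},
}
k = 3
jac_mean, jac_min = pairwise_summary(
    preds_by_seed, lambda p, q: jaccard(top_k_set(p, k), top_k_set(q, k))
)
rho_mean, rho_min = pairwise_summary(preds_by_seed, spearman)
```

In the toy data, seed 43 swaps links `c` and `d` around the top-3 cut-off, so the top-3 Jaccard against the other seeds drops to 0.5 while the full-rank Spearman correlation stays at 0.9. This is the same pattern, in miniature, as stable broad ordering with churn at a narrow threshold.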

The follow-up Jaccard investigation used the saved per-seed predictions to inspect why top-k overlap was non-monotonic in k: the pairwise mean top-1000 overlap (0.871) was lower than both the top-100 (0.878) and top-10000 (0.883) overlap.

Result

Pseudo-R² was stable but not identical across seeds:

Seed    Pseudo-R²
42      0.321444
43      0.321372
44      0.326320
45      0.326529
46      0.321825
Mean    0.323498
Std     0.002678
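The summary rows can be reproduced from the five per-seed values. The sketch below uses only the figures reported in the table; the variable names are illustrative. The reported standard deviation matches the sample standard deviation (n − 1 denominator), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Per-seed held-out pseudo-R² values copied from the table above.
pseudo_r2 = {
    42: 0.321444,
    43: 0.321372,
    44: 0.326320,
    45: 0.326529,
    46: 0.321825,
}

scores = list(pseudo_r2.values())
r2_mean = mean(scores)
r2_std = stdev(scores)  # sample standard deviation (n - 1 denominator)

print(f"mean = {r2_mean:.6f}, std = {r2_std:.6f}")
```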

Ranking overlap was high at full-rank scale but less stable at narrow top-k thresholds:

Threshold     Pairwise mean Jaccard    Pairwise minimum Jaccard
Top 100       0.878141                 0.851852
Top 1,000     0.870936                 0.858736
Top 10,000    0.883074                 0.873185
Top 1%        0.903575                 0.896574

Full-rank Spearman correlation was extremely high (pairwise mean 0.999140, minimum 0.999069), which means the broad ordering is stable even though exact membership near narrow cut-offs can churn.

Seed 42 is representative: its pseudo-R² is within one cross-seed standard deviation of the mean, its mean top-1% Jaccard against the other seeds is 0.902825, and its mean Spearman correlation against the other seeds is 0.999130.

The practical noise floor is therefore:

A proposed feature or model change should improve pseudo-R² by roughly 0.006 or more (about 2.2× the cross-seed standard deviation of 0.002678, or 2× its rounded value of 0.003) before it is treated as distinguishable from random-seed variation.
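As a rough illustration, the rule can be written as a small check on the mean pseudo-R² gain. This is a sketch, not the project's tooling: the function name `clears_noise_floor` and both candidate score lists are hypothetical, and only the 0.006 threshold and the baseline scores come from this page.

```python
from statistics import mean

NOISE_FLOOR = 0.006  # pseudo-R² gain threshold stated in the text


def clears_noise_floor(baseline_scores, candidate_scores, floor=NOISE_FLOOR):
    """Return (gain, verdict) for the mean pseudo-R² gain across seeds."""
    gain = mean(candidate_scores) - mean(baseline_scores)
    return gain, gain >= floor


# Baseline: the five per-seed pseudo-R² values reported on this page.
baseline = [0.321444, 0.321372, 0.326320, 0.326529, 0.321825]

# Hypothetical candidates, invented purely for illustration.
candidate_small = [s + 0.002 for s in baseline]
candidate_large = [s + 0.008 for s in baseline]

small_gain, small_ok = clears_noise_floor(baseline, candidate_small)
large_gain, large_ok = clears_noise_floor(baseline, candidate_large)
```

Comparing means over the same seed set, rather than single-seed scores, keeps one lucky or unlucky seed from deciding the outcome.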

The Jaccard follow-up showed that the lower top-1000 overlap is not simply a local predicted-risk spacing issue. Instead, the useful interpretation is operational: narrow top-k thresholds, especially around top-1000, should be treated as fuzzy frontiers. Links close to the threshold may swap in or out across equally valid seed realisations.
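One way to make the fuzzy frontier concrete is to count how many seeds place each link in the top k. Links selected by every seed form a stable core, while links selected by only some seeds sit on the frontier. The sketch below is illustrative only: the function names, the `{link_id: predicted_risk}` layout, and the toy values are assumptions, not the investigation's code.

```python
from collections import Counter


def top_k_membership(preds_by_seed, k):
    """Count, for each link, how many seeds rank it inside the top k."""
    counts = Counter()
    for preds in preds_by_seed.values():
        ranked = sorted(preds, key=preds.get, reverse=True)
        counts.update(ranked[:k])
    return counts


def split_core_and_frontier(preds_by_seed, k):
    """Core: in the top k for every seed. Frontier: in the top k for only some."""
    n_seeds = len(preds_by_seed)
    counts = top_k_membership(preds_by_seed, k)
    core = {link for link, c in counts.items() if c == n_seeds}
    frontier = {link for link, c in counts.items() if c < n_seeds}
    return core, frontier


# Toy predictions for three seeds (hypothetical values).
preds_by_seed = {
    42: {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.20, "e": 0.10},
    43: {"a": 0.85, "b": 0.75, "c": 0.30, "d": 0.72, "e": 0.10},
    44: {"a": 0.88, "b": 0.79, "c": 0.71, "d": 0.25, "e": 0.12},
}

core, frontier = split_core_and_frontier(preds_by_seed, k=3)
print("core:", sorted(core), "frontier:", sorted(frontier))
```

Reporting frontier links with their membership count, rather than a single hard top-k list, is one way to communicate that a link near the cut-off is selected under some seed realisations and not others.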

Limitations

  • The diagnostic measures random-seed variation only. It does not measure uncertainty from feature engineering choices, spatial split design, AADT uncertainty, or year-to-year temporal instability.
  • The same grouped-link cross-validation design is used for every seed. Any optimism from spatial autocorrelation is shared across seeds and is not diagnosed here.
  • The top-k Jaccard values are sensitive to the chosen thresholds. The exact top-100 or top-1000 membership should not be interpreted as deterministic.
  • The noise floor is calibrated for the current Stage 2 XGBoost setup. A major model-family change would need its own stability check.

Related artefacts

  • reports/rank_stability.md
  • reports/rank_stability_investigation.md
  • data/provenance/rank_stability_provenance.json
