Model Inventory
Date: May 2026
Status: Refreshed against the current post-fix Stage 2 artefacts, including the completed temporal-ablation run.
Canonical metrics source: data/models/collision_metrics.json
1 Stage 2 — Collision Risk Model (src/road_risk/model/collision.py)
1.1 Training data
| Item | Value | Source |
|---|---|---|
| Link-year modelling table | 21,675,570 rows | xgb.n_train + xgb.n_test |
| GLM complete-case rows before downsampling | 18,302,830 rows | glm.n_full |
| GLM training rows (after downsampling) | 3,967,414 | glm.n_obs |
| GLM positive rows (collision > 0) | 360,674 | glm.n_pos |
| XGBoost training rows | 17,340,450 | xgb.n_train |
| XGBoost test rows | 4,335,120 | xgb.n_test |
Downsampling: The GLM first keeps complete-case rows for its feature set, then downsamples zero-collision rows to 10× the positives (3,606,740 zeros for 360,674 positives, ≈ 91% zeros vs ≈ 98% in the full table) to keep the statsmodels design matrix tractable. XGBoost is not downsampled: it uses the full ~21.7M-row table, split into the train and test rows above, with missing collision counts filled with 0.
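A minimal sketch of the zero-row downsampling described above, assuming a pandas table with a count column. The function name downsample_zeros, the column name collision_count as the target, and the seed are illustrative assumptions, not taken from collision.py:

```python
import numpy as np
import pandas as pd


def downsample_zeros(df: pd.DataFrame, target: str = "collision_count",
                     ratio: int = 10, seed: int = 42) -> pd.DataFrame:
    """Keep every positive row and at most `ratio` zero rows per positive."""
    positives = df[df[target] > 0]
    zeros = df[df[target] == 0]
    n_keep = min(len(zeros), ratio * len(positives))
    kept_zeros = zeros.sample(n=n_keep, random_state=seed)
    # Restore the original row order after concatenating the two parts.
    return pd.concat([positives, kept_zeros]).sort_index()


# Toy table with roughly 2% positives, mirroring the ~98% zero share above.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"collision_count": (rng.random(50_000) < 0.02).astype(int)})
sample = downsample_zeros(toy)
zero_share = (sample["collision_count"] == 0).mean()  # 10/11 when enough zeros exist
```

With a 10:1 ratio the sample holds 11 rows per positive, which matches the reported counts: 11 × 360,674 = 3,967,414.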
1.2 GLM — Poisson with log-offset
Family / link: Poisson, log link (statsmodels sm.families.Poisson()).
Regularisation: None. Standard MLE.
Offset: log(AADT × link_length_km × 365 / 1e6) — forces the exposure coefficient to 1.
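Written out, the offset enters the linear predictor with its coefficient fixed at 1 rather than estimated. The symbols below are notation introduced here for illustration, not taken from the source: μ_i is the expected collision count for link-year i, L_i is link_length_km, E_i is exposure, and x_i, β are the features and coefficients.

```latex
E_i = \frac{\mathrm{AADT}_i \times L_i \times 365}{10^{6}}, \qquad
\log \mu_i = \log E_i + \mathbf{x}_i^{\top}\boldsymbol{\beta}
\;\Longleftrightarrow\;
\frac{\mu_i}{E_i} = \exp\!\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}\right)
```

With AADT in vehicles per day and length in km, E_i is in million vehicle-km per year, so the coefficients describe a collision rate per unit of exposure rather than a raw count.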
Features (from trained artefact — collision_metrics.json → glm.features):
| # | Feature | Category |
|---|---|---|
| 1 | road_class_ord | Road structure |
| 2 | form_of_way_ord | Road structure |
| 3 | is_motorway | Binary flag |
| 4 | is_a_road | Binary flag |
| 5 | is_slip_road | Binary flag |
| 6 | is_roundabout | Binary flag |
| 7 | is_dual | Binary flag |
| 8 | is_trunk | Binary flag |
| 9 | is_primary | Binary flag |
| 10 | log_link_length | Geometry |
| 11 | is_covid | Temporal |
| 12 | year_norm | Temporal |
| 13 | degree_mean | Network |
| 14 | betweenness | Network |
| 15 | betweenness_relative | Network |
| 16 | dist_to_major_km | Network |
| 17 | pop_density_per_km2 | Network |
| 18 | speed_limit_mph_effective | Speed limit |
| 19 | lanes_imputed | OSM, imputed |
| 20 | is_unpaved_imputed | OSM, imputed |
Not in current trained GLM: hgv_proportion and lit. The current network_features.parquet is OSM-enriched: speed_limit_mph_effective is the modelled speed-limit feature, while raw speed_limit_mph is retained only as provenance. Lower-coverage lanes and is_unpaved enter as median-imputed GLM features.
Metrics:
| Metric | Value |
|---|---|
| Pseudo-R² | 0.3472 (in-sample on downsampled training set) |
| Deviance | 1,423,147 |
| Null deviance | 2,180,048 |
| AIC | 2,237,488 |
| Converged | Yes |
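The reported pseudo-R² can be reproduced from the two deviance rows, assuming it is the deviance-based form (1 minus deviance over null deviance). The source does not state the formula, but the arithmetic agrees:

```python
# Values copied from the metrics table above.
deviance = 1_423_147
null_deviance = 2_180_048

# Deviance-based pseudo-R²: share of the null deviance explained by the model.
pseudo_r2 = 1 - deviance / null_deviance
print(round(pseudo_r2, 4))  # 0.3472
```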
1.3 XGBoost — Poisson with base_margin offset
Hyperparameters (hardcoded in train_collision_xgb, lines 322–328):
| Parameter | Value |
|---|---|
| objective | count:poisson |
| n_estimators | 500 |
| max_depth | 6 |
| learning_rate | 0.05 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| random_state | module constant RANDOM_STATE |
| n_jobs | -1 |
Regularisation: None explicitly set (reg_alpha, reg_lambda take XGBoost defaults: reg_alpha=0, reg_lambda=1).
Validation: GroupShuffleSplit(n_splits=1, test_size=0.2) grouped by link_id — all years for a link stay in one fold.
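A small sketch of what grouping by link_id guarantees, using scikit-learn's GroupShuffleSplit on a toy link-year table. The toy sizes and the random_state value are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy link-year table: 100 links x 10 years, one row per (link, year).
link_id = np.repeat(np.arange(100), 10)
X = np.zeros((link_id.size, 1))  # placeholder features

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=link_id))

# No link appears on both sides, so a link's years never straddle the split.
train_links = set(link_id[train_idx])
test_links = set(link_id[test_idx])
overlap = train_links & test_links  # empty set
```

The reported row counts are consistent with a link-level split, assuming every link contributes 10 link-years (21,675,570 rows / 2,167,557 links): 433,512 test links × 10 = 4,335,120 test rows.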
Offset: passed as base_margin=log_offset so the model learns log-rate given exposure, not absolute count.
Features (from trained artefact — collision_metrics.json → xgb.features):
| # | Feature | Category | vs GLM |
|---|---|---|---|
| 1 | road_class_ord | Road structure | same |
| 2 | form_of_way_ord | Road structure | same |
| 3–9 | is_motorway … is_primary | Binary flags | same |
| 10 | log_link_length | Geometry | same |
| 11 | estimated_aadt | Exposure | XGBoost only |
| 12 | is_covid | Temporal | same |
| 13 | year_norm | Temporal | same |
| 14 | hgv_proportion | Traffic | XGBoost only |
| 15 | degree_mean | Network | same |
| 16 | betweenness | Network | same |
| 17 | betweenness_relative | Network | same |
| 18 | dist_to_major_km | Network | same |
| 19 | pop_density_per_km2 | Network | same |
| 20 | speed_limit_mph_effective | Speed limit | same |
| 21 | lanes | OSM | raw in XGBoost |
| 22 | is_unpaved | OSM | raw in XGBoost |
XGBoost receives estimated_aadt as a raw feature in addition to the log-offset, so it can learn non-linear interactions with exposure that the GLM offset fixes at a coefficient of 1. hgv_proportion is included because the XGBoost feature check is simply if col in df.columns, with no coverage-percentage threshold, and the column was present at training time. The current run also includes speed_limit_mph_effective, lanes, and is_unpaved. Raw speed_limit_mph is retained as provenance but is not in the trained feature list. lit is present in network_features.parquet but is likewise not in the trained feature list.
Metrics:
| Metric | Value |
|---|---|
| Pseudo-R² | 0.3235 mean across 5 post-fix seeds with temporal features included (range 0.3214–0.3265) |
| Test deviance | 497,289 mean across 5 post-fix seeds |
Comparability caveat: GLM pseudo-R² is in-sample on a downsampled set (~91% zeros); XGBoost is out-of-sample on the true distribution (~98% zeros). The gap should not be read as a clean model horse race — the two metrics are not computed on a common evaluation set or against a common null model. Earlier docs cited XGBoost pseudo-R² around 0.86, but that number came from a pre-fix evaluation surface that was later superseded after a Stage 2 leakage diagnosis. For current project positioning, use the post-fix ~0.32 baseline instead.
1.4 Output
data/models/risk_scores.parquet — one row per link. Key columns: predicted_xgb (mean collisions/year), predicted_glm, residual_glm, risk_percentile (XGBoost rank × 100 / n_links), collision_count, estimated_aadt, hgv_proportion, speed_limit_mph_effective, raw speed_limit_mph, and betweenness_relative. Post-event diagnostic columns such as pct_dark, pct_urban, pct_junction, pct_near_crossing, and mean_speed_limit are excluded from the output contract.
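The risk_percentile formula (rank × 100 / n_links) can be sketched with pandas. The rank direction (highest predicted risk gets 100) and the tie-breaking method are assumptions, since the source gives only the formula, and the toy values are invented:

```python
import pandas as pd

scores = pd.DataFrame({
    "link_id": ["a", "b", "c", "d", "e"],
    "predicted_xgb": [0.02, 0.50, 0.10, 0.01, 0.25],
})

# Rank 1 = lowest predicted risk, rank n = highest (assumed direction).
rank = scores["predicted_xgb"].rank(method="first", ascending=True)
scores["risk_percentile"] = rank * 100 / len(scores)
```

Under this convention the riskiest link ("b") scores 100 and the least risky ("d") scores 20.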
The effective-speed retrain retained 2,167,557 scored links and 21,676 top-1% links. Compared with the pre-effective-speed risk_scores.parquet, Spearman rank correlation across all links was 0.9962 and top-1% Jaccard overlap was 0.9512.
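The two stability measures quoted above can be computed along the following lines. This is a sketch only: the helper names top_share and compare_runs are hypothetical, and it assumes two score series indexed by link_id, with the top-1% set size rounded up (which reproduces 21,676 of 2,167,557 links).

```python
import math
import pandas as pd


def top_share(scores: pd.Series, percent: int = 1) -> set:
    """Index labels of the highest-scoring `percent`% of links, rounded up."""
    k = math.ceil(len(scores) * percent / 100)
    return set(scores.nlargest(k).index)


def compare_runs(old: pd.Series, new: pd.Series) -> tuple[float, float]:
    """Spearman rank correlation and top-1% Jaccard overlap for two runs."""
    old, new = old.align(new, join="inner")   # compare shared links only
    spearman = old.rank().corr(new.rank())    # Pearson on ranks = Spearman
    a, b = top_share(old), top_share(new)
    jaccard = len(a & b) / len(a | b)
    return float(spearman), jaccard
```

Usage would look like compare_runs(old_scores["predicted_xgb"], new_scores["predicted_xgb"]) after setting link_id as the index on both tables. Those frame names are illustrative.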