Open Road Risk Model Inventory

Inventory of current Open Road Risk model stages, artefacts, features, validation metrics, and output files as of the latest refresh.

Date: May 2026
Status: Refreshed against the current post-fix Stage 2 artefacts, including the completed temporal-ablation run.
Canonical metrics source: data/models/collision_metrics.json

1 Stage 2 — Collision Risk Model (`src/road_risk/model/collision.py`)

1.1 Training data

Item	Value	Source
Link-year modelling table	21,675,570 rows	`xgb.n_train + xgb.n_test`
GLM complete-case rows before downsampling	18,302,830 rows	`glm.n_full`
GLM training rows (after downsampling)	3,967,414	`glm.n_obs`
GLM positive rows (collision > 0)	360,674	`glm.n_pos`
XGBoost training rows	17,340,450	`xgb.n_train`
XGBoost test rows	4,335,120	`xgb.n_test`

Downsampling: The GLM first keeps complete-case rows for its feature set, then downsamples zero-collision rows to 10× the number of positive rows (≈ 91% zeros, versus ≈ 98% in the full table) to keep the statsmodels design matrix tractable. XGBoost uses the full ~21.7M-row table, split into the train and test rows listed above, with missing collision counts filled with 0.
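The downsampling rule can be sketched as below. This is a minimal illustration rather than the code in `collision.py`: the function name `downsample_zeros`, the use of `collision_count` as the target column, the fixed seed, and the toy table are all assumptions.

```python
import numpy as np
import pandas as pd


def downsample_zeros(df, features, target="collision_count", ratio=10, seed=0):
    """Keep complete-case rows, then cap zero rows at `ratio` x positives."""
    # Complete-case filter: drop rows with a missing value in any feature.
    complete = df.dropna(subset=list(features))
    positives = complete[complete[target] > 0]
    zeros = complete[complete[target] == 0]
    # Sample zeros without replacement, never asking for more than exist.
    n_zeros = min(len(zeros), ratio * len(positives))
    sampled = zeros.sample(n=n_zeros, random_state=seed)
    return pd.concat([positives, sampled]).sort_index()


# Toy table (hypothetical): 1,000 link-years, roughly 2% positive,
# with one feature that has some missing values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "collision_count": (rng.random(1000) < 0.02).astype(int),
    "degree_mean": rng.random(1000),
})
toy.loc[::50, "degree_mean"] = np.nan

train = downsample_zeros(toy, ["degree_mean"])
print(len(train), (train["collision_count"] == 0).mean())
```

With the counts in the table above, 360,674 positives plus 10 × 360,674 zeros gives 3,967,414 rows, which matches `glm.n_obs`.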

1.2 GLM — Poisson with log-offset

Family / link: Poisson, log link (statsmodels sm.families.Poisson()).
Regularisation: None. Standard MLE.
Offset: log(AADT × link_length_km × 365 / 1e6) — forces the exposure coefficient to 1.
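As a rough sketch of what the offset does, the snippet below computes the log-exposure term and shows that doubling AADT doubles the expected count when everything else is held fixed. The helper names, the example link, and the linear-predictor value of -1.0 are hypothetical and are not taken from `collision.py`.

```python
import numpy as np


def log_exposure_offset(aadt, link_length_km):
    """Log of million vehicle-km per year: AADT x length_km x 365 / 1e6."""
    aadt = np.asarray(aadt, dtype=float)
    link_length_km = np.asarray(link_length_km, dtype=float)
    return np.log(aadt * link_length_km * 365.0 / 1e6)


def expected_collisions(linear_predictor, log_offset):
    """Poisson mean under a log link: exp(X @ beta + offset)."""
    return np.exp(np.asarray(linear_predictor) + np.asarray(log_offset))


# Hypothetical link: 10,000 vehicles/day over 0.5 km.
offset = log_exposure_offset(10_000, 0.5)
base = expected_collisions(-1.0, offset)

# Same link with twice the traffic. The offset enters with a fixed
# coefficient of 1, so the expected count scales linearly with exposure.
doubled = expected_collisions(-1.0, log_exposure_offset(20_000, 0.5))
print(float(np.exp(offset)), float(doubled / base))
```

The log is undefined for links with zero AADT or zero length. How the pipeline treats such rows is not covered here.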

Features (from trained artefact — collision_metrics.json → glm.features):

#	Feature	Category
1	`road_class_ord`	Road structure
2	`form_of_way_ord`	Road structure
3	`is_motorway`	Binary flag
4	`is_a_road`	Binary flag
5	`is_slip_road`	Binary flag
6	`is_roundabout`	Binary flag
7	`is_dual`	Binary flag
8	`is_trunk`	Binary flag
9	`is_primary`	Binary flag
10	`log_link_length`	Geometry
11	`is_covid`	Temporal
12	`year_norm`	Temporal
13	`degree_mean`	Network
14	`betweenness`	Network
15	`betweenness_relative`	Network
16	`dist_to_major_km`	Network
17	`pop_density_per_km2`	Network
18	`speed_limit_mph_effective`	Speed limit
19	`lanes_imputed`	OSM, imputed
20	`is_unpaved_imputed`	OSM, imputed

Not in the current trained GLM: `hgv_proportion` and `lit`. The current `network_features.parquet` is OSM-enriched: `speed_limit_mph_effective` is the modelled speed-limit feature, while raw `speed_limit_mph` is retained only as provenance. The lower-coverage `lanes` and `is_unpaved` columns enter the GLM as the median-imputed features `lanes_imputed` and `is_unpaved_imputed`.
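Median imputation of the two OSM columns might look like the sketch below. The helper name `add_median_imputed` is hypothetical. Taking the median over whatever frame is passed in is also an assumption, since this inventory does not say which rows the pipeline uses to compute it.

```python
import numpy as np
import pandas as pd


def add_median_imputed(df, cols):
    """Add `<col>_imputed` columns, filling gaps with the column median."""
    out = df.copy()
    for col in cols:
        # Series.median() skips NaN by default.
        out[f"{col}_imputed"] = out[col].fillna(out[col].median())
    return out


# Toy frame (hypothetical values) with gaps in both columns.
toy = pd.DataFrame({
    "lanes": [1.0, 2.0, np.nan, 2.0, np.nan],
    "is_unpaved": [0.0, np.nan, 0.0, 1.0, 0.0],
})
imputed = add_median_imputed(toy, ["lanes", "is_unpaved"])
print(imputed[["lanes_imputed", "is_unpaved_imputed"]])
```

For a binary flag such as `is_unpaved`, the median is simply the majority class when one value holds more than half the observed rows.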

Metrics:

Metric	Value
Pseudo-R²	0.3472 (in-sample on downsampled training set)
Deviance	1,423,147
Null deviance	2,180,048
AIC	2,237,488
Converged	Yes
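The reported pseudo-R² is consistent with the deviance-based definition, 1 − deviance / null deviance. That this is the definition the pipeline uses is an assumption, and the function name below is hypothetical. The check is only arithmetic on the two deviance values in the table.

```python
def deviance_pseudo_r2(deviance, null_deviance):
    """Share of null deviance explained by the fitted model."""
    return 1.0 - deviance / null_deviance


# Values copied from the GLM metrics table above.
glm_r2 = deviance_pseudo_r2(1_423_147, 2_180_048)
print(round(glm_r2, 4))
```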

1.3 XGBoost — Poisson with base_margin offset

Hyperparameters (hardcoded in train_collision_xgb, lines 322–328):

Parameter	Value
`objective`	`count:poisson`
`n_estimators`	500
`max_depth`	6
`learning_rate`	0.05
`subsample`	0.8
`colsample_bytree`	0.8
`random_state`	module constant `RANDOM_STATE`
`n_jobs`	-1

Regularisation: None explicitly set (reg_alpha, reg_lambda take XGBoost defaults: reg_alpha=0, reg_lambda=1).
Validation: GroupShuffleSplit(n_splits=1, test_size=0.2) grouped by link_id — all years for a link stay in one fold.
Offset: passed as base_margin=log_offset so the model learns log-rate given exposure, not absolute count.
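The grouped split can be illustrated on a toy link-year table. This sketch assumes scikit-learn is available. The toy sizes and `random_state=42` are placeholders, because the actual seed is the module constant `RANDOM_STATE`, whose value is not given here.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy link-year table: 50 links x 10 years, one row per link-year.
n_links, n_years = 50, 10
link_id = np.repeat(np.arange(n_links), n_years)
X = np.zeros((link_id.size, 1))  # placeholder feature matrix

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=link_id))

# Every link lands entirely in train or entirely in test.
train_links = set(link_id[train_idx].tolist())
test_links = set(link_id[test_idx].tolist())
print(len(train_links), len(test_links), train_links & test_links)
```

Because `test_size` applies to groups rather than rows, the row-level split is only approximately 80/20 when links contribute different numbers of rows.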

Features (from trained artefact — collision_metrics.json → xgb.features):

#	Feature	Category	vs GLM
1	`road_class_ord`	Road structure	same
2	`form_of_way_ord`	Road structure	same
3–9	`is_motorway` … `is_primary`	Binary flags	same
10	`log_link_length`	Geometry	same
11	`estimated_aadt`	Exposure	XGBoost only
12	`is_covid`	Temporal	same
13	`year_norm`	Temporal	same
14	`hgv_proportion`	Traffic	XGBoost only
15	`degree_mean`	Network	same
16	`betweenness`	Network	same
17	`betweenness_relative`	Network	same
18	`dist_to_major_km`	Network	same
19	`pop_density_per_km2`	Network	same
20	`speed_limit_mph_effective`	Speed limit	same
21	`lanes`	OSM	raw in XGBoost
22	`is_unpaved`	OSM	raw in XGBoost

XGBoost receives `estimated_aadt` as a raw feature in addition to the log-offset, so it can learn non-linear interactions with exposure that the offset alone fixes to a unit coefficient in the GLM. `hgv_proportion` is included because the XGBoost feature check is simply `if col in df.columns`, with no coverage-percentage threshold, and the column was present at training time. The current XGBoost run also includes the effective speed limit, `lanes`, and `is_unpaved` features. Raw `speed_limit_mph` is retained as provenance but is not in the trained feature list. `lit` is present in `network_features.parquet` but is not currently in the trained feature list.
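The difference between a presence-only check and a coverage gate can be sketched as follows. The function `select_features`, its `min_coverage` parameter, and the toy data are hypothetical. Only the `if col in df.columns` rule comes from the description above.

```python
import numpy as np
import pandas as pd


def select_features(df, candidates, min_coverage=None):
    """Return candidates present in df, optionally gated on non-null share."""
    selected = []
    for col in candidates:
        if col not in df.columns:
            continue  # presence check: skip columns that are absent
        if min_coverage is not None and df[col].notna().mean() < min_coverage:
            continue  # hypothetical coverage gate
        selected.append(col)
    return selected


# Toy frame: hgv_proportion exists but is only 25% populated; lit is absent.
toy = pd.DataFrame({
    "estimated_aadt": [1200.0, 800.0, 15000.0, 300.0],
    "hgv_proportion": [0.05, np.nan, np.nan, np.nan],
})
candidates = ["estimated_aadt", "hgv_proportion", "lit"]

presence_only = select_features(toy, candidates)
with_gate = select_features(toy, candidates, min_coverage=0.5)
print(presence_only, with_gate)
```

The second call shows what a coverage-percentage gate would do to a sparse column. Whether any other stage applies such a gate is not stated in this inventory.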

Metrics:

Metric	Value
Pseudo-R²	0.3235 mean across 5 post-fix seeds with temporal features included (range 0.3214–0.3265)
Test deviance	497,289 mean across 5 post-fix seeds

Comparability caveat: GLM pseudo-R² is in-sample on a downsampled set (~91% zeros); XGBoost pseudo-R² is out-of-sample on the true distribution (~98% zeros). The gap should not be read as a like-for-like model comparison, because the two metrics are not computed on a common evaluation set or against a common null model. Earlier docs cited an XGBoost pseudo-R² of around 0.86, but that number came from a pre-fix evaluation surface that was superseded after a Stage 2 leakage diagnosis. For current project positioning, use the post-fix ~0.32 baseline instead.

1.4 Output

`data/models/risk_scores.parquet`: one row per link. Key columns: `predicted_xgb` (mean collisions/year), `predicted_glm`, `residual_glm`, `risk_percentile` (XGBoost rank × 100 / n_links), `collision_count`, `estimated_aadt`, `hgv_proportion`, `speed_limit_mph_effective`, raw `speed_limit_mph`, and `betweenness_relative`. Post-event diagnostic columns such as `pct_dark`, `pct_urban`, `pct_junction`, `pct_near_crossing`, and `mean_speed_limit` are excluded from the output contract.
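The `risk_percentile` formula (rank × 100 / n_links) can be sketched with pandas. The ascending rank direction (1 = lowest predicted risk) and the `average` tie method are assumptions, as is the function name.

```python
import pandas as pd


def risk_percentile(predicted):
    """Rank x 100 / n_links, with rank 1 assigned to the lowest prediction."""
    scores = pd.Series(predicted, dtype=float)
    rank = scores.rank(method="average")  # assumed tie handling
    return rank * 100.0 / len(scores)


# Four hypothetical links with distinct predicted collisions per year.
pct = risk_percentile([0.1, 0.4, 0.2, 0.3])
print(pct.tolist())
```

Under this convention the highest-risk link always scores 100, and the lowest scores 100 / n_links rather than 0.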

The effective-speed retrain retained 2,167,557 scored links and 21,676 top-1% links. Compared with the pre-effective-speed risk_scores.parquet, Spearman rank correlation across all links was 0.9962 and top-1% Jaccard overlap was 0.9512.
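A comparison of this kind can be sketched as below. The helper names are hypothetical, Spearman is computed as Pearson correlation on ranks, and taking the top 1% as `ceil(0.01 × n_links)` is an assumption, although it does reproduce the 21,676 figure from 2,167,557 links.

```python
import math

import numpy as np
import pandas as pd


def top_share_ids(scores, share=0.01):
    """Index labels of the highest-scoring `share` of links (rounded up)."""
    k = math.ceil(share * len(scores))
    return set(scores.nlargest(k).index)


def compare_rankings(old, new, share=0.01):
    """Spearman rank correlation and top-share Jaccard overlap."""
    old, new = old.align(new, join="inner")  # compare shared links only
    spearman = old.rank().corr(new.rank())  # Pearson on ranks
    a, b = top_share_ids(old, share), top_share_ids(new, share)
    return spearman, len(a & b) / len(a | b)


# Toy scores for 250 hypothetical links; the new run swaps two neighbours
# that straddle the top-1% boundary (top 3 of 250).
old = pd.Series(np.arange(250, dtype=float))
new = old.copy()
new.loc[247], new.loc[246] = 246.0, 247.0

spearman, jaccard = compare_rankings(old, new)
print(spearman, jaccard)
```

In the toy case a single swap at the boundary halves the Jaccard overlap while barely moving Spearman, which is why the two figures are reported together.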
