Open Road Risk

Model Inventory

Date: May 2026
Status: Refreshed against the current post-fix Stage 2 artefacts, including the completed temporal-ablation run.
Canonical metrics source: data/models/collision_metrics.json


1 Stage 2 — Collision Risk Model (src/road_risk/model/collision.py)

1.1 Training data

| Item | Value (rows) | Source |
|---|---|---|
| Link-year modelling table | 21,675,570 | xgb.n_train + xgb.n_test |
| GLM complete-case rows before downsampling | 18,302,830 | glm.n_full |
| GLM training rows (after downsampling) | 3,967,414 | glm.n_obs |
| GLM positive rows (collision > 0) | 360,674 | glm.n_pos |
| XGBoost training rows | 17,340,450 | xgb.n_train |
| XGBoost test rows | 4,335,120 | xgb.n_test |

Downsampling: The GLM first keeps complete-case rows for its feature set, then downsamples zero-collision rows to 10× the number of positive rows (≈ 91% zeros, versus 98% in the full table) to keep the statsmodels design matrix tractable. XGBoost uses the full ~21.7M-row table, split into the train and test rows listed above, with zero-collision link-years filled as 0.
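The downsampling rule can be illustrated with a minimal sketch. It assumes rows are plain dictionaries with a `collision_count` field; the function name `downsample_zeros` and the seed are hypothetical and are not taken from the project code. The final check uses only the row counts reported in the table above.

```python
import random

def downsample_zeros(rows, ratio=10, seed=0):
    """Keep every positive row and at most `ratio` zero rows per positive row."""
    positives = [r for r in rows if r["collision_count"] > 0]
    zeros = [r for r in rows if r["collision_count"] == 0]
    n_keep = min(len(zeros), ratio * len(positives))
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return positives + rng.sample(zeros, n_keep)

# Toy table: 20 positive link-years among 1,000 (98% zeros, like the full table).
rows = [{"link_id": i, "collision_count": 1 if i < 20 else 0} for i in range(1000)]
sample = downsample_zeros(rows)
zero_share = sum(r["collision_count"] == 0 for r in sample) / len(sample)
print(len(sample), round(zero_share, 3))  # 220 0.909

# The reported counts follow the same rule: positives plus 10x positives.
n_pos, n_obs = 360_674, 3_967_414
print(n_obs == 11 * n_pos)  # True
```

A 10:1 ratio always gives a zero share of 10/11 ≈ 0.909, which matches the ≈ 91% quoted above.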

1.2 GLM — Poisson with log-offset

Family / link: Poisson, log link (statsmodels sm.families.Poisson()).
Regularisation: None. Standard MLE.
Offset: log(AADT × link_length_km × 365 / 1e6) — forces the exposure coefficient to 1.
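As a worked example of the offset, the sketch below evaluates the formula for one hypothetical link and shows why a coefficient fixed at 1 turns the rest of the linear predictor into a rate per million vehicle-km. The link values and the 0.2 rate are made up for illustration; only the offset formula comes from this page.

```python
import math

def log_offset(aadt, link_length_km):
    """Log of annual exposure in million vehicle-km for one link-year."""
    return math.log(aadt * link_length_km * 365 / 1e6)

# Hypothetical link: 10,000 vehicles/day over 0.5 km.
off = log_offset(10_000, 0.5)
exposure = math.exp(off)  # 1.825 million vehicle-km per year

# With the offset coefficient fixed at 1, log(mu) = offset + x'beta,
# so exp(x'beta) is a collision rate per million vehicle-km.
linear_part = math.log(0.2)  # hypothetical x'beta
expected_count = math.exp(off + linear_part)
print(round(exposure, 3), round(expected_count, 3))  # 1.825 0.365
```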

Features (from trained artefact — collision_metrics.json → glm.features):

| # | Feature | Category |
|---|---|---|
| 1 | road_class_ord | Road structure |
| 2 | form_of_way_ord | Road structure |
| 3 | is_motorway | Binary flag |
| 4 | is_a_road | Binary flag |
| 5 | is_slip_road | Binary flag |
| 6 | is_roundabout | Binary flag |
| 7 | is_dual | Binary flag |
| 8 | is_trunk | Binary flag |
| 9 | is_primary | Binary flag |
| 10 | log_link_length | Geometry |
| 11 | is_covid | Temporal |
| 12 | year_norm | Temporal |
| 13 | degree_mean | Network |
| 14 | betweenness | Network |
| 15 | betweenness_relative | Network |
| 16 | dist_to_major_km | Network |
| 17 | pop_density_per_km2 | Network |
| 18 | speed_limit_mph_effective | Speed limit |
| 19 | lanes_imputed | OSM, imputed |
| 20 | is_unpaved_imputed | OSM, imputed |

Not in the current trained GLM: hgv_proportion and lit. The current network_features.parquet is OSM-enriched: speed_limit_mph_effective is the modelled speed-limit feature, while raw speed_limit_mph is retained only as provenance. The lower-coverage lanes and is_unpaved columns enter the GLM as median-imputed features (lanes_imputed, is_unpaved_imputed).

Metrics:

| Metric | Value |
|---|---|
| Pseudo-R² | 0.3472 (in-sample on downsampled training set) |
| Deviance | 1,423,147 |
| Null deviance | 2,180,048 |
| AIC | 2,237,488 |
| Converged | Yes |
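The reported pseudo-R² is consistent with the deviance-based definition, 1 − deviance / null deviance. The page does not state which definition the code uses, so treat that as an assumption; the arithmetic below uses only the two deviance figures from the table.

```python
deviance = 1_423_147
null_deviance = 2_180_048

# Deviance-based pseudo-R²: share of null deviance explained by the model.
pseudo_r2 = 1 - deviance / null_deviance
print(round(pseudo_r2, 4))  # 0.3472
```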

1.3 XGBoost — Poisson with base_margin offset

Hyperparameters (hardcoded in train_collision_xgb, lines 322–328):

| Parameter | Value |
|---|---|
| objective | count:poisson |
| n_estimators | 500 |
| max_depth | 6 |
| learning_rate | 0.05 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| random_state | module constant RANDOM_STATE |
| n_jobs | -1 |

Regularisation: None explicitly set (reg_alpha, reg_lambda take XGBoost defaults: reg_alpha=0, reg_lambda=1).
Validation: GroupShuffleSplit(n_splits=1, test_size=0.2) grouped by link_id, so all years for a given link fall on the same side of the train/test split.
Offset: passed as base_margin=log_offset so the model learns log-rate given exposure, not absolute count.
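The grouped validation split can be illustrated with a standard-library stand-in. This is not scikit-learn's GroupShuffleSplit and not the project's code; the function `group_split` and the seed value are hypothetical. It only demonstrates the property that matters here: a link's years never straddle the train/test boundary.

```python
import random

def group_split(link_ids, test_size=0.2, seed=42):
    """Assign whole links to train or test so no link_id appears in both."""
    unique = sorted(set(link_ids))
    rng = random.Random(seed)  # hypothetical seed; the real RANDOM_STATE is not shown here
    rng.shuffle(unique)
    n_test = round(test_size * len(unique))
    held_out = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(link_ids) if g not in held_out]
    test_idx = [i for i, g in enumerate(link_ids) if g in held_out]
    return train_idx, test_idx

# Toy link-year table: 50 links x 10 years each.
link_ids = [link for link in range(50) for _ in range(10)]
train_idx, test_idx = group_split(link_ids)
train_links = {link_ids[i] for i in train_idx}
test_links = {link_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_links.isdisjoint(test_links))  # 400 100 True
```

Splitting by row instead of by link would let the model see other years of a test link during training, which tends to inflate out-of-sample scores for link-level features that barely change over time.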

Features (from trained artefact — collision_metrics.json → xgb.features):

| # | Feature | Category | vs GLM |
|---|---|---|---|
| 1 | road_class_ord | Road structure | same |
| 2 | form_of_way_ord | Road structure | same |
| 3–9 | is_motorway … is_primary | Binary flags | same |
| 10 | log_link_length | Geometry | same |
| 11 | estimated_aadt | Exposure | XGBoost only |
| 12 | is_covid | Temporal | same |
| 13 | year_norm | Temporal | same |
| 14 | hgv_proportion | Traffic | XGBoost only |
| 15 | degree_mean | Network | same |
| 16 | betweenness | Network | same |
| 17 | betweenness_relative | Network | same |
| 18 | dist_to_major_km | Network | same |
| 19 | pop_density_per_km2 | Network | same |
| 20 | speed_limit_mph_effective | Speed limit | same |
| 21 | lanes | OSM | raw in XGBoost (GLM uses lanes_imputed) |
| 22 | is_unpaved | OSM | raw in XGBoost (GLM uses is_unpaved_imputed) |

XGBoost receives estimated_aadt as a raw feature in addition to the log-offset, so it can exploit non-linear interactions with exposure that the fixed offset coefficient does not allow in the GLM. hgv_proportion is included because the XGBoost feature selection applies only a presence check (if col in df.columns) with no coverage-percentage threshold, and the column was present at training time. The current XGBoost run also includes the effective speed limit, lanes, and the unpaved-surface flag. Raw speed_limit_mph is retained as provenance but is not in the trained feature list. lit is present in network_features.parquet but is likewise not in the trained feature list.

Metrics:

| Metric | Value |
|---|---|
| Pseudo-R² | 0.3235 (mean across 5 post-fix seeds with temporal features included; range 0.3214–0.3265) |
| Test deviance | 497,289 (mean across 5 post-fix seeds) |

Comparability caveat: GLM pseudo-R² is in-sample on a downsampled set (~91% zeros); XGBoost is out-of-sample on the true distribution (~98% zeros). The gap should not be read as a clean model horse race — the two metrics are not computed on a common evaluation set or against a common null model. Earlier docs cited XGBoost pseudo-R² around 0.86, but that number came from a pre-fix evaluation surface that was later superseded after a Stage 2 leakage diagnosis. For current project positioning, use the post-fix ~0.32 baseline instead.
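To make the caveat concrete, the sketch below shows what a like-for-like comparison would need: one evaluation set, one null model, and the same Poisson deviance applied to each model's predictions. The toy counts, predictions, and the constant-mean null are assumptions for illustration; the page does not specify the null model used in the project.

```python
import math

def poisson_deviance(y, mu):
    """Total Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), taking 0*log(0) as 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2 * (term - (yi - mi))
    return total

def pseudo_r2(y, mu, mu_null):
    """Deviance-based pseudo-R² against an explicit null prediction vector."""
    return 1 - poisson_deviance(y, mu) / poisson_deviance(y, mu_null)

# Toy held-out counts and two hypothetical sets of predictions.
y = [0, 0, 1, 3]
pred_a = [0.2, 0.3, 1.0, 2.5]
pred_b = [0.5, 0.5, 1.0, 2.0]
mu_null = [sum(y) / len(y)] * len(y)  # assumed null: constant mean count

r2_a = pseudo_r2(y, pred_a, mu_null)
r2_b = pseudo_r2(y, pred_b, mu_null)
print(r2_a > r2_b)  # True
```

Because both scores share `y` and `mu_null`, their difference reflects the predictions alone. That property does not hold for the GLM and XGBoost figures reported on this page.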

1.4 Output

data/models/risk_scores.parquet: one row per link. Key columns: predicted_xgb (mean collisions/year), predicted_glm, residual_glm, risk_percentile (XGBoost rank × 100 / n_links), collision_count, estimated_aadt, hgv_proportion, speed_limit_mph_effective, speed_limit_mph (raw, retained as provenance), and betweenness_relative. Post-event diagnostic columns such as pct_dark, pct_urban, pct_junction, pct_near_crossing, and mean_speed_limit are excluded from the output contract.
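The risk_percentile formula can be sketched as follows. Two details are assumptions, since the page gives only the formula: ranks run from 1 (lowest prediction) to n_links (highest), so the riskiest link scores 100, and ties are broken by input order. The function name and the sample predictions are hypothetical.

```python
def risk_percentile(predicted):
    """Rank links by prediction (1 = lowest) and scale ranks to (0, 100]."""
    n_links = len(predicted)
    order = sorted(range(n_links), key=lambda i: predicted[i])
    pct = [0.0] * n_links
    for rank, i in enumerate(order, start=1):
        pct[i] = rank * 100 / n_links
    return pct

predicted_xgb = [0.02, 0.31, 0.07, 0.15]  # hypothetical mean collisions/year
pct = risk_percentile(predicted_xgb)
print(pct)  # [25.0, 100.0, 50.0, 75.0]
```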

The effective-speed retrain retained 2,167,557 scored links and 21,676 top-1% links. Compared with the pre-effective-speed risk_scores.parquet, Spearman rank correlation across all links was 0.9962 and top-1% Jaccard overlap was 0.9512.
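The two stability measures quoted above can be sketched with the standard library. The helper names are hypothetical, the Spearman shortcut formula is valid only when there are no tied scores, and the toy scores are made up; the real comparison presumably used the two risk_scores.parquet files.

```python
import math

def top_share(scores, share=0.01):
    """Indices of the links in the top `share` of scores (count rounded up)."""
    k = math.ceil(share * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

def spearman_no_ties(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy example: ten links, with the two highest-scoring links swapped after a retrain.
old = [0.1 * i for i in range(1, 11)]
new = list(old)
new[8], new[9] = new[9], new[8]

rho = spearman_no_ties(old, new)
overlap = jaccard(top_share(old, 0.2), top_share(new, 0.2))
print(round(rho, 4), overlap)  # 0.9879 1.0

# The reported top-1% count is consistent with rounding 1% of scored links up.
print(math.ceil(0.01 * 2_167_557))  # 21676
```

The toy case shows why both measures are reported: swapping two links inside the top set lowers the rank correlation slightly but leaves the top-set overlap unchanged.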

