Stage 2: Collision Risk Model

Analysis of Stage 2 collision risk model outputs, feature behaviour, metrics, ranking patterns, and diagnostic comparisons.

1 Overview

Stage 2 fits a Poisson collision model on all OS Open Roads network data links × AADF years using estimated AADT as the exposure offset. This is the main exposure-adjusted collision risk model. Two complementary models are used:

Poisson GLM (statsmodels) — interpretable coefficients, fast inference
XGBoost Poisson — captures non-linear interactions, higher predictive power

Both models are trained on link × year data then pooled to a single stable risk score per link. The pooled output (risk_scores.parquet) has one row per link with no year dimension.

from pathlib import Path
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

from road_risk.config import _ROOT as ROOT
RANDOM_STATE = 42
XGB_POSTFIX_PR2 = 0.323498

try:
    import geopandas as gpd
    HAS_GPD = True
except Exception:
    HAS_GPD = False

risk = pd.read_parquet(ROOT / "data/models/risk_scores.parquet")

with open(ROOT / "data/models/collision_metrics.json") as f:
    metrics = json.load(f)

if HAS_GPD:
    or_path = ROOT / "data/processed/shapefiles/openroads.parquet"
    openroads = gpd.read_parquet(or_path) if or_path.exists() else None
else:
    openroads = None

cls_order = ["Motorway", "A Road", "B Road", "Classified Unnumbered",
             "Not Classified", "Unclassified", "Unknown"]
cls_in_data = [c for c in cls_order if c in risk["road_classification"].unique()]

print(f"Links scored     : {len(risk):,}")
print(f"With collisions  : {(risk['collision_count'] > 0).sum():,} "
      f"({(risk['collision_count'] > 0).mean():.1%} of all links)")
print(f"Total collisions : {risk['collision_count'].sum():,}")
print(f"Total fatals     : {risk['fatal_count'].sum():,}")
print(f"\nModel performance:")
print(f"  GLM pseudo-R²  : {metrics['glm']['pseudo_r2']:.3f}")
print(f"  XGB pseudo-R²  : {XGB_POSTFIX_PR2:.3f} (post-fix 5-seed mean)")
print(f"  GLM converged  : {metrics['glm']['converged']}")

Links scored     : 2,167,557
With collisions  : 233,604 (10.8% of all links)
Total collisions : 450,992
Total fatals     : 7,402

Model performance:
  GLM pseudo-R²  : 0.351
  XGB pseudo-R²  : 0.323 (post-fix 5-seed mean)
  GLM converged  : True

2 Model performance

2.1 GLM summary

g = metrics["glm"]
print(f"Training rows (downsampled for GLM) : {g['n_obs']:,}")
print(f"Full dataset rows                   : {g['n_full']:,}")
print(f"Positive link-years                 : {g['n_pos']:,} ({g['n_pos']/g['n_full']:.2%})")
print(f"Pseudo-R² (1 - D/D₀)               : {g['pseudo_r2']:.3f}")
print(f"AIC                                 : {g['aic']:,.0f}")
print(f"\nFeatures ({len(g['features'])}):")
for feat in g["features"]:
    print(f"  {feat}")

Training rows (downsampled for GLM) : 4,303,805
Full dataset rows                   : 21,675,570
Positive link-years                 : 391,255 (1.81%)
Pseudo-R² (1 - D/D₀)               : 0.351
AIC                                 : 2,229,409

Features (34):
  road_class_ord
  form_of_way_ord
  is_motorway
  is_a_road
  is_slip_road
  is_roundabout
  is_dual
  is_trunk
  is_primary
  log_link_length
  is_covid
  year_norm
  hgv_proportion_imputed
  degree_mean_imputed
  betweenness_imputed
  betweenness_relative_imputed
  dist_to_major_km_imputed
  pop_density_per_km2_imputed
  speed_limit_mph_effective_imputed
  lanes_imputed
  is_unpaved_imputed
  imd_decile_imputed
  imd_crime_decile_imputed
  imd_living_indoor_decile_imputed
  mean_grade_imputed
  hgv_proportion_missing
  pop_density_per_km2_missing
  speed_limit_mph_effective_missing
  lanes_missing
  is_unpaved_missing
  imd_decile_missing
  imd_crime_decile_missing
  imd_living_indoor_decile_missing
  mean_grade_missing

2.2 GLM coefficients

glm_path = ROOT / "data/models/collision_glm.pkl"
if glm_path.exists():
    try:
        import statsmodels.api as sm
        glm_result = sm.load(str(glm_path))

        coef = pd.DataFrame({
            "coef":    glm_result.params,
            "ci_low":  glm_result.conf_int()[0],
            "ci_high": glm_result.conf_int()[1],
            "pvalue":  glm_result.pvalues,
        }).drop(index="const", errors="ignore").sort_values("coef")

        bar_colors = ["#e63946" if p < 0.05 else "#adb5bd" for p in coef["pvalue"]]

        fig, ax = plt.subplots(figsize=(8, max(5, len(coef) * 0.45)))
        ax.barh(coef.index, coef["coef"], color=bar_colors, alpha=0.85)
        ax.errorbar(
            coef["coef"], range(len(coef)),
            xerr=[coef["coef"] - coef["ci_low"], coef["ci_high"] - coef["coef"]],
            fmt="none", color="black", linewidth=0.8, capsize=3,
        )
        ax.axvline(0, color="black", linewidth=0.8, linestyle="--")
        ax.set_xlabel("Coefficient (log-rate scale)\nRed = significant p<0.05")
        ax.set_title(
            "Poisson GLM coefficients\n"
            "(positive = higher collision rate per vehicle-km)"
        )
        plt.tight_layout()
        plt.show()

        sig = coef[coef["pvalue"] < 0.05].copy()
        sig["IRR"] = np.exp(sig["coef"]).round(3)
        print("\nSignificant coefficients (p<0.05) with incidence rate ratios:")
        print(sig[["coef", "IRR", "pvalue"]].round(4).to_string())

    except Exception as e:
        print(f"Could not load GLM: {e}")
else:
    print("collision_glm.pkl not found — run: python -m road_risk.model --stage collision")


Significant coefficients (p<0.05) with incidence rate ratios:
                                     coef      IRR  pvalue
betweenness_imputed               -4.9236    0.007     0.0
imd_living_indoor_decile_missing  -1.5819    0.206     0.0
imd_decile_missing                -1.5819    0.206     0.0
imd_crime_decile_missing          -1.5819    0.206     0.0
is_motorway                       -1.1411    0.319     0.0
is_dual                           -0.8257    0.438     0.0
is_slip_road                      -0.7620    0.467     0.0
hgv_proportion_imputed            -0.7220    0.486     0.0
log_link_length                   -0.5649    0.568     0.0
year_norm                         -0.5309    0.588     0.0
hgv_proportion_missing            -0.5113    0.600     0.0
is_a_road                         -0.5105    0.600     0.0
is_unpaved_imputed                -0.4434    0.642     0.0
speed_limit_mph_effective_missing -0.2494    0.779     0.0
is_roundabout                     -0.1851    0.831     0.0
is_unpaved_missing                -0.1483    0.862     0.0
lanes_imputed                     -0.1107    0.895     0.0
is_trunk                          -0.1021    0.903     0.0
is_covid                          -0.0549    0.947     0.0
dist_to_major_km_imputed          -0.0458    0.955     0.0
imd_decile_imputed                -0.0384    0.962     0.0
road_class_ord                    -0.0381    0.963     0.0
imd_living_indoor_decile_imputed  -0.0375    0.963     0.0
mean_grade_imputed                -0.0220    0.978     0.0
imd_crime_decile_imputed          -0.0066    0.993     0.0
pop_density_per_km2_imputed        0.0000    1.000     0.0
mean_grade_missing                 0.0584    1.060     0.0
lanes_missing                      0.0971    1.102     0.0
betweenness_relative_imputed       0.1078    1.114     0.0
is_primary                         0.1437    1.155     0.0
form_of_way_ord                    0.2123    1.236     0.0
degree_mean_imputed                0.5517    1.736     0.0
pop_density_per_km2_missing        4.9141  136.194     0.0

2.3 Deprivation signal

The three IMD deciles run 1 (most deprived) -> 10 (least deprived), so a negative coefficient means less-deprived areas have a lower predicted collision rate, holding exposure and road geometry fixed. The Living Environment Indoors sub-domain is used deliberately — the Outdoors sub-domain contains road-traffic-accident indicators and would leak the outcome. See deprivation.

imd_terms = ["imd_decile", "imd_crime_decile", "imd_living_indoor_decile"]
try:
    rows = []
    for base in imd_terms:
        for t in [base, f"{base}_imputed"]:
            if t in glm_result.params.index:
                rows.append((t, glm_result.params[t],
                             np.exp(glm_result.params[t]), glm_result.pvalues[t]))
    if rows:
        imd_df = pd.DataFrame(rows, columns=["term", "coef", "IRR", "pvalue"])
        print(imd_df.round(4).to_string(index=False))
    else:
        print("No IMD decile terms in the fitted GLM.")
except NameError:
    print("Run the GLM coefficients cell first.")

                            term    coef    IRR  pvalue
              imd_decile_imputed -0.0384 0.9623     0.0
        imd_crime_decile_imputed -0.0066 0.9934     0.0
imd_living_indoor_decile_imputed -0.0375 0.9632     0.0

With ~2M link-years, statistical significance is near-guaranteed; the practical effect is modest (each decile step shifts the rate a few percent) and is associational, not causal — LSOA-level context, not a road attribute.

2.4 XGBoost feature importance

xgb_path = ROOT / "data/models/collision_xgb.json"
if xgb_path.exists():
    try:
        from xgboost import XGBRegressor
        xgb = XGBRegressor()
        xgb.load_model(str(xgb_path))

        importance = pd.Series(
            xgb.feature_importances_,
            index=metrics["xgb"]["features"],
        ).sort_values()

        fig, ax = plt.subplots(figsize=(8, max(5, len(importance) * 0.45)))
        ax.barh(importance.index, importance.values, color="#457b9d", alpha=0.85)
        ax.set_xlabel("Feature importance (gain, normalised)")
        ax.set_title("XGBoost Poisson — feature importance")
        plt.tight_layout()
        plt.show()

    except Exception as e:
        print(f"Could not load XGBoost: {e}")
else:
    print("collision_xgb.json not found.")

2.5 GLM vs XGBoost agreement

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

has_both = risk[["predicted_glm", "predicted_xgb"]].gt(0).all(axis=1)
sample = risk[has_both].sample(min(50_000, has_both.sum()), random_state=RANDOM_STATE)

axes[0].scatter(np.log(sample["predicted_glm"]), np.log(sample["predicted_xgb"]),
                s=2, alpha=0.15, color="#1d3557")
mn = min(np.log(sample["predicted_glm"]).min(), np.log(sample["predicted_xgb"]).min())
mx = max(np.log(sample["predicted_glm"]).max(), np.log(sample["predicted_xgb"]).max())
axes[0].plot([mn, mx], [mn, mx], "r--", linewidth=1)
corr = np.corrcoef(np.log(sample["predicted_glm"]), np.log(sample["predicted_xgb"]))[0, 1]
axes[0].text(0.05, 0.92, f"r = {corr:.3f}", transform=axes[0].transAxes, fontsize=10)
axes[0].set_xlabel("log(GLM predicted rate)")
axes[0].set_ylabel("log(XGBoost predicted rate)")
axes[0].set_title("GLM vs XGBoost predictions (50k sample, log scale)")

pct_data = [risk[risk["road_classification"] == c]["risk_percentile"].values
            for c in cls_in_data]
axes[1].boxplot(pct_data, labels=cls_in_data, patch_artist=True,
                boxprops=dict(facecolor="#a8dadc", color="#1d3557"),
                medianprops=dict(color="#e63946", linewidth=2))
axes[1].set_ylabel("Risk percentile")
axes[1].set_title("Risk percentile distribution by road class")
axes[1].tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.show()

3 Risk score distribution

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

pos = risk[risk["predicted_glm"] > 0]
axes[0].hist(np.log(pos["predicted_glm"]), bins=60, color="#1d3557", edgecolor="none")
axes[0].set_xlabel("log(predicted collisions/year)")
axes[0].set_ylabel("Link count")
axes[0].set_title("Predicted collision rate — all links (log scale)")

axes[1].hist(risk["collision_count"].clip(0, 20), bins=21,
             range=(-0.5, 20.5), color="#457b9d", edgecolor="white", linewidth=0.3)
axes[1].set_yscale("log")
axes[1].set_xlabel("Observed collisions (pooled years, capped at 20)")
axes[1].set_ylabel("Link count (log scale)")
axes[1].set_title(
    f"Observed collision counts\n"
    f"({(risk['collision_count'] == 0).mean():.1%} of links: zero collisions)"
)

plt.tight_layout()
plt.show()

3.1 Risk percentile by road class

cls_med = (
    risk.groupby("road_classification")["risk_percentile"]
    .median()
    .reindex(cls_in_data)
    .dropna()
    .sort_values(ascending=False)
)
colors = ["#e63946" if rc in ["Motorway", "A Road"] else "#457b9d"
          for rc in cls_med.index]

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(cls_med.index, cls_med.values, color=colors)
ax.axhline(50, color="black", linestyle="--", linewidth=0.8, alpha=0.5)
ax.set_ylabel("Median risk percentile")
ax.set_title("Median risk percentile by road class  (red = major roads)")
ax.tick_params(axis="x", rotation=30)
plt.tight_layout()
plt.show()

print("Risk percentile by road class:")
print(
    risk.groupby("road_classification")["risk_percentile"]
    .agg(["mean", "median", "count"])
    .round(1)
    .reindex(cls_in_data)
    .to_string()
)

Risk percentile by road class:
                       mean  median    count
road_classification                         
Motorway               93.8    97.6     4084
A Road                 90.0    93.7   155538
B Road                 83.2    87.7    89286
Classified Unnumbered  70.5    76.3   190921
Not Classified         29.6    25.9   224878
Unclassified           51.9    53.4  1060014
Unknown                25.8    20.1   442836

4 Residual analysis

Residuals (observed − predicted × n_years) identify where the model systematically under- or over-predicts. Large positive residuals flag links that had more collisions than the model expected given their traffic and road characteristics.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

axes[0].hist(risk["residual_glm"].clip(-5, 10), bins=80,
             color="#457b9d", edgecolor="none")
axes[0].axvline(0, color="#e63946", linewidth=1.5, linestyle="--")
axes[0].set_yscale("log")
axes[0].set_xlabel("Residual (observed − predicted, pooled years)")
axes[0].set_ylabel("Link count (log scale)")
axes[0].set_title(
    f"GLM residuals\n"
    f"median={risk['residual_glm'].median():.3f}  std={risk['residual_glm'].std():.2f}"
)

pos = risk[risk["predicted_glm"] > 0].sample(50_000, random_state=RANDOM_STATE)
axes[1].scatter(np.log(pos["predicted_glm"]), pos["residual_glm"].clip(-5, 10),
                s=2, alpha=0.15, color="#1d3557")
axes[1].axhline(0, color="#e63946", linewidth=1, linestyle="--")
axes[1].set_xlabel("log(predicted collision rate)")
axes[1].set_ylabel("Residual (clipped −5 to +10)")
axes[1].set_title("Residuals vs predicted rate (overdispersion check)")

plt.tight_layout()
plt.show()

4.1 Residuals by road class

res_by_class = (
    risk.groupby("road_classification")["residual_glm"]
    .agg(["mean", "median", "std", "count"])
    .round(3)
    .reindex(cls_in_data)
    .dropna()
)
print("Residuals by road class:")
print(res_by_class.to_string())

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(res_by_class.index, res_by_class["median"], color="#457b9d")
ax.axhline(0, color="black", linewidth=0.8, linestyle="--")
ax.set_ylabel("Median residual (observed − predicted)")
ax.set_title("Median GLM residual by road class\n(positive = model under-predicts risk)")
ax.tick_params(axis="x", rotation=30)
plt.tight_layout()
plt.show()

Residuals by road class:
                        mean  median    std    count
road_classification                                 
Motorway              -3.427  -1.918  8.293     4084
A Road                -2.528  -2.134  2.755   155538
B Road                -2.637  -1.930  2.794    89286
Classified Unnumbered -0.409  -0.321  0.990   190921
Not Classified        -0.247  -0.156  0.364   224878
Unclassified          -0.399  -0.244  0.655  1060014
Unknown               -0.234  -0.141  0.397   442836

5 Geographic risk maps

5.1 Full-network risk percentile

if HAS_GPD and openroads is not None:
    or_risk = openroads[["link_id", "road_classification", "geometry"]].merge(
        risk[["link_id", "risk_percentile", "residual_glm", "collision_count"]],
        on="link_id", how="inner",
    )
    or_risk = or_risk[or_risk.geometry.notna()].copy()

    fig, axes = plt.subplots(1, 2, figsize=(16, 9))

    or_risk.plot(
        column="risk_percentile", ax=axes[0],
        cmap="RdYlGn_r", vmin=75, vmax=100,
        linewidth=0.3, legend=True,
        legend_kwds={"label": "Risk percentile", "shrink": 0.6},
    )
    axes[0].set_title("Risk percentile — all links")
    axes[0].set_axis_off()

    major = or_risk[or_risk["road_classification"].isin(["Motorway", "A Road"])].copy()
    major.plot(
        column="risk_percentile", ax=axes[1],
        cmap="RdYlGn_r", vmin=75, vmax=100,
        linewidth=0.8, legend=True,
        legend_kwds={"label": "Risk percentile", "shrink": 0.6},
    )
    axes[1].set_title("Risk percentile — Motorways & A Roads")
    axes[1].set_axis_off()

    plt.suptitle("Stage 2 collision risk scores", fontsize=13)
    plt.tight_layout()
    plt.show()

5.2 Top 1% risk links

if HAS_GPD and openroads is not None:
    top1 = or_risk[or_risk["risk_percentile"] >= 99].copy()
    rng  = np.random.default_rng(RANDOM_STATE)
    bg   = or_risk.iloc[rng.choice(len(or_risk), size=min(150_000, len(or_risk)), replace=False)]

    fig, axes = plt.subplots(1, 2, figsize=(16, 9))

    bg.plot(ax=axes[0], color="#e0e0e0", linewidth=0.2, zorder=1)
    top1.plot(column="collision_count", ax=axes[0], cmap="hot_r",
              linewidth=1.2, legend=True, zorder=2,
              legend_kwds={"label": "Observed collisions (pooled)", "shrink": 0.6})
    axes[0].set_title(f"Top 1% risk links ({len(top1):,})\nColoured by observed collisions")
    axes[0].set_axis_off()

    bg.plot(ax=axes[1], color="#e0e0e0", linewidth=0.2, zorder=1)
    top1.plot(column="residual_glm", ax=axes[1], cmap="RdBu",
              vmin=-5, vmax=5, linewidth=1.2, legend=True, zorder=2,
              legend_kwds={"label": "GLM residual", "shrink": 0.6})
    axes[1].set_title("Top 1% risk links\nColoured by GLM residual")
    axes[1].set_axis_off()

    plt.suptitle("Highest-risk road segments", fontsize=13)
    plt.tight_layout()
    plt.show()

    print(f"\nTop 1% risk links by road class ({len(top1):,} total):")
    print(top1["road_classification"].value_counts().to_string())


Top 1% risk links by road class (21,676 total):
road_classification
A Road                   16030
B Road                    2240
Classified Unnumbered     1768
Motorway                  1410
Unclassified               217
Unknown                     11

5.3 Excess-risk map (residual > 2)

Links where observed collisions substantially exceeded the model’s prediction given traffic volume and road type — potential candidates for engineering review.

if HAS_GPD and openroads is not None:
    excess = or_risk[or_risk["residual_glm"] > 2].copy()

    fig, ax = plt.subplots(figsize=(10, 9))
    bg.plot(ax=ax, color="#e0e0e0", linewidth=0.2, zorder=1)
    excess.plot(column="residual_glm", ax=ax, cmap="Oranges",
                linewidth=1.5, legend=True, zorder=2,
                legend_kwds={"label": "Residual (observed − predicted)", "shrink": 0.6})
    ax.set_title(
        f"Excess-risk links — residual > 2  ({len(excess):,} links)\n"
        "Had substantially more collisions than road type and traffic predict"
    )
    ax.set_axis_off()
    plt.tight_layout()
    plt.show()

6 Summary

print("=== Stage 2 collision model summary ===\n")
print(f"Links scored          : {len(risk):,}")
print(f"Links with collisions : {(risk['collision_count'] > 0).sum():,} "
      f"({(risk['collision_count'] > 0).mean():.1%})")
print(f"\nPoisson GLM pseudo-R² : {metrics['glm']['pseudo_r2']:.3f}")
print(f"XGBoost pseudo-R²     : {XGB_POSTFIX_PR2:.3f} (post-fix 5-seed mean)")
print(f"\nTop 1% risk links     : {(risk['risk_percentile'] >= 99).sum():,}")
print(f"Excess-risk links (residual > 2): {(risk['residual_glm'] > 2).sum():,}")
print(f"\nExcess-risk by road class:")
print(
    risk[risk["residual_glm"] > 2]["road_classification"]
    .value_counts()
    .to_string()
)

=== Stage 2 collision model summary ===

Links scored          : 2,167,557
Links with collisions : 233,604 (10.8%)

Poisson GLM pseudo-R² : 0.351
XGBoost pseudo-R²     : 0.323 (post-fix 5-seed mean)

Top 1% risk links     : 21,676
Excess-risk links (residual > 2): 11,245

Excess-risk by road class:
road_classification
A Road                   3316
Classified Unnumbered    3275
Unclassified             3136
B Road                    636
Unknown                   465
Motorway                  296
Not Classified            121

Key observations:

XGBoost’s current honest post-fix baseline is about pseudo-R² 0.32 across five seeds (0.323 mean, with temporal features included), while the GLM sits at pseudo-R² 0.347 on its own in-sample downsampled surface. Earlier repo docs cited an XGBoost pseudo-R² around 0.86, but that figure came from a pre-fix evaluation surface that was later superseded after a feature-table leakage diagnosis. The two current pseudo-R² values are also still computed on different row subsets and with different null models (GLM: in-sample on a downsampled training set; XGBoost: out-of-sample on the full zero-heavy test set), so the gap should not be read as pure modelling lift. These metrics are from the current run after the counted-only Stage 1a filter, the Stage 2 post-event provenance guard, OSM feature enrichment, IMD features, terrain grade, and the GLM optional-feature imputation refactor. See feature engineering for the full caveat.
The GLM provides interpretable incidence rate ratios per feature; XGBoost predictions complement these with higher predictive power for the risk map.
Residuals show where traffic volume and road geometry alone under-predict observed collisions — likely reflecting local factors not in the feature set (junction design, sight lines, speed compliance).
The top 1% risk links are concentrated on high-traffic A-roads and motorways, consistent with higher exposure and absolute collision counts.
Excess-risk links (residual > 2) are candidates for targeted intervention where infrastructure or behaviour factors drive risk beyond what the model can attribute to road class and traffic alone.