Open Road Risk
  • Home
  • Project
    • Project overview
    • Current model status
    • AI-assisted development
  • Background
    • Metrics and methodology
    • Literature evidence register
  • Literature
    • Crash frequency models
    • Exposure and traffic volume
    • Spatial methods and network risk
    • Junctions and conflict structure
    • Severity modelling
    • Validation and metrics
    • Transferability and open data limits
  • Data Sources
    • Overview
    • STATS19 Collisions
    • OS Open Roads
    • AADF Traffic Counts
    • WebTRIS Sensors
    • Network Model GDB
  • Methodology
    • Methodology Overview
    • Joining the Datasets
    • Feature Engineering
    • Empirical Bayes Shrinkage
  • Exploratory Data Analysis
    • Collision EDA
    • Collision-Exposure Behaviour
    • Vehicle Mix Analysis
    • Road Curvature
    • Months and Days of Week
    • Traffic Volume EDA
    • OSM Coverage
  • Models
    • Modelling Approach
    • Stage 1a: Traffic Volume
    • Stage 1b: Time-Zone Profiles
    • Stage 2: Collision Risk Model
    • Facility Family Split
    • Model Inventory
  • Outputs
    • Top-risk map
  • Future Work

On this page

  • 1 Overview
  • 2 Measured AADF — training data
    • 2.1 Spatial coverage of count points
    • 2.2 AADF distribution by road type
    • 2.3 Annual trends including COVID
    • 2.4 HGV proportion by road type
  • 3 Stage 1a model
    • 3.1 Design summary
    • 3.2 Cross-validation results
    • 3.3 External validation
  • 4 Full-network AADT estimates
    • 4.1 Distribution by road classification
    • 4.2 Year-on-year trends
    • 4.3 Geographic distribution — all links (2023)
    • 4.4 Sanity check — estimated vs measured AADF (2023)
  • 5 Summary

Traffic Volume EDA

1 Overview

Stage 1a estimates Annual Average Daily Traffic (AADT) for every OS Open Roads link in the study area. Directly measured AADF count points cover only a minority of road links — particularly minor and unclassified roads — so a gradient-boosted regressor is trained on those measured points and applied to the full 2.1M-link network.

This page covers:

  • The measured AADF count-point data used for training
  • How the model was built and validated
  • The resulting full-network AADT estimates and their geographic pattern
from pathlib import Path
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

from road_risk.config import _ROOT as ROOT
RANDOM_STATE = 42

try:
    import geopandas as gpd
    HAS_GPD = True
except Exception:
    HAS_GPD = False

aadf = pd.read_parquet(ROOT / "data/processed/aadf/aadf_clean.parquet")

aadt_path = ROOT / "data/models/aadt_estimates.parquet"
aadt = pd.read_parquet(aadt_path) if aadt_path.exists() else pd.DataFrame()

if HAS_GPD:
    or_path = ROOT / "data/processed/shapefiles/openroads.parquet"
    openroads = gpd.read_parquet(or_path) if or_path.exists() else None
else:
    openroads = None

val_path = ROOT / "data/models/aadt_validation_summary.parquet"
val_summary = pd.read_parquet(val_path) if val_path.exists() else pd.DataFrame()

print(f"AADF count points : {aadf['count_point_id'].nunique():,} sites × {aadf['year'].nunique()} years ({aadf.shape[0]:,} rows)")
print(f"AADT estimates    : {aadt['link_id'].nunique():,} links × {aadt['year'].nunique()} years ({len(aadt):,} rows)")
if openroads is not None:
    print(f"OS Open Roads     : {len(openroads):,} links")
AADF count points : 14,193 sites × 10 years (105,661 rows)
AADT estimates    : 2,167,557 links × 10 years (21,675,570 rows)
OS Open Roads     : 2,167,557 links

2 Measured AADF — training data

2.1 Spatial coverage of count points

if HAS_GPD and openroads is not None:
    fig, ax = plt.subplots(figsize=(10, 8))

    rng = np.random.default_rng(RANDOM_STATE)
    sample_idx = rng.choice(len(openroads), size=min(200_000, len(openroads)), replace=False)
    openroads.iloc[sample_idx].plot(ax=ax, color="#d0d0d0", linewidth=0.2, zorder=1)

    latest_aadf = (
        aadf.sort_values("year")
        .groupby("count_point_id", as_index=False)
        .last()
    )
    gdf_pts = gpd.GeoDataFrame(
        latest_aadf,
        geometry=gpd.points_from_xy(latest_aadf["longitude"], latest_aadf["latitude"]),
        crs="EPSG:4326",
    )
    gdf_pts.plot(ax=ax, markersize=3, color="#e63946", alpha=0.6, zorder=2)

    ax.set_title(
        f"AADF count-point locations ({gdf_pts['count_point_id'].nunique():,} sites)\n"
        "Red dots = measured sites; grey = full road network (200k link sample)"
    )
    ax.set_axis_off()
    plt.tight_layout()
    plt.show()

2.2 AADF distribution by road type

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

measured = aadf[aadf["all_motor_vehicles"] > 0].copy()

axes[0].hist(np.log(measured["all_motor_vehicles"]), bins=60, color="#457b9d", edgecolor="none")
axes[0].set_xlabel("log(AADF vehicles/day)")
axes[0].set_ylabel("Count")
axes[0].set_title("Distribution of measured AADF (log scale)")

medians = (
    measured.groupby("road_type")["all_motor_vehicles"]
    .median()
    .sort_values(ascending=False)
)
axes[1].barh(medians.index, medians.values / 1000, color="#1d3557")
axes[1].set_xlabel("Median AADF (thousands vehicles/day)")
axes[1].set_title("Median AADF by road type")
for i, v in enumerate(medians.values):
    axes[1].text(v / 1000 + 0.3, i, f"{v/1000:.1f}k", va="center", fontsize=9)

plt.tight_layout()
plt.show()

2.3 Annual trends including COVID

year_stats = (
    measured.groupby("year")["all_motor_vehicles"]
    .agg(["median", "count"])
    .reset_index()
)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(year_stats["year"], year_stats["median"] / 1000, marker="o", color="#1d3557")
for y in [2020, 2021]:
    if y in year_stats["year"].values:
        axes[0].axvline(y, color="#e63946", linestyle="--", alpha=0.5,
                        label="COVID" if y == 2020 else "")
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Median AADF (thousands vehicles/day)")
axes[0].set_title("Median AADF by year (red dashes = COVID years)")
axes[0].set_xticks(year_stats["year"])
axes[0].tick_params(axis="x", rotation=45)
axes[0].legend()

axes[1].bar(year_stats["year"], year_stats["count"], color="#457b9d")
axes[1].set_xlabel("Year")
axes[1].set_ylabel("Observations")
axes[1].set_title("AADF observations per year")
axes[1].set_xticks(year_stats["year"])
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

2.4 HGV proportion by road type

hgv_data = measured.dropna(subset=["hgv_proportion", "road_type"])

fig, ax = plt.subplots(figsize=(8, 4))
road_types = hgv_data["road_type"].value_counts().index.tolist()
box_data = [hgv_data[hgv_data["road_type"] == rt]["hgv_proportion"].values for rt in road_types]
ax.boxplot(box_data, labels=road_types, patch_artist=True,
           boxprops=dict(facecolor="#a8dadc", color="#1d3557"),
           medianprops=dict(color="#e63946", linewidth=2))
ax.set_ylabel("HGV proportion")
ax.set_title("HGV proportion by road type")
plt.tight_layout()
plt.show()


3 Stage 1a model

3.1 Design summary

The model is a HistGradientBoostingRegressor (sklearn) with native missing-value handling. The target is log1p(AADF) de-meaned by year so the model learns road-level deviation from the network-wide annual average. Year means are stored separately and added back at inference — this prevents the year feature from dominating importance and collapsing full-network predictions near the annual mean.

GroupKFold CV grouped by count_point_id ensures the same station cannot appear in both train and test folds across years.

WebTRIS flow features are excluded from Stage 1a: they are unavailable for the 2.1M-link network at inference and their absence would need to be imputed with zeros, which corrupts predictions because HistGBR routes NaN and zero through different tree branches.

Feature Source
road_class_ord AADF road name prefix (M=6, A=5, B=4, other=1)
is_trunk road name prefix
latitude, longitude AADF count point location
is_covid year flag for 2020–2021
hgv_proportion AADF vehicle mix
link_length_km OS Open Roads (snapped via KD-tree)
Network features betweenness, node degree, dist to major road

3.2 Cross-validation results

metrics_path = ROOT / "data/models/collision_metrics.json"
if metrics_path.exists():
    with open(metrics_path) as f:
        all_metrics = json.load(f)
    aadt_metrics = all_metrics.get("aadt", {})
    if aadt_metrics:
        print(f"CV R²   : {aadt_metrics['cv_r2_mean']:.3f} (±{aadt_metrics['cv_r2_std']:.3f})")
        print(f"CV MAE  : {aadt_metrics['cv_mae_mean']:.3f} log-units")
        print(f"n train : {aadt_metrics['n_train']:,}")
    else:
        print("No 'aadt' key in collision_metrics.json.")
else:
    print("collision_metrics.json not found — run: python -m road_risk.model --stage traffic")
No 'aadt' key in collision_metrics.json.

3.3 External validation

Standard CV is optimistic because nearby stations share road class and network context. Two harder holdout schemes probe genuine extrapolation:

  • Local holdout — 20% of count-point IDs withheld at random. Tests gap-filling.
  • Spatial block — all points north of the 75th-percentile latitude withheld. Tests generalisation to a region with no measured support.
if not val_summary.empty:
    for _, row in val_summary.iterrows():
        print(f"  {row['scheme']:10s}: R²={row['r2']:.3f} | "
              f"MAE={row['mae']:.3f} | RMSE={row['rmse']:.3f} "
              f"(n={int(row['n_holdout']):,})")

    fig, ax = plt.subplots(figsize=(6, 3))
    colors = ["#457b9d", "#e63946"]
    bars = ax.barh(val_summary["scheme"], val_summary["r2"], color=colors)
    ax.axvline(0.72, color="black", linestyle="--", linewidth=1, alpha=0.5,
               label="GroupKFold CV R²≈0.72")
    ax.set_xlabel("R² (log scale)")
    ax.set_title("External validation R² by holdout scheme")
    ax.set_xlim(0, 1)
    for bar, val in zip(bars, val_summary["r2"]):
        ax.text(val + 0.005, bar.get_y() + bar.get_height() / 2,
                f"{val:.3f}", va="center", fontsize=10)
    ax.legend()
    plt.tight_layout()
    plt.show()
  local     : R²=0.831 | MAE=0.565 | RMSE=0.750 (n=7,639)
  spatial   : R²=0.792 | MAE=0.650 | RMSE=0.839 (n=9,773)

val_local_path = ROOT / "data/models/aadt_validation_local.parquet"
if val_local_path.exists():
    val_local = pd.read_parquet(val_local_path)
    print("Local holdout — metrics by road type:\n")
    print(val_local[["road_type", "r2", "mae", "rmse", "n"]].to_string(index=False))
Local holdout — metrics by road type:

road_type       r2      mae     rmse      n
    Major 0.600410 0.454266 0.593950 3861.0
    Minor 0.558783 0.677618 0.880564 3778.0

4 Full-network AADT estimates

4.1 Distribution by road classification

if not aadt.empty and openroads is not None:
    est_2023 = aadt[aadt["year"] == 2023].copy()
    est_2023 = est_2023.merge(
        openroads[["link_id", "road_classification"]], on="link_id", how="left"
    )

    fig, axes = plt.subplots(1, 2, figsize=(13, 5))

    axes[0].hist(np.log(est_2023["estimated_aadt"].clip(1)), bins=60,
                 color="#1d3557", edgecolor="none", alpha=0.8)
    axes[0].set_xlabel("log(estimated AADT vehicles/day)")
    axes[0].set_ylabel("Link count")
    axes[0].set_title("Estimated AADT — all 2.1M links (2023, log scale)")

    cls_order = ["Motorway", "A Road", "B Road", "Classified Unnumbered",
                 "Not Classified", "Unclassified", "Unknown"]
    cls_medians = (
        est_2023.groupby("road_classification")["estimated_aadt"]
        .median()
        .reindex([c for c in cls_order if c in est_2023["road_classification"].unique()])
        .dropna()
    )
    axes[1].barh(cls_medians.index, cls_medians.values / 1000, color="#457b9d")
    axes[1].set_xlabel("Median estimated AADT (thousands vehicles/day)")
    axes[1].set_title("Median estimated AADT by road class (2023)")
    for i, v in enumerate(cls_medians.values):
        axes[1].text(v / 1000 + 0.1, i, f"{v/1000:.1f}k", va="center", fontsize=9)

    plt.tight_layout()
    plt.show()

4.2 Year-on-year trends

if not aadt.empty:
    yr_med = aadt.groupby("year")["estimated_aadt"].median()

    fig, ax = plt.subplots(figsize=(9, 4))
    ax.plot(yr_med.index, yr_med.values / 1000, marker="o", color="#1d3557", linewidth=2)
    for y in [2020, 2021]:
        if y in yr_med.index:
            ax.axvline(y, color="#e63946", linestyle="--", alpha=0.5,
                       label="COVID" if y == 2020 else "")
    ax.set_xlabel("Year")
    ax.set_ylabel("Network-wide median estimated AADT (thousands)")
    ax.set_title("Network-wide median estimated AADT by year")
    ax.set_xticks(yr_med.index)
    ax.tick_params(axis="x", rotation=45)
    ax.legend()
    plt.tight_layout()
    plt.show()

4.3 Geographic distribution — all links (2023)

if HAS_GPD and openroads is not None and not aadt.empty:
    est_2023 = aadt[aadt["year"] == 2023][["link_id", "estimated_aadt"]].copy()
    or_plot = openroads.merge(est_2023, on="link_id", how="inner")
    or_plot = or_plot[or_plot.geometry.notna()].copy()

    fig, axes = plt.subplots(1, 2, figsize=(16, 9))

    or_plot.plot(
        column="estimated_aadt",
        ax=axes[0],
        norm=mcolors.LogNorm(
            vmin=max(1, or_plot["estimated_aadt"].quantile(0.05)),
            vmax=or_plot["estimated_aadt"].quantile(0.99),
        ),
        cmap="plasma",
        linewidth=0.3,
        legend=True,
        legend_kwds={"label": "Estimated AADT (log scale)", "shrink": 0.6},
    )
    axes[0].set_title("Estimated AADT — all links (2023)")
    axes[0].set_axis_off()

    major = or_plot[or_plot["road_classification"].isin(["Motorway", "A Road"])].copy()
    major.plot(
        column="estimated_aadt",
        ax=axes[1],
        norm=mcolors.LogNorm(
            vmin=max(1, major["estimated_aadt"].quantile(0.05)),
            vmax=major["estimated_aadt"].quantile(0.99),
        ),
        cmap="plasma",
        linewidth=0.8,
        legend=True,
        legend_kwds={"label": "Estimated AADT (log scale)", "shrink": 0.6},
    )
    axes[1].set_title("Motorways & A Roads only (2023)")
    axes[1].set_axis_off()

    plt.suptitle("Stage 1a full-network AADT estimates", fontsize=13)
    plt.tight_layout()
    plt.show()

4.4 Sanity check — estimated vs measured AADF (2023)

Links that have a directly measured AADF count point within 2 km should have estimates close to the measured value. This checks in-sample calibration.

if HAS_GPD and openroads is not None and not aadt.empty:
    import pyproj
    from scipy.spatial import cKDTree
    from sklearn.metrics import r2_score, mean_absolute_error

    or_bng = openroads.to_crs("EPSG:27700")
    link_xy = np.column_stack([
        or_bng.geometry.centroid.x,
        or_bng.geometry.centroid.y,
    ])
    tree = cKDTree(link_xy)

    wgs_bng = pyproj.Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

    aadf_yr = aadf[(aadf["year"] == 2023) & (aadf["all_motor_vehicles"] > 0)].copy()
    e, n = wgs_bng.transform(aadf_yr["longitude"].values, aadf_yr["latitude"].values)
    dists, idx = tree.query(np.column_stack([e, n]), k=1, distance_upper_bound=2000)
    valid = dists < 2000

    aadf_yr = aadf_yr.reset_index(drop=True)
    aadf_yr["_link_id"] = None
    aadf_yr.loc[valid, "_link_id"] = openroads["link_id"].values[idx[valid]]

    comparison = aadf_yr.dropna(subset=["_link_id"]).merge(
        aadt[aadt["year"] == 2023],
        left_on="_link_id", right_on="link_id",
        how="inner",
    )

    if len(comparison) > 0:
        fig, axes = plt.subplots(1, 2, figsize=(13, 5))

        max_pt = max(comparison["all_motor_vehicles"].max(), comparison["estimated_aadt"].max())
        axes[0].scatter(comparison["all_motor_vehicles"], comparison["estimated_aadt"],
                        s=5, alpha=0.3, color="#1d3557")
        axes[0].plot([1, max_pt], [1, max_pt], "r--", linewidth=1, label="1:1 line")
        axes[0].set_xscale("log"); axes[0].set_yscale("log")
        axes[0].set_xlabel("Measured AADF (vehicles/day)")
        axes[0].set_ylabel("Estimated AADT (vehicles/day)")
        axes[0].set_title(f"Measured vs estimated — {len(comparison):,} matched links (2023)")
        axes[0].legend()

        residuals = np.log(comparison["estimated_aadt"]) - np.log(comparison["all_motor_vehicles"])
        axes[1].hist(residuals, bins=60, color="#457b9d", edgecolor="none")
        axes[1].axvline(0, color="#e63946", linewidth=1.5, linestyle="--")
        axes[1].set_xlabel("log(estimated) − log(measured)")
        axes[1].set_ylabel("Count")
        axes[1].set_title(
            f"Residuals (log scale)\nmedian={residuals.median():.3f}  std={residuals.std():.3f}"
        )

        plt.tight_layout()
        plt.show()

        r2  = r2_score(np.log(comparison["all_motor_vehicles"]),
                       np.log(comparison["estimated_aadt"]))
        mae = mean_absolute_error(np.log(comparison["all_motor_vehicles"]),
                                  np.log(comparison["estimated_aadt"]))
        print(f"In-sample (2023, matched links): R²={r2:.3f} | MAE={mae:.3f} log-units | n={len(comparison):,}")

In-sample (2023, matched links): R²=-0.222 | MAE=1.101 log-units | n=10,562

5 Summary

print("=== Stage 1a output summary ===\n")
if not aadt.empty:
    print(f"Links with estimates : {aadt['link_id'].nunique():,}")
    print(f"Years covered        : {sorted(aadt['year'].unique())}")
    print(f"\nEstimated AADT distribution (2023):")
    print(aadt[aadt['year'] == 2023]['estimated_aadt'].describe().round(0).to_string())

if not val_summary.empty:
    print(f"\nExternal validation R²:")
    for _, row in val_summary.iterrows():
        print(f"  {row['scheme']:10s}: {row['r2']:.3f}")
=== Stage 1a output summary ===

Links with estimates : 2,167,557
Years covered        : [np.int64(2015), np.int64(2016), np.int64(2017), np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024)]

Estimated AADT distribution (2023):
count    2167557.0
mean        2064.0
std         4129.0
min           60.0
25%          440.0
50%          669.0
75%         1585.0
max       130611.0

External validation R²:
  local     : 0.831
  spatial   : 0.792

Key observations:

  • The road hierarchy is preserved: motorways and A-roads receive substantially higher AADT than unclassified roads.
  • COVID years (2020–2021) show a clear network-wide dip consistent with measured AADF.
  • Local holdout R² exceeds spatial block R², confirming the model is stronger at gap-filling than extrapolating to geography without any measured support.
  • The residual plot for matched links shows approximately zero median bias in log space, with spread driven mainly by minor roads where count-point representativeness is weakest.

Open Road Risk

 

Built with Quarto