Open Road Risk

Methodology

What this section covers

The Methodology section documents the decisions and processes that turn raw data sources into model-ready features. It is distinct from the Models section, which covers the predictive models themselves and their performance.

A useful way to think of the split:

  • Methodology: how the data is shaped — joining sources together, building features, handling overdispersion in count data.
  • Models: what predictions are made from that data — traffic estimation, time-zone profiles, collision risk.

The two sections sit on either side of a clear handoff. Methodology produces road_link_annual.parquet with one row per link per year and a fixed schema. Models consume that table and produce predictions.
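The handoff contract can be checked mechanically before any model reads the table. The following is a minimal sketch, not the project's own validation code: it assumes the key columns are named link_id and year, and it works on plain dictionaries rather than the parquet file so that it runs without extra dependencies.

```python
from collections import Counter

# Assumed key columns; the real schema is defined by the pipeline.
REQUIRED_COLUMNS = ("link_id", "year")


def check_handoff(rows):
    """Return a list of problems found in a link-by-year table.

    `rows` is an iterable of dicts, one per table row.
    """
    rows = list(rows)
    problems = []

    # Every row must carry the key columns.
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")

    # Each (link_id, year) pair must appear exactly once.
    keys = Counter(
        (r["link_id"], r["year"])
        for r in rows
        if all(c in r for c in REQUIRED_COLUMNS)
    )
    for key, n in sorted(keys.items()):
        if n > 1:
            problems.append(f"duplicate key {key}: {n} rows")
    return problems


sample = [
    {"link_id": "A", "year": 2019},
    {"link_id": "A", "year": 2020},
    {"link_id": "A", "year": 2020},  # violates one row per link per year
]
print(check_handoff(sample))
```

An empty list means the table satisfies the one-row-per-link-per-year rule; anything else should stop the pipeline before modelling starts.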

Pages in this section

Joining the Datasets — how STATS19 collisions, AADF count points, WebTRIS sensors, and OS Open Roads links are spatially joined into a single road-link-by-year table. Includes the four-dimension weighted snap, distance caps, and quality tracking.
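As a rough illustration of how a weighted snap with a distance cap and quality fields might fit together, consider the sketch below. The four dimensions used here (distance, bearing, road name, road class), the weights, and the 50 m cap are all assumptions chosen for the example; the actual dimensions and values are documented on the Joining the Datasets page.

```python
# Hypothetical weights and cap, for illustration only.
WEIGHTS = {"distance": 0.4, "bearing": 0.2, "road_name": 0.2, "road_class": 0.2}
MAX_SNAP_DISTANCE_M = 50.0


def snap_score(candidate):
    """Weighted mismatch score in [0, 1]; lower is a better match."""
    parts = {
        "distance": min(candidate["distance_m"] / MAX_SNAP_DISTANCE_M, 1.0),
        "bearing": candidate["bearing_diff_deg"] / 180.0,
        "road_name": 0.0 if candidate["name_match"] else 1.0,
        "road_class": 0.0 if candidate["class_match"] else 1.0,
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


def snap(candidates):
    """Pick the best link within the distance cap and record quality fields."""
    within_cap = [c for c in candidates if c["distance_m"] <= MAX_SNAP_DISTANCE_M]
    if not within_cap:
        # No acceptable match: leave the record unsnapped rather than guess.
        return {"link_id": None, "snap_distance_m": None, "snap_score": None}
    best = min(within_cap, key=snap_score)
    return {
        "link_id": best["link_id"],
        "snap_distance_m": best["distance_m"],
        "snap_score": round(snap_score(best), 3),
    }


candidates = [
    {"link_id": "near_wrong_road", "distance_m": 5.0, "bearing_diff_deg": 90.0,
     "name_match": False, "class_match": False},
    {"link_id": "far_right_road", "distance_m": 20.0, "bearing_diff_deg": 0.0,
     "name_match": True, "class_match": True},
    {"link_id": "too_far", "distance_m": 80.0, "bearing_diff_deg": 0.0,
     "name_match": True, "class_match": True},
]
print(snap(candidates))
```

In this toy case the nearest link loses to a slightly more distant one that agrees on the other three dimensions, which is the reason for weighting several dimensions instead of snapping on distance alone.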

Feature Engineering — link-level features built on top of the joined table: network centrality, road geometry, OSM attributes, population density. Documents the imputation rules for sparse OSM fields and the speed_limit_mph_effective tiered fallback.
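A tiered fallback of the kind used for speed_limit_mph_effective can be pictured as taking the first non-missing value from an ordered list of sources. The sketch below assumes three tiers with hypothetical column names; the real tiers and their order are described on the Feature Engineering page.

```python
# Hypothetical tier order and column names, for illustration only.
FALLBACK_TIERS = (
    "speed_limit_osm",            # value tagged directly in OSM
    "speed_limit_imputed",        # value imputed from similar links
    "speed_limit_class_default",  # default for the road class
)


def effective_speed_limit(link):
    """Return (value, source_tier) from the first tier with a usable value."""
    for tier in FALLBACK_TIERS:
        value = link.get(tier)
        if value is not None:
            return value, tier
    return None, None


link = {"speed_limit_osm": None, "speed_limit_imputed": 30,
        "speed_limit_class_default": 60}
print(effective_speed_limit(link))
```

Returning the tier alongside the value keeps a record of how each link's speed limit was obtained, which helps when auditing imputed fields.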

Empirical Bayes Shrinkage — the post-modelling adjustment that handles overconfidence on links with zero observed collisions. Explains the choice of shrinkage parameter k and the operational impact on rankings.
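To make the role of k concrete, here is a minimal sketch of one common textbook form of Empirical Bayes shrinkage, in which the estimate is a weighted blend of the model prediction and the observed count. The weight formula w = k / (k + predicted) is an assumption for illustration; the project's exact definition of k is given on the Empirical Bayes Shrinkage page and may differ.

```python
def eb_shrink(observed, predicted, k):
    """Blend an observed count with a model prediction.

    Assumed form: w = k / (k + predicted) is the weight on the prediction,
    so a larger k pulls the estimate further towards the model.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if predicted < 0:
        raise ValueError("predicted must be non-negative")
    w = k / (k + predicted)
    return w * predicted + (1.0 - w) * observed


# A link with zero observed collisions keeps a non-zero risk estimate.
print(eb_shrink(observed=0, predicted=2.0, k=2.0))
# A link with many observed collisions is pulled back towards the model.
print(eb_shrink(observed=6, predicted=2.0, k=2.0))
```

Under this form the estimate always lies between the observed count and the prediction, so rankings change most for links where the two disagree strongly.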

Methodology vs Models

A few decisions sit on the boundary between sections:

  • Year handling is a methodology question (how to encode time) but interacts directly with model choice. The pipeline encodes year as a categorical feature, so the COVID years (2020–2021) are absorbed by their own year-specific fixed effects. A separate is_covid flag would be redundant alongside them; it is the natural alternative only when year is not already in the model.
  • Group-aware cross-validation is a model-choice decision but depends on methodology — the grouping key (count_point_id, site_id, link_id) is determined by what the joining pipeline produces.
  • Forbidden post-event columns — several derived columns are excluded from modelling because they are computed from collision records themselves. This is enforced in the modelling code but the list is documented as a methodology decision.
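Two of the boundary decisions above lend themselves to a short sketch: keeping every row of a group in the same cross-validation fold, and refusing to train on post-event columns. The code below is illustrative only. It assumes link_id as the grouping key, uses a hash-based fold assignment as a stand-in for whatever splitter the modelling code actually uses, and the forbidden column names are placeholders.

```python
import zlib

N_FOLDS = 5

# Placeholder names; the real exclusion list lives in the modelling code.
FORBIDDEN_COLUMNS = {"hypothetical_post_event_a", "hypothetical_post_event_b"}


def fold_for(group_key, n_folds=N_FOLDS):
    """Deterministically map a group key to a fold index."""
    return zlib.crc32(str(group_key).encode("utf-8")) % n_folds


def assert_no_leakage(feature_names):
    """Raise if any post-event column is present in the feature set."""
    leaked = sorted(FORBIDDEN_COLUMNS & set(feature_names))
    if leaked:
        raise ValueError(f"post-event columns in feature set: {leaked}")


rows = [{"link_id": "A", "year": y} for y in (2019, 2020, 2021)]
rows.append({"link_id": "B", "year": 2019})

# All years of a link share one fold, so no link appears on both
# sides of a train/validation split.
folds_by_link = {}
for row in rows:
    folds_by_link.setdefault(row["link_id"], set()).add(fold_for(row["link_id"]))
print({link: len(folds) for link, folds in folds_by_link.items()})

assert_no_leakage(["year", "link_length_m"])  # passes silently
```

Because the fold depends only on the group key, the assignment is stable across runs and across years, which is what makes the split group-aware.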

Each of these cross-cutting items is documented on the page where it has the most impact, with cross-references running in both directions.

What is not in Methodology

  • Individual model details (architectures, hyperparameters, metrics) are in Models.
  • Exploratory analysis of source data is in Exploratory Data Analysis — what the raw data looks like before processing.
  • Source dataset descriptions are in Data Sources — provider, format, fields, project location.
