Methodology
What this section covers
The Methodology section documents the decisions and processes that turn raw data sources into model-ready features. It is distinct from the Models section, which covers the predictive models themselves and their performance.
A useful way to think of the split:
- Methodology: how the data is shaped — joining sources together, building features, handling overdispersion in count data.
- Models: what predictions are made from that data — traffic estimation, time-zone profiles, collision risk.
The two sections sit on either side of a clear handoff. Methodology produces road_link_annual.parquet with one row per link per year and a fixed schema. Models consume that table and produce predictions.
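The handoff contract lends itself to a mechanical check. The sketch below is illustrative only: it assumes the key columns are called link_id and year, uses a made-up EXPECTED_COLUMNS set in place of the real schema, and works on rows already loaded into dictionaries (for example from road_link_annual.parquet) so that it needs nothing beyond the standard library.

```python
from collections import Counter

# Hypothetical schema for illustration; the real column list is defined by the pipeline.
EXPECTED_COLUMNS = {"link_id", "year", "collision_count"}


def check_handoff(rows, expected_columns=EXPECTED_COLUMNS):
    """Return a list of problems found in a link-by-year table.

    `rows` is an iterable of dicts, one dict per table row.
    """
    problems = []
    key_counts = Counter()
    for i, row in enumerate(rows):
        # Fixed schema: every row must carry exactly the expected columns.
        if set(row) != set(expected_columns):
            problems.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        key_counts[(row["link_id"], row["year"])] += 1
    # One row per link per year: no (link_id, year) key may repeat.
    for key, n in sorted(key_counts.items()):
        if n > 1:
            problems.append(f"duplicate key {key}: {n} rows")
    return problems


rows = [
    {"link_id": "A", "year": 2019, "collision_count": 2},
    {"link_id": "A", "year": 2020, "collision_count": 0},
    {"link_id": "A", "year": 2020, "collision_count": 1},  # breaks the contract
]
print(check_handoff(rows))
```

An empty list means both conditions hold. Running a check like this at the end of Methodology and again at the start of Models would keep the boundary between the two sections explicit.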
Pages in this section
Joining the Datasets — how STATS19 collisions, AADF count points, WebTRIS sensors, and OS Open Roads links are spatially joined into a single road-link-by-year table. Includes the four-dimension weighted snap, distance caps, and quality tracking.
Feature Engineering — link-level features built on top of the joined table: network centrality, road geometry, OSM attributes, population density. Documents the imputation rules for sparse OSM fields and the speed_limit_mph_effective tiered fallback.
Empirical Bayes Shrinkage — the post-modelling adjustment that handles overconfidence on links with zero observed collisions. Explains the choice of shrinkage parameter k and the operational impact on rankings.
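To make the role of k concrete, here is a sketch of one common shrinkage form. It is not necessarily the form used on the Empirical Bayes Shrinkage page. It assumes the estimate is a weighted blend of a link's observed collision rate and a prior rate, with weight n / (n + k), where n is the number of observed link-years. The function name and all inputs are hypothetical.

```python
def shrink(observed_rate, prior_rate, n_obs, k):
    """Blend an observed rate with a prior rate.

    Assumed form: weight = n_obs / (n_obs + k). A larger k pulls the
    estimate harder towards the prior.
    """
    if n_obs < 0 or k <= 0:
        raise ValueError("n_obs must be >= 0 and k must be > 0")
    weight = n_obs / (n_obs + k)  # how much to trust the link's own data
    return weight * observed_rate + (1 - weight) * prior_rate


# Illustrative numbers only: zero observed collisions over 3 years,
# with a prior rate of 0.4 collisions per year.
for k in (1, 5, 20):
    print(k, shrink(0.0, 0.4, n_obs=3, k=k))
```

Under this assumed form, a link with zero observed collisions never gets an estimate of exactly zero, and the estimate moves closer to the prior as k grows.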
Methodology vs Models
A few decisions sit on the boundary between sections:
- Year handling is a methodology question (how to encode time) but interacts directly with model choice. The pipeline uses year as a categorical feature; the COVID years (2020–2021) are absorbed by year-specific fixed effects. An is_covid flag is the natural alternative if year is not already in the model.
- Group-aware cross-validation is a model-choice decision but depends on methodology: the grouping key (count_point_id, site_id, link_id) is determined by what the joining pipeline produces.
- Forbidden post-event columns: several derived columns are excluded from modelling because they are computed from collision records themselves. This is enforced in the modelling code, but the list is documented as a methodology decision.
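The three boundary decisions above can be sketched in a few lines. This is a simplified illustration, not the project's modelling code: the forbidden column name is invented, the round-robin fold assignment is a stand-in for a proper group-aware splitter (scikit-learn's GroupKFold is one established option), and rows are plain dictionaries.

```python
from collections import defaultdict

# Hypothetical name; the real forbidden list is documented on the methodology pages.
FORBIDDEN_COLUMNS = {"collision_severity_mean"}
COVID_YEARS = {2020, 2021}


def add_is_covid(row):
    """Alternative to year fixed effects: a single COVID indicator."""
    return {**row, "is_covid": row["year"] in COVID_YEARS}


def drop_forbidden(row, forbidden=FORBIDDEN_COLUMNS):
    """Remove post-event columns before a row reaches a model."""
    return {name: value for name, value in row.items() if name not in forbidden}


def group_folds(rows, group_key, n_folds):
    """Assign whole groups to folds so no group is split across folds."""
    groups = sorted({row[group_key] for row in rows})
    fold_of = {group: i % n_folds for i, group in enumerate(groups)}
    folds = defaultdict(list)
    for index, row in enumerate(rows):
        folds[fold_of[row[group_key]]].append(index)
    return [folds[i] for i in range(n_folds)]


rows = [
    {"link_id": "A", "year": 2019, "collision_severity_mean": 1.5},
    {"link_id": "A", "year": 2020, "collision_severity_mean": 2.0},
    {"link_id": "B", "year": 2019, "collision_severity_mean": 1.0},
    {"link_id": "B", "year": 2021, "collision_severity_mean": 3.0},
]
print(group_folds(rows, "link_id", n_folds=2))
print(drop_forbidden(add_is_covid(rows[1])))
```

Because folds are built from the grouping key rather than from individual rows, every year of a given link lands in the same fold. Swapping "link_id" for count_point_id or site_id changes the grouping without touching the rest of the code.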
Each of these cross-cutting items is documented on the page where it has the most impact, with cross-references in both directions.
What is not in Methodology
- Individual model details (architectures, hyperparameters, metrics) are in Models.
- Exploratory analysis of source data is in Exploratory Data Analysis — what the raw data looks like before processing.
- Source dataset descriptions are in Data Sources — provider, format, fields, project location.