Feature Engineering and Model Pipeline
1 Why this matters
The model pipeline makes several design choices that are load-bearing but not obvious from reading any single function. This page documents what goes in, what comes out, and why — in particular the decisions that keep the model from quietly producing biased or nonsensical risk scores.
The collision rate modelling problem has three awkward properties:
- Zero-collision links are genuine low-risk observations — not missing data. An earlier pipeline (the now-deprecated `features.py`) silently filtered to collision-positive links only, producing a biased training table that the model couldn't recover from.
- Traffic exposure is sparse — AADF directly measures count points, but the expanded OS Open Roads network has over 2 million links. A per-link exposure denominator has to be estimated before rate modelling is possible.
- Collisions per link-year are rare events — most link-years have zero collisions. Standard regression assumptions break; Poisson with log-offset is the correct framing, but has its own engineering issues at scale.
2 What this page answers
- What are the pipeline stages, and what does each produce?
- How is AADT estimated for links without a nearby count point?
- How is the collision rate model specified, and why Poisson?
- Which features actually enter the model, and which are diagnostic only?
- What are the main engineering trade-offs and their consequences?
3 Pipeline overview
Three stages, invoked via `python -m road_risk.model --stage <name>`:

| Stage | File | Input | Output |
|---|---|---|---|
| 1a — AADT estimation | `model/aadt.py` | AADF count points | `aadt_estimates.parquet` (all links × years) |
| 1b — Time-zone profiles | `model/timezone_profile.py` | WebTRIS + Stage 1a AADT | `timezone_profiles.parquet` (context / future exposure weighting) |
| 2 — Collision model | `model/collision.py` | Stage 1a + `road_link_annual.parquet` | `risk_scores.parquet` (one row per link) |
Stage 1b is currently diagnostic/contextual rather than part of the production collision score. The production collision model consumes Stage 1a total AADT as its exposure denominator.
3.1 A note on features.py
The file road_risk/features.py self-deprecates on import. Its feature builders were correct in principle but it joined only collision-positive link-years, producing a biased training table. The current pipeline builds its own feature table inside collision.py — see below.
4 Stage 1a: AADT estimation
4.1 The problem
AADF provides measured flows at count points across the expanded study area, but OS Open Roads contains over 2 million links. The join cap of 2 km (see Joining the Datasets) leaves many links without a direct AADF match. A rate model needs an exposure denominator for every link — otherwise uncounted-road collisions simply drop out.
4.2 Approach
A gradient boosting regressor (HistGradientBoostingRegressor) is trained on AADF count points and applied to every OS Open Roads link.
Target: year-de-meaned log1p(all_motor_vehicles). The model learns each road’s deviation from the network-wide mean for that year; the year mean is stored separately and added back at inference. This prevents a constant per-year feature from dominating the model while still preserving temporal level shifts.
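The de-mean-and-restore round trip can be sketched as follows. This is a minimal illustration, not the code in `model/aadt.py`: `demean_log_target` is a hypothetical helper, and the column names `all_motor_vehicles` and `year` follow the AADF table.

```python
import numpy as np
import pandas as pd

def demean_log_target(df: pd.DataFrame):
    """Year-de-mean log1p flows: return the residual target the model
    learns, plus the per-year means stored for inference."""
    log_flow = np.log1p(df["all_motor_vehicles"])
    year_means = log_flow.groupby(df["year"]).mean()
    residual = log_flow - df["year"].map(year_means)
    return residual, year_means

# At inference: predict the residual, add the stored year mean back,
# then invert log1p to recover AADT on the original scale:
#   aadt_hat = np.expm1(pred_residual + year_means.loc[year])
```

Because the per-year mean is added back before `expm1`, temporal level shifts (e.g. the COVID dip) survive the transform even though the model itself never sees a raw per-year level feature.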
Features used for training:
| Feature | Source | Type |
|---|---|---|
| `road_class_ord` | AADF `road_name` prefix | Ordinal: Motorway=6, A=5, B=4, other=1 |
| `is_trunk` | AADF `road_name` regex match | Binary |
| `latitude`, `longitude` | AADF | Coordinates (tree-based model can use raw) |
| `is_covid` | year ∈ {2020, 2021} | Binary |
| `hgv_proportion` | AADF | Float |
| `link_length_km` | AADF (via MRDB) | Float, median-imputed per road class |
| Network features | `network_features.parquet` | See below; joined when available |
WebTRIS-derived flow/profile features are intentionally excluded from Stage 1a: they are not available for every Open Roads link at inference time. Time-zone traffic shape is modelled separately in Stage 1b.
Network features (degree_mean, betweenness, dist_to_major_km, pop_density_per_km2, etc.) are joined by snapping each count point to its nearest OS Open Roads link via KD-tree (2 km cap). When present they materially improve CV R² — the train log at runtime reports how many count points had network features attached.
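The KD-tree snapping step can be sketched with SciPy. This is illustrative only: `snap_points_to_links` is a hypothetical name, and coordinates are assumed to be in a projected CRS measured in metres so that the 2 km cap is a Euclidean distance.

```python
import numpy as np
from scipy.spatial import cKDTree

def snap_points_to_links(point_xy, link_xy, cap_m=2000.0):
    """For each count point, return the index of the nearest link
    centroid within cap_m metres, or -1 when nothing is in range."""
    tree = cKDTree(link_xy)
    dist, idx = tree.query(point_xy, distance_upper_bound=cap_m)
    # cKDTree reports misses as dist=inf, idx=len(link_xy); map to -1
    idx = np.where(np.isinf(dist), -1, idx)
    return idx, dist
```

Count points that snap to no link within the cap simply carry no network features; the train log's "count points with network features attached" figure is the complement of those misses.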
4.3 Network feature definitions
network_features.py turns OS Open Roads into a NetworkX graph where road links are edges, road-link endpoints are nodes, and link length is the shortest path weight. These features are intended to capture network position rather than raw latitude/longitude.
| Feature | Definition | Interpretation | Used by |
|---|---|---|---|
| `degree_mean` | Mean graph degree of the link's start and end nodes | Local junction complexity; higher values usually mean denser urban or interchange structure | Stage 1a, Stage 2 |
| `betweenness` | Approximate node betweenness centrality, averaged across the link's start and end nodes | Through-route importance; higher values mean the link sits near many shortest paths through the road graph | Stage 1a, Stage 2 |
| `betweenness_relative` | `log1p(betweenness / mean betweenness for the same road_classification)` | Within-road-class centrality; distinguishes unusually central A-roads from ordinary A-roads, rather than just rediscovering road class | Stage 1a, Stage 2 |
| `dist_to_major_km` | Graph distance from the link to the nearest Motorway or A-road node | Feeder-network position; lower values mean closer to the major-road network | Stage 1a, Stage 2 |
| `pop_density_per_km2` | LSOA population density joined to the link centroid | Local demand / urbanisation proxy | Stage 1a, Stage 2 |
| `ruc_class` | ONS 2021 Rural-Urban Classification code joined via the same nearest-LSOA-centroid assignment | Categorical settlement-context feature at LSOA grain | OSM imputation / Stage 2 context (pending) |
| `ruc_urban_rural` | Top-level Urban/Rural split derived from `ruc_class` | Coarser settlement-context feature for simpler modelling or reporting | OSM imputation / Stage 2 context (pending) |
| `speed_limit_mph` | Raw OSM-tagged speed limit only | Sparse direct observation of signed / mapped speed limits | Retained as provenance; not modelled in current Stage 2 |
| `speed_limit_mph_effective` | `speed_limit_mph` when present, else UK legal-default lookup by `road_classification` × `ruc_urban_rural` | Fuller-coverage speed-limit proxy | Stage 2 model feature |
| `speed_limit_source` | Audit label for whether the effective speed came from OSM or a lookup tier | Per-link provenance for speed-limit assignment | Audit only |
The current Stage 1a AADT path uses the numeric network/context features above, but not the RUC columns. Those are listed here because they are generated in the broader link-level feature stack and are intended for downstream use rather than the current AADT estimator.
Raw betweenness is heavily right-skewed and strongly related to road class: motorways and A-roads are central partly by definition. betweenness_relative therefore exists as the safer modelling version when the question is “is this link unusually central compared with its peers?” rather than “is this road a major road?”.
4.4 Rural-Urban Classification (RUC)
ruc_class and ruc_urban_rural come from the ONS 2021 Rural-Urban Classification at LSOA grain for England and Wales only. The link-level assignment inherits the existing pop_density_per_km2 pipeline: each road link is matched to the nearest LSOA centroid within the same 2 km cap, then the RUC code is joined by LSOA21CD.
This is a deliberate decadal-stability choice. The modelling window runs from 2015 to 2024, but RUC is treated as structurally slow-moving, so a single 2021 classification avoids an artificial boundary discontinuity at 2021 while still capturing broad urban/rural context across the full period.
The inherited limitation is that nearest-centroid assignment is an approximation, not a polygon-in-polygon spatial overlay. A small number of links near LSOA boundaries may therefore be assigned to a neighbouring LSOA, with the main risk concentrated around urban/rural transition edges where classification changes across the boundary. A polygon containment join would be more accurate, but that is a separate infrastructure change and is not pursued in this pipeline version.
Because the ONS RUC source is E+W only, links that would map to Scottish or Northern Irish small-area codes have NaN for both ruc_class and ruc_urban_rural.
RUC 2021 coding. The file used is the 6-class 2021 variant. Codes are UN1, UF1, RSN1, RSF1, RLN1, RLF1, where U-prefix indicates Urban and R-prefix indicates Rural. The binary ruc_urban_rural is derived by prefix: U -> Urban, R -> Rural.
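The prefix derivation is trivial but worth pinning down, including the E+W-only null case. A minimal sketch (hypothetical helper name):

```python
def ruc_urban_rural(ruc_class):
    """Derive the binary Urban/Rural split from a 2021 RUC code prefix.
    Returns None for links without an E+W RUC code (e.g. Scottish links)."""
    if not isinstance(ruc_class, str) or not ruc_class:
        return None
    return "Urban" if ruc_class.startswith("U") else "Rural"
```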
Coverage. Across the current study area, 15.5% of links (~336k) have NaN for RUC. Investigation showed two mechanisms:
- Scottish border-area links (~3% of the network, concentrated above 55.0° N) cannot be assigned because ONS RUC is E+W only.
- The remaining nulls are long, sparse, edge-of-study-area rural links whose centroids fall more than 2 km from any E+W LSOA centroid. Missingness correlates strongly with link length (8% null for links under 100m, 66% null for links over 1km).
Interpreting Urban vs Rural. RUC classifies the LSOA a link falls into, not the character of the road itself. An Unclassified residential street in an urban LSOA is Urban even where it looks locally like a quiet side road; a village high street in a rural LSOA is Rural. In this study area, 82% of Unclassified roads are Urban because dense residential street networks in cities dominate the Unclassified link count. This is a feature of LSOA-grain classification, not a classification error.
4.5 Tiered speed-limit defaults
The network feature stack now keeps raw and lookup-filled speed-limit columns separate. speed_limit_mph remains the raw OSM-tagged-only field. speed_limit_mph_effective is the fuller-coverage version created by using raw OSM where present, then applying UK legal speed-limit defaults via road_classification × ruc_urban_rural where OSM is missing. The audit field speed_limit_source records which path each link took, and the boolean speed_limit_mph_imputed marks whether the effective value came from the lookup rather than a raw OSM tag.
The lookup hierarchy is deliberately simple and deterministic:
- Motorway = 70 mph
- A-road dual carriageway = 70 mph
- A-road trunk (non-dual) = 70 mph
- A-road single carriageway = 60 mph
- B-road Urban = 30 mph
- B-road Rural = 60 mph
- Minor roads (`Unclassified`, `Not Classified`, `Classified Unnumbered`, `Unknown`) Urban = 30 mph
- Minor roads (`Unclassified`, `Not Classified`, `Classified Unnumbered`, `Unknown`) Rural = 60 mph
This is a feature-engineering layer, not a claim that the inferred value is the observed signed limit on every road. The lookup supplies the legal default in the absence of an OSM tag, so it will miss local exceptions such as 20 mph zones, signed 40 mph urban arterials, or rural 50 mph sections. It therefore represents the legal maximum implied by road type and settlement context, not necessarily the true enforced limit on that segment.
No extra fallback is applied where ruc_urban_rural is missing. The broader RUC-missing tail is still about 15.5% of links, but only the subset with both missing raw OSM speed and missing ruc_urban_rural stays unresolved in speed_limit_mph_effective. In the current build that unresolved subset is about 189k links (8.7% of the network). Those links are labelled speed_limit_source = "null_no_ruc" and remain unresolved until a broader geography-compatible context layer is added.
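The tier logic above can be expressed as one deterministic function. This is a simplified sketch, not the exact code in `network_features.py`; the argument names are illustrative, and the real implementation operates on dataframe columns rather than scalars.

```python
def effective_speed_limit(osm_mph, road_class, is_dual, is_trunk, urban_rural):
    """Return (mph, source): raw OSM tag first, then UK legal defaults
    by road class and settlement context, else unresolved."""
    if osm_mph is not None:
        return osm_mph, "osm"
    if road_class == "Motorway":
        return 70, "lookup"
    if road_class == "A Road":
        # Dual carriageways and trunk A-roads default to 70, else 60;
        # no RUC context needed for A-roads in this tier table
        return (70, "lookup") if (is_dual or is_trunk) else (60, "lookup")
    if urban_rural is None:
        return None, "null_no_ruc"  # no extra fallback when RUC is missing
    # B roads and minor roads share the Urban=30 / Rural=60 split
    return (30, "lookup") if urban_rural == "Urban" else (60, "lookup")
```

Note that Motorway and A-road tiers resolve even when `ruc_urban_rural` is missing; only the B-road and minor-road tiers fall through to `null_no_ruc`.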
4.6 Validation: GroupKFold by count point
Using vanilla KFold would leak: the same count point appears in 2019, 2021, and 2023, and a random split could put the 2019 and 2021 readings of station X in the training fold and the 2023 reading in test. The model would then “predict” a station it has effectively memorised.
GroupKFold(n_splits=5) grouped by count_point_id ensures a station appears in exactly one fold. The reported CV R² is genuine out-of-sample performance.
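The grouping can be sketched directly with scikit-learn. An illustrative wrapper (the real validation lives in `model/aadt.py`):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def cv_folds_by_station(X, y, groups, n_splits=5):
    """Yield train/test index pairs where all of a count point's
    readings (every year) land in exactly one test fold."""
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=groups):
        # Invariant: no station appears on both sides of a split
        assert not set(groups[train_idx]) & set(groups[test_idx])
        yield train_idx, test_idx
```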
4.7 Applying the estimator
The trained model predicts AADT for every OS Open Roads link × every AADF year. Prediction uses link centroids for latitude/longitude; road_class_ord is derived from road_classification using the same ordinal space as training (Motorway=6, A Road=5, B Road=4, other=1); hgv_proportion and link_length_km are imputed from AADF medians when not available. HistGradientBoostingRegressor receives missing network features as NaN rather than zero so its native missing-value routing is preserved.
The output has link_id × year × estimated_aadt — no link is dropped.
Stage 1a produces point estimates with CV MAE reported in log-units. Stage 2 treats these estimates as exact exposure via the log-offset. AADT uncertainty therefore contributes silently to Stage 2 residuals but is not represented in model uncertainty or confidence intervals. This is a standard simplification but worth knowing if you’re interpreting residuals closely.
5 Stage 1b: Time-zone profiles
Builds daily time-zone fractions from WebTRIS cumulative 12/16/18/24-hour windows, then combines those fractions with Stage 1a AADT estimates to produce per-hour flow profiles. Current outputs are used for EDA and future temporal weighting, not the production collision model.
6 Stage 2: Collision model
This is where most of the pipeline’s design choices live.
6.1 Building the training table
build_collision_dataset() constructs a table at link_id × year grain covering every link × every year in the AADT output. Collision counts are left-joined from road_link_annual.parquet and filled with 0 where there’s no match.
This is the fix for the features.py bias: zero-collision links are kept, and are treated as genuine low-risk observations rather than absence of data.
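The join-and-fill step might look like this. A minimal sketch of the idea, not the actual `build_collision_dataset()` body:

```python
import pandas as pd

def build_collision_table(aadt: pd.DataFrame, annual: pd.DataFrame) -> pd.DataFrame:
    """Left-join collision counts onto the full link x year AADT grid,
    so zero-collision link-years survive as explicit zeros."""
    out = aadt.merge(
        annual[["link_id", "year", "collision_count"]],
        on=["link_id", "year"], how="left",
    )
    # No match in the annual table means zero observed collisions,
    # not missing data
    out["collision_count"] = out["collision_count"].fillna(0).astype(int)
    return out
```

The key property is that the row count of the output equals the row count of the AADT grid: no link-year is dropped by the join.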
Every row has:
- Exposure: `estimated_aadt`, `link_length_km`, `log_offset = log(AADT × length × 365 / 1e6)`
- Road structure: `road_class_ord`, `form_of_way_ord`, and binary flags for motorway / A-road / slip road / roundabout / dual carriageway / trunk / primary
- Link length: `log_link_length_km`
- Temporal: `is_covid`, `year_norm`
- Network (when available): `degree_mean`, `betweenness`, `dist_to_major_km`, `pop_density_per_km2`, `speed_limit_mph_effective`, `lanes`, `is_unpaved`; raw `speed_limit_mph` is retained as provenance
- Targets: `collision_count`, `fatal_count`, `serious_count`
6.2 Post-event diagnostics are excluded from Stage 2 features
road_link_annual.parquet also contains contextual aggregates derived from snapped collision records, such as pct_dark, pct_urban, pct_junction, pct_near_crossing, and mean_speed_limit. These are post-event diagnostics, not pre-collision road attributes: they describe the circumstances of observed collisions on a link.
Stage 2 now treats these columns as forbidden model inputs. They are excluded from the training dataframe and from risk_scores.parquet, and collision.py asserts that they cannot enter either the GLM or XGBoost feature list. This is a provenance guard rather than a fix for an active model leak: the current feature lists did not use these columns, but keeping them in the modelling table made future leakage too easy.
6.3 Why Poisson with log-offset
Collision counts are non-negative integers with many zeros. Modelling the count with a Gaussian regression would violate both properties: it permits negative predicted counts, and its symmetric-residual assumption cannot represent a distribution dominated by zeros.
The Poisson formulation is:
\[ \text{collision\_count} \sim \text{Poisson}(\lambda) \] \[ \log \lambda = \beta_0 + X\beta + \underbrace{\log(\text{AADT} \cdot \text{length} \cdot 365 \cdot 10^{-6})}_{\text{offset}} \]
The log-offset forces the coefficient on exposure to 1, so the model effectively learns log(rate) = β₀ + Xβ — a rate model expressed as a count regression. Without the offset, the model would learn counts directly and systematically overestimate risk on high-traffic links.
6.4 Two models trained in parallel
Both are fit to the same table but differ in what they optimise and what they can represent.
Poisson GLM (statsmodels) — linear-in-log-rate, interpretable coefficients, standard errors, p-values. Used for diagnostic residual analysis (residual_glm) but not for the final risk ranking, because GLM zero-downsampling biases the intercept and makes absolute predictions unreliable.
XGBoost Poisson — captures non-linearities and interactions. Trained on the full dataset (no downsampling) and drives the final risk_percentile ranking via predicted_xgb. Comparing predicted_glm and predicted_xgb surfaces links where the GLM linear assumption breaks down.
6.5 Engineering detail: GLM zero-downsampling
The full training table is ~18M rows. statsmodels builds a dense design matrix — 18M rows × 15 features × float64 ≈ 2 GB+, which OOMs on many machines.
The GLM training set downsamples zero-collision rows to 10× the number of positive rows. All ~391k positives are kept; ~3.9M zeros are sampled. This keeps the Poisson coefficient estimates stable while making the design matrix tractable.
Consequence: the GLM intercept is biased upward — the effective zero rate in training (~91% zeros instead of ~98%) implies a higher baseline rate than the true population. Feature coefficients are unaffected, so relative risk rankings are correct, but absolute predicted counts inherit the bias. XGBoost uses the full dataset and does not have this issue.
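The downsampling rule itself is a few lines of pandas. A sketch; the real sampler lives in `collision.py` and may differ in details such as seeding:

```python
import pandas as pd

def downsample_zeros(df: pd.DataFrame, ratio: int = 10, seed: int = 0) -> pd.DataFrame:
    """Keep every collision-positive row; sample zero-collision rows
    at ratio x the number of positives to bound the design matrix."""
    pos = df[df["collision_count"] > 0]
    zeros = df[df["collision_count"] == 0]
    n_zeros = min(len(zeros), ratio * len(pos))
    return pd.concat([pos, zeros.sample(n=n_zeros, random_state=seed)])
```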
6.6 Engineering detail: XGBoost base_margin
XGBoost’s Poisson objective learns log(predicted) directly. To pass the log-offset, the training call uses base_margin=log_offset, which sets the initial prediction in log-space so the model learns rate conditional on exposure rather than absolute count.
Without base_margin, XGBoost would learn high-traffic motorway links have more collisions on average (trivially true by exposure) and over-predict their risk. With it, the model predicts collisions per unit vehicle-km and learns the residual risk after exposure is accounted for.
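The exposure algebra is worth making concrete. A small numeric illustration (not the actual training call) of why `base_margin = log_offset` turns the learned score into a log-rate:

```python
import numpy as np

# XGBoost's Poisson objective predicts exp(base_margin + F(x)), where F(x)
# is the learned ensemble score. With base_margin = log_offset, F(x) is a
# log-rate, and exposure multiplies back in automatically at prediction time.
log_offset = np.log(np.array([10.0, 100.0]))  # million vehicle-km per link-year
learned_log_rate = np.array([-2.0, -2.0])     # identical underlying risk rate
predicted_counts = np.exp(log_offset + learned_log_rate)
# Same rate, 10x the exposure -> 10x the predicted count; the model never
# has to spend splits re-learning the exposure relationship.
```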
6.7 Feature imputation
Network features have varying coverage. The GLM applies a coverage-based rule:
- > 50% coverage — include as-is, drop rows with NaN.
- 5–50% coverage — median-impute into a new `{col}_imputed` feature.
- < 5% coverage — drop the feature entirely.
This preserves rows with partial network data at the cost of some coefficient interpretability on imputed columns.
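The rule can be sketched as an illustrative helper (thresholds from the list above; the real selection logic in `collision.py` may differ in detail):

```python
import pandas as pd

def apply_coverage_rule(df: pd.DataFrame, cols: list) -> list:
    """Select/transform optional features by non-null coverage:
    >50% keep as-is, 5-50% median-impute into {col}_imputed, <5% drop."""
    kept = []
    for col in cols:
        coverage = df[col].notna().mean()
        if coverage > 0.5:
            kept.append(col)                      # used raw; NaN rows dropped later
        elif coverage >= 0.05:
            df[f"{col}_imputed"] = df[col].fillna(df[col].median())
            kept.append(f"{col}_imputed")
        # < 5% coverage: feature dropped entirely
    return kept
```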
7 Output: pooled risk scores
score_and_save() applies both models to the full dataset, then pools predictions to one row per link.
7.1 Pooling logic
For each link_id, aggregate across years:
| Field | Aggregation |
|---|---|
| `collision_count` | sum |
| `fatal_count`, `serious_count` | sum |
| `estimated_aadt` | mean |
| `predicted_glm`, `predicted_xgb` | mean |
| Road metadata | first |
The key derived columns:
- `residual_glm = collision_count - predicted_glm × n_years` — observed minus expected (diagnostic; the GLM intercept is biased by downsampling, so use for relative comparison only).
- `risk_percentile = rank(mean predicted_xgb) × 100 / n_links` — a single stable ranking across the network, driven by XGBoost predictions.
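The pooling plus percentile step is a straightforward groupby. A sketch only; the real `score_and_save()` also carries road metadata, GLM predictions, and severity counts:

```python
import pandas as pd

def pool_scores(scored: pd.DataFrame) -> pd.DataFrame:
    """Aggregate link x year predictions to one row per link,
    then rank the pooled XGBoost prediction into percentiles."""
    pooled = scored.groupby("link_id").agg(
        collision_count=("collision_count", "sum"),
        predicted_xgb=("predicted_xgb", "mean"),
    ).reset_index()
    pooled["risk_percentile"] = pooled["predicted_xgb"].rank(pct=True) * 100
    return pooled
```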
7.2 Why pool instead of score per year
The app consuming these scores shouldn’t ask “which year?” — a road link that’s consistently risky is risky. Percentile is computed over the mean predicted rate, which is stable across small year-to-year fluctuations. Temporal features (is_covid, year_norm) remain in training so the model learns the year effect, but the output doesn’t expose it.
7.3 Residuals as under-/over-performance
residual_glm is the signed excess collisions a link has relative to the model’s prediction given its road type, exposure, and network characteristics. Positive → worse than expected, negative → better than expected. This is the concrete form of the “underperforming roads” research question from the project overview.
Residuals are a statistical summary, not a causal diagnosis. A positive residual means “this link has more collisions than similar links” — it does not identify why, and does not mean any specific intervention would reduce collisions.
8 Feature summary
Condensed view of which columns reach the model and how they’re used.
| Category | Features | Used by | Notes |
|---|---|---|---|
| Exposure (via log-offset) | `estimated_aadt`, `link_length_km` | GLM + XGB | Combined into `log_offset`; coefficient forced to 1 |
| Road structure | `road_class_ord`, `form_of_way_ord`, 7 binary flags | GLM + XGB | Ordinals + flags for interaction capture |
| Link length | `log_link_length_km` | GLM + XGB | Captures non-exposure length effects |
| Temporal | `is_covid`, `year_norm` | GLM + XGB | Training-only; pooled away in scoring |
| Network/context | `degree_mean`, `betweenness`, `dist_to_major_km`, `pop_density_per_km2`, `speed_limit_mph_effective`, `lanes`, `is_unpaved`, IMD deciles, `mean_grade` | GLM + XGB | OSM/LSOA/terrain-enriched where present; GLM optional features use median imputation with missingness flags where needed |
| AADT raw | `estimated_aadt` | XGB only | Also in offset; XGB can use both |
9 Known limitations and tradeoffs
9.1 XGBoost validation uses GroupShuffleSplit by link
GroupShuffleSplit(by=link_id, test_size=0.2) ensures no link appears in both training and test sets across years. This gives honest generalisation performance — a link seen in 2019 cannot leak into the 2021 test fold. Reported pseudo-R² reflects out-of-link performance, not in-sample fit.
9.2 Pseudo-R² comparability between GLM and XGBoost
The current GLM pseudo-R² (0.347) and post-fix XGBoost pseudo-R² (~0.323) are computed on different row subsets with different null models. They are not directly comparable.
| | GLM | XGBoost |
|---|---|---|
| Evaluation set | In-sample (downsampled training table) | Out-of-sample (held-out 20% of links) |
| Rows | ~4.30M (all positives + 1:10 zero sample) | ~4.34M (true link-year distribution) |
| Zero rate in eval set | ~91% (downsampled) | ~98% (real distribution) |
| Null model | Intercept-only Poisson on downsampled set | `y_test.mean()` on the real test set |
The current Stage 2 metrics reflect the state after the counted-only Stage 1a AADF filter, the Stage 2 post-event provenance guard, OSM network feature enrichment, tiered speed-limit imputation, IMD deprivation features, terrain grade, and the GLM optional-feature imputation refactor. These were shipped to improve methodological correctness and feature coverage rather than to optimise pseudo-R², and attribution of changes across them was not formally ablated. Future feature and modelling work should be evaluated against a saved baseline using the rank-stability harness where possible.
Because the GLM null model sees an artificially balanced zero rate, it is easier to beat — the true 98% zero rate would produce a higher null deviance and a lower pseudo-R² for the same model. Because XGBoost is evaluated on the real held-out distribution against that harder null, the post-fix ~0.323 figure is the current honest baseline. Earlier repo docs cited ~0.86, but that came from a pre-fix evaluation surface that was later superseded after a feature-table leakage diagnosis.
The practical implication: the 0.347 vs ~0.323 gap should not be read as a simple model horse race either, because the evaluation surfaces still differ. A fair comparison would require both models evaluated on the same held-out set against the same null model. This has not been done; recomputing both on a common evaluation set is a known pending improvement.
9.3 GLM intercept bias
Zero-downsampling shifts the baseline rate upward. Absolute predicted counts from the GLM should not be taken at face value; percentiles and relative comparisons are the intended use.
9.4 Stage 1a uncertainty is ignored
AADT estimates feed Stage 2 as exact exposure. AADT validation MAE is reported in log-units by the traffic stage. This is not propagated, so confidence intervals on predicted_glm understate total uncertainty.
9.5 No inter-model ensembling or disagreement analysis
Both models are trained; XGBoost drives the final ranking, GLM is diagnostic. A future extension could flag links where the two disagree substantially — those are candidates for manual review or ensemble weighting.
9.6 Rare-event instability
Fatal-only collision modelling is not implemented as a separate target. fatal_count and serious_count are summed in the output but the primary model is on collision_count (all severities). Severity-specific risk would need a separate model — counts are small enough that regularisation or a hurdle model may be required.
9.7 OSM and network feature availability
The current Stage 2 run is OSM-enriched: speed_limit_mph_effective is included in both the GLM and XGBoost feature lists, while raw speed_limit_mph is retained in the scored output as provenance. lanes and is_unpaved are used by XGBoost and median-imputed for the GLM. Coverage remains class-conditional and uneven, but the speed-limit coverage edge case described below has been resolved by tiered imputation.
9.8 GLM training-set size under OSM coverage
The GLM feature-selection rule uses any column with >50% coverage directly, then applies .dropna() to the assembled feature matrix. The previous raw OSM speed_limit_mph feature had 56.4% coverage, barely cleared that threshold, and reduced the GLM training table to 10,892,180 complete-case link-years. Replacing it with speed_limit_mph_effective resolved that threshold issue: the current retrain uses 18,302,830 complete-case link-years before GLM zero-downsampling, then 3,967,414 rows after the 1:10 positive-to-zero sample. The remaining complete-case loss is dominated by other network/context coverage, especially the RUC-related population/context tail.
XGBoost is unaffected by this GLM threshold because train_collision_xgb() fills missing values with 0 rather than dropping rows. The Stage 2 base table remains 21,675,570 link-years, with the current grouped split yielding xgb.n_train = 17,340,450 (80% of the full table). The production ranking therefore still covers all 2,167,557 links.
The effective-speed retrain left the final rank surface stable relative to the raw-speed baseline: Spearman rank correlation across all links was 0.9962, and the top-1% Jaccard overlap was 0.9512 (21,134 shared links out of 22,218 in the union). The feature swap changed local ordering but did not materially reshuffle the highest-risk set.
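The rank-stability check described here can be reproduced with a short helper (hypothetical name, built from SciPy's `spearmanr` plus a set-overlap Jaccard on the top fraction):

```python
import numpy as np
from scipy.stats import spearmanr

def rank_stability(scores_a, scores_b, top_frac=0.01):
    """Spearman rank correlation across all links, plus Jaccard overlap
    of the top-frac highest-scoring sets under each scoring."""
    rho = spearmanr(scores_a, scores_b).correlation
    k = max(1, int(len(scores_a) * top_frac))
    top_a = set(np.argsort(scores_a)[-k:])
    top_b = set(np.argsort(scores_b)[-k:])
    jaccard = len(top_a & top_b) / len(top_a | top_b)
    return rho, jaccard
```

Spearman answers "did the overall ordering survive?" while the top-1% Jaccard answers the operationally sharper question "is the highest-risk set still the same links?".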
The pipeline also reduces feature count if network_features.parquet is missing, producing lower-quality scores without an obvious downstream signal. The train log reports this, but consumers of risk_scores.parquet should check data/models/collision_metrics.json for the actual feature list used by a run.
10 Running the pipeline
```bash
# Run all stages
python -m road_risk.model --stage all

# Individual stages
python -m road_risk.model --stage traffic    # Stage 1a: AADT
python -m road_risk.model --stage profile    # Stage 1b: Time-zone profiles
python -m road_risk.model --stage collision  # Stage 2: Risk model
```

Stage 2 requires Stage 1a output (`aadt_estimates.parquet`) and will fail with a clear error if it's missing.
11 Next steps
risk_scores.parquet feeds into:
- Application layer — ranking, mapping, and exploratory UIs.
- Residual analysis — identifying links with large positive residuals for review and investigation.