Facility-Family Split Diagnostics

Diagnostic design for splitting Stage 2 XGBoost collision risk models by road facility family and comparing against the global model.

Status: Sessions 1 & 2 complete (April 25, 2026).
Scope: v1 facility-family split for Stage 2 XGBoost risk ranking.
Primary code path: src/road_risk/model/collision.py, with additive wrappers and diagnostics.
Reference model: the AASHTO Highway Safety Manual (HSM), whose safety performance functions are explicitly site-type specific.

0. Status and scope

This document specifies the v1 facility-family split for Stage 2 XGBoost collision-risk ranking. It is a design document only: no production scoring path, model artefact, or data table should change until this design has been reviewed and an implementation session is opened.

The v1 scope is deliberately narrow:

define the four facility families used for Stage 2 modelling;
train separate XGBoost Poisson models by family;
stitch per-family predictions into a network-wide ranking;
also expose within-family rankings;
evaluate against the existing global Stage 2 XGBoost baseline.

The GLM remains global in v1. Empirical Bayes shrinkage remains unchanged in v1. Per-family EB dispersion, per-family hyperparameter tuning, and hierarchical/partial-pooling models are deferred.

1. Motivation

Stage 2 currently trains one global XGBoost Poisson model across all road links. That model sees road class, form of way, trunk status, primary status, exposure, network centrality, population density, speed-limit, lane, and surface proxies, but it still has to learn one response surface across very different traffic environments.

There are two convergent reasons to split the model by facility family.

First, the HSM/FHWA safety-performance-function precedent is site-type specific. The motivation is not only that different road families have different baseline collision levels; it is that the exposure-to-risk curve can have a different shape across facility types. A motorway, a trunk A-road, an urban local road, and a rural minor road do not merely differ by intercept. They differ in access control, junction density, speed environment, pedestrian and cyclist exposure, maintenance regime, and the kinds of conflict that produce collisions.

Second, EB session 1 found independent evidence that one global dispersion summary is fragile. In reports/eb_dispersion.md, binned method-of-moments NB2 dispersion values vary by about 3,400x across the predicted-risk range, monotonically falling from low- to high-prediction bins. That finding does not prove a family split will help, but it is a strong warning that a single global mean/dispersion structure is leaving systematic structure on the table. quarto/methodology/empirical-bayes-shrinkage.qmd records this as a v2 motivation for per-family or per-bin k; facility-family modelling attacks the corresponding mean-model problem first.

The operational hypothesis is:

per-family XGBoost models improve calibration within families, especially on motorways, where the current global model has a known mean residual of about -3.3;
per-family predictions are easier to interpret operationally because each model is fitted within a more coherent road environment;
headline rank stability may improve modestly, but large improvement is not guaranteed because seed-induced churn near narrow top-k thresholds is a separate mechanism.

2. Family definitions

The v1 family definitions are taken from docs/internal/family-definition-rationale.md. That document contains the decision history and reasoning; this section summarises the frozen definitions rather than re-deriving them.

Family	Definition	Study-area links	Active links in `road_link_annual.parquet`
Motorway	`road_function == "Motorway"`	4,084	2,279
Trunk A-road	`road_function == "A Road"` and `is_trunk == True`	16,011	5,465
Other-Urban	neither of the above and `ruc_urban_rural == "Urban"`	1,366,925	177,296
Other-Rural	neither of the above and `ruc_urban_rural == "Rural"`	780,537	48,564
Total	exactly one family per link	2,167,557	233,604

The family assignment columns come from two existing sources:

road_function and is_trunk from OS Open Roads metadata, already used by build_collision_dataset();
ruc_urban_rural from post-fill data/features/network_features.parquet.

The recent RUC fill resolved the previous 335,692-link no-LSOA gap. About 336k links have ruc_imputed = True, but only 993 links in the rural-default fallback population appear in road_link_annual.parquet. Family assignment is therefore complete for all active modelled links, and the practical modelling impact of the rural default is small. The fill is documented in reports/ruc_fill.md and verified in reports/ruc_fill_verification.md.

form_of_way and is_primary stay as features within each family. They are not used to define v1 families.

3. Per-family modelling approach

3.1 Separate XGBoost models, one per family

v1 trains one XGBoost Poisson regressor per family. Each model is fitted on the link-year rows belonging to links in that family. The wrapper should reuse the current train_collision_xgb() path in src/road_risk/model/collision.py rather than creating a second training implementation.

The v1 XGBoost settings are the same as the current global model:

Parameter	v1 value
`objective`	`count:poisson`
`n_estimators`	500
`max_depth`	6
`learning_rate`	0.05
`subsample`	0.8
`colsample_bytree`	0.8
`n_jobs`	1

The exposure offset is unchanged:

log_offset = log(estimated_aadt * link_length_km * 365 / 1e6)
base_margin = log_offset
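As a worked illustration of the offset, the sketch below computes it for a single link. AADT is vehicles per day, so the quantity inside the log is annual exposure in million vehicle-kilometres. The helper name compute_log_offset is hypothetical and is not part of collision.py.

```python
import numpy as np


def compute_log_offset(estimated_aadt, link_length_km):
    """Log of annual exposure in million vehicle-km (hypothetical helper)."""
    aadt = np.asarray(estimated_aadt, dtype=float)
    length = np.asarray(link_length_km, dtype=float)
    exposure_mvkm = aadt * length * 365 / 1e6
    return np.log(exposure_mvkm)


# 10,000 vehicles/day on a 0.5 km link is 1.825 million vehicle-km per year.
log_offset = compute_log_offset([10_000], [0.5])
```

Passing this value as base_margin means the trees model the collision rate per unit exposure rather than the raw count.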

The train/test split is also unchanged in principle: use GroupShuffleSplit by link_id, but apply it within each family. All years for a link remain in one split. The seed controls both the split random_state and the XGBoost random_state, matching the current global rank-stability harness.
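A minimal sketch of the within-family grouped split, assuming scikit-learn's GroupShuffleSplit is applied to one family's link-year rows at a time. The helper name split_family_by_link and test_size=0.2 are illustrative assumptions; the held-out fraction used by the current pipeline is not stated here.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_family_by_link(family_df, seed=42, test_size=0.2):
    """Split link-year rows so each link_id lands on exactly one side.

    test_size=0.2 is an illustrative assumption, not a documented setting.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(
        splitter.split(family_df, groups=family_df["link_id"])
    )
    return family_df.iloc[train_idx], family_df.iloc[test_idx]


# Toy link-year table: 10 links x 3 years, all in one family.
toy = pd.DataFrame({
    "link_id": [f"L{i}" for i in range(10) for _ in range(3)],
    "year": [2019, 2020, 2021] * 10,
    "family": "motorway",
})
train_df, test_df = split_family_by_link(toy, seed=42)
```

Because the grouping key is link_id, no link contributes years to both the training and held-out sets.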

The feature list is the same as the current global train_collision_xgb() feature list. Some existing features will be constant or nearly constant in particular families. For example, is_motorway is constant in the motorway family and is_trunk is constant in the trunk A family. The assignment columns are constant wherever they define the family: road_function in the Motorway and Trunk A-road families, and ruc_urban_rural in Other-Urban and Other-Rural. This holds even where those columns are attached for diagnostics rather than used as raw XGBoost features. Carrying constant features is wasteful but simple: XGBoost will not split on zero-variance features. Per-family feature pruning is a v2 candidate if v1 works and the extra cleanup is worth the complexity.

The motorway family is small, about 4k links and roughly 40k link-years before held-out splitting. n_estimators=500 may overfit on that population. The 5-seed evaluation should surface this if it happens, especially as motorway-specific pseudo-R2 volatility or top-k instability. v2 candidates include reduced n_estimators, early stopping, or partial pooling across motorway and trunk A families.

3.2 What does NOT change

The v1 family split does not change:

GLM training, which remains global and unchanged from the current pipeline;
XGBoost hyperparameters, which are kept fixed across families;
the pseudo-R2 definition;
EB shrinkage, including the persisted global positive-event-weighted k;
the canonical data/models/risk_scores.parquet output until per-family evaluation is reviewed.

4. Operational outputs: both global and per-family rankings

The family split should produce two new ranking surfaces while preserving the current global ranking for comparison.

4.1 Global stitched ranking

Each family model produces predicted_xgb_family for its own family link-year rows. Those predictions are pooled to one row per link using the same unit as the current score_collision_models() path:

predicted_xgb_family = mean expected collisions per link-year

The stitching procedure is:

score link-years within each family model;
pool to one row per link, taking mean predicted_xgb_family across years;
concatenate all four family outputs;
rank predicted_xgb_family across all links to produce risk_percentile_family.

This network-wide stitched ranking has the same operational semantics as the current risk_percentile: it answers, “where does this link sit in the whole network?” The current risk_percentile remains unchanged and should be available side-by-side.

Per-family Poisson XGBoost with the same exposure offset should, in principle, produce predictions on the same expected-collisions-per-link-year scale across families. The stitched ranking is therefore meaningful in principle. In practice, family boundaries can create ranking discontinuities if the family-level models are calibrated differently. Section 6.3 makes that an explicit validation diagnostic rather than an assumption.

4.2 Per-family rankings

Each family also produces risk_percentile_within_family, ranked only within that family. This is a different operational object from the stitched ranking. “Top 1% of motorways for traffic-management intervention” and “top 1% of rural minor/local roads for road-safety partnership work” are not the same screening question.

The eventual scored output should include:

Column	Meaning
`family`	one of `motorway`, `trunk_a`, `other_urban`, `other_rural`
`risk_percentile`	current global-model percentile, unchanged
`predicted_xgb_family`	per-family model prediction, pooled to link grain
`risk_percentile_family`	global stitched percentile from per-family models
`risk_percentile_within_family`	percentile within the link’s family

The current production risk_percentile stays unchanged until v1 evaluation supports adoption of either new ranking surface.

5. Implementation design

This is a design document only. The implementation should be small, additive, and reviewable. The first implementation pass should not modify the existing global training path: train_collision_xgb() stays as-is, and the new family module wraps it.

5.1 Proposed file structure

File	Change
`src/road_risk/model/collision.py`	Add helper to assign family from Open Roads and network-feature columns.
`src/road_risk/model/family_split.py`	New module: `train_family_xgb()`, `score_family_xgb()`, `stitch_predictions()`, family assignment logic.
`src/road_risk/diagnostics/family_evaluation.py`	New diagnostic: per-family and combined evaluation, comparison against global baseline.
`src/road_risk/model/rank_stability.py`	Extend to optionally run per-family evaluation.
`data/models/family/seed_<N>/<family>.parquet`	New per-seed per-family score outputs.
`data/models/risk_scores_family.parquet`	Eventually: scored output with both new ranking columns.
`data/provenance/family_split_provenance.json`	New provenance for per-family training run.

5.2 Family assignment

Add a vectorised assign_family() helper that takes columns road_function, is_trunk, and ruc_urban_rural, and returns one of:

motorway
trunk_a
other_urban
other_rural

The precedence is:

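def family_for_link(road_function, is_trunk, ruc_urban_rural):
    # Scalar statement of the precedence (illustrative name);
    # the real assign_family() helper should be vectorised.
    if road_function == "Motorway":
        return "motorway"
    if road_function == "A Road" and is_trunk:
        return "trunk_a"
    if ruc_urban_rural == "Urban":
        return "other_urban"
    if ruc_urban_rural == "Rural":
        return "other_rural"
    raise ValueError("no family assigned")  # should not happen post-RUC-fill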
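A vectorised form of the same precedence could look like the sketch below. It assumes a pandas DataFrame with the three assignment columns and uses numpy.select, which takes the first matching condition and so encodes the precedence order directly. The "unknown" sentinel is an illustrative choice.

```python
import numpy as np
import pandas as pd

FAMILIES = ["motorway", "trunk_a", "other_urban", "other_rural"]


def assign_family(df):
    """Vectorised family assignment following the stated precedence."""
    conditions = [
        df["road_function"].eq("Motorway").to_numpy(),
        (df["road_function"].eq("A Road") & df["is_trunk"].eq(True)).to_numpy(),
        df["ruc_urban_rural"].eq("Urban").to_numpy(),
        df["ruc_urban_rural"].eq("Rural").to_numpy(),
    ]
    # np.select returns the choice for the first condition that is True.
    labels = np.select(conditions, FAMILIES, default="unknown")
    family = pd.Series(labels, index=df.index, name="family")
    n_unknown = int(family.eq("unknown").sum())
    if n_unknown:
        raise ValueError(f"{n_unknown} links could not be assigned a family")
    return family


links = pd.DataFrame({
    "road_function": ["Motorway", "A Road", "A Road", "B Road"],
    "is_trunk": [False, True, False, False],
    "ruc_urban_rural": ["Rural", "Urban", "Urban", "Rural"],
})
links["family"] = assign_family(links)
```

Raising on any unassigned row turns the "zero unknown-family rows" requirement into a hard failure rather than a silent default.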

The implementation should verify that every link in road_link_annual.parquet maps to exactly one family, and that the full scored network has zero unknown-family rows.

5.3 Per-family training loop

The wrapper should be intentionally thin:

def train_family_xgb(df, family, seed=42):
    family_df = df[df["family"] == family]
    return train_collision_xgb(family_df, seed=seed)

train_collision_xgb() is called per family without changing its internals. The wrapper is responsible for filtering, row-count checks, logging, and provenance.

5.4 Stitching and scoring

The scoring helper should mirror score_collision_models() but use the per-family XGBoost models:

import pandas as pd

def score_family_xgb(network_df, models_by_family):
    # Score each family's link-year rows with that family's own model.
    predictions = []
    for family, model in models_by_family.items():
        family_df = network_df[network_df["family"] == family].copy()
        family_df["predicted_xgb_family"] = model.predict(...)
        predictions.append(family_df)
    # Recombine the families, then pool link-years to one row per link.
    scored = pd.concat(predictions)
    return pool_family_predictions(scored)

Pooling should keep the same link-level semantics as current Stage 2:

Column	Aggregation
`collision_count`	sum across years
`fatal_count`	sum across years
`serious_count`	sum across years
`estimated_aadt`	mean across years
`predicted_xgb_family`	mean across years
`family`	first; must be constant by `link_id`
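The aggregation table maps directly onto a pandas named aggregation. The sketch below is one possible body for pool_family_predictions(), not the agreed implementation; the constancy check on family is an added safeguard.

```python
import pandas as pd


def pool_family_predictions(scored):
    """Pool link-year rows to one row per link_id."""
    n_families = scored.groupby("link_id")["family"].nunique()
    if (n_families > 1).any():
        raise ValueError("family must be constant within link_id")
    return scored.groupby("link_id", as_index=False).agg(
        collision_count=("collision_count", "sum"),
        fatal_count=("fatal_count", "sum"),
        serious_count=("serious_count", "sum"),
        estimated_aadt=("estimated_aadt", "mean"),
        predicted_xgb_family=("predicted_xgb_family", "mean"),
        family=("family", "first"),
    )


# Toy scored table: two links, two years each.
scored = pd.DataFrame({
    "link_id": ["a", "a", "b", "b"],
    "collision_count": [1, 3, 0, 0],
    "fatal_count": [0, 1, 0, 0],
    "serious_count": [1, 0, 0, 0],
    "estimated_aadt": [1000.0, 3000.0, 500.0, 500.0],
    "predicted_xgb_family": [0.5, 1.5, 0.1, 0.3],
    "family": ["motorway", "motorway", "other_rural", "other_rural"],
})
pooled = pool_family_predictions(scored)
```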

Then compute:

pooled["risk_percentile_family"] = (
    pooled["predicted_xgb_family"].rank(pct=True) * 100
)
pooled["risk_percentile_within_family"] = (
    pooled.groupby("family")["predicted_xgb_family"].rank(pct=True) * 100
)

Exact top-k set selection should use deterministic sorting: predicted_xgb_family desc, link_id asc.
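The tie-breaking rule can be expressed as a two-key sort. A minimal sketch, assuming link_id is unique in the pooled table; the helper name top_k_links is hypothetical.

```python
import pandas as pd


def top_k_links(pooled, k):
    """Top-k link_ids, breaking prediction ties by link_id ascending."""
    ordered = pooled.sort_values(
        ["predicted_xgb_family", "link_id"],
        ascending=[False, True],
    )
    return ordered["link_id"].head(k).tolist()


toy = pd.DataFrame({
    "link_id": ["c", "b", "a", "d"],
    "predicted_xgb_family": [0.9, 0.5, 0.5, 0.1],
})
top_two = top_k_links(toy, 2)
```

With the tie on 0.5 broken by link_id, the result does not depend on input row order, which keeps top-k membership reproducible across runs.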

6. Validation plan

Validation decides whether the family split should be adopted, kept as a diagnostic, or deferred.

6.1 Headline metrics: global stitched vs current global

Compare risk_percentile_family against the current global risk_percentile.

Report:

pseudo-R2 on held-out link-years using stitched family predictions;
5-seed Jaccard at k = 100, 1,000, 10,000, and top-1%;
Spearman rank correlation between risk_percentile_family and current risk_percentile;
top-1% entrant/leaver counts and road-family breakdown.

The headline question is whether the stitched ranking improves on the global model’s headline metrics. The pre-run hypothesis is modest improvement in stability, because each model has less within-family structural variation to learn, and small change in pseudo-R2.
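For reference, the 5-seed Jaccard summary can be sketched as the mean and minimum over all seed pairs of the overlap between top-k sets. The function names are illustrative, and the toy sets below are not real results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of link ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(top_sets_by_seed):
    """Mean and minimum Jaccard over all pairs of seeds."""
    scores = [
        jaccard(top_sets_by_seed[s], top_sets_by_seed[t])
        for s, t in combinations(sorted(top_sets_by_seed), 2)
    ]
    return sum(scores) / len(scores), min(scores)


# Toy top-4 sets for three seeds.
toy_sets = {
    42: {1, 2, 3, 4},
    43: {1, 2, 3, 5},
    44: {1, 2, 3, 4},
}
mean_j, min_j = pairwise_jaccard(toy_sets)
```

Five seeds give ten pairs, so the reported mean and minimum each summarise ten set comparisons per threshold.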

6.2 Per-family metrics

For each family, report:

pseudo-R2 on held-out link-years within the family;
mean residual on training and held-out link-years;
observed vs predicted by family risk decile;
5-seed Jaccard on family-specific top-k sets.
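The observed-versus-predicted table by family risk decile could be built as in the sketch below. The column names observed and predicted are placeholders, and equal-count deciles based on within-family rank are an assumption about how the bins are formed.

```python
import pandas as pd


def calibration_by_decile(df, n_bins=10):
    """Mean observed and predicted counts per family and risk decile."""
    out = df.copy()
    grouped = out.groupby("family")["predicted"]
    # Rank with method="first" so tied predictions still fill bins evenly.
    rank = grouped.rank(method="first")
    size = grouped.transform("size")
    out["decile"] = (((rank - 1) * n_bins) // size).astype(int) + 1
    return out.groupby(["family", "decile"], as_index=False).agg(
        n=("observed", "size"),
        mean_observed=("observed", "mean"),
        mean_predicted=("predicted", "mean"),
    )


# Toy data: 20 held-out rows in one family.
toy = pd.DataFrame({
    "family": ["motorway"] * 20,
    "predicted": [0.05 * (i + 1) for i in range(20)],
    "observed": [i % 3 for i in range(20)],
})
table = calibration_by_decile(toy)
```

A well-calibrated family shows mean_observed tracking mean_predicted across deciles; a systematic gap in the upper deciles is the pattern the per-family models are meant to reduce.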

Per-family top-k thresholds should scale with family size:

Family	Suggested thresholds
Motorway	top 25, top 50, top 100, top 10%
Trunk A-road	top 50, top 100, top 500, top 10%
Other-Urban	top 100, top 1,000, top 10,000, top 1%
Other-Rural	top 100, top 1,000, top 10,000, top 1%

For motorway, top-1% is only about 40 links and will be noisy. A top-10% diagnostic is more meaningful for that family.

The pre-run hypothesis is that motorway and trunk A families show the largest improvement in mean residual over the global model. Other-Urban and Other-Rural may show smaller improvement because the global model already has many of the relevant road-function, RUC, and exposure signals.

6.3 Family-boundary discontinuity check

Stitching per-family predictions can produce ranking discontinuities if family-level calibration differs.

Diagnostics:

Compute the predicted_xgb_family value at the top-1% threshold within each family. Are the family thresholds similar, or does one family require much larger predicted counts to enter its own top 1%?
At each of the three global stitched thresholds (top-1%, top-1,000, top-10,000), sample adjacent boundary-crossing link pairs, stratified by family pair. The six possible family pairs are Motorway × Trunk A-road, Motorway × Other-Urban, Motorway × Other-Rural, Trunk A-road × Other-Urban, Trunk A-road × Other-Rural, and Other-Urban × Other-Rural. For each family pair at each threshold, sample up to 25 adjacent pairs in which the two links straddle the threshold and belong to different families, giving a target of at most 150 pairs per threshold (6 family pairs × 25). If a family pair has fewer than 25 available pairs at a threshold, take all of them and report the count explicitly. A low count, most likely for rare pairs such as Motorway × Other-Rural, is itself a calibration signal: it means those families’ risk ranges barely overlap, which the global model may not be handling well. Compare predicted values within each sampled pair: small gaps indicate smooth stitching, and large gaps indicate a calibration discontinuity.
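The first diagnostic, the predicted value at each family's own top-1% threshold, can be sketched with a grouped quantile. This assumes the pooled link-level table described in section 5.4 and uses pandas' default linear interpolation, so the threshold is an interpolated value rather than an observed prediction. The helper name is hypothetical.

```python
import pandas as pd


def family_entry_thresholds(pooled, top_share=0.01):
    """Predicted value needed to enter the top `top_share` of each family."""
    return pooled.groupby("family")["predicted_xgb_family"].quantile(1 - top_share)


# Toy pooled table: two families with different prediction ranges.
toy = pd.DataFrame({
    "family": ["motorway"] * 11 + ["other_rural"] * 11,
    "predicted_xgb_family": [float(i) for i in range(11)]
    + [0.1 * i for i in range(11)],
})
thresholds = family_entry_thresholds(toy, top_share=0.1)
```

Comparing these per-family thresholds with the stitched network-wide threshold shows which families can realistically contribute links to the global top set.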

The pre-run hypothesis is that small discontinuities are likely. Large discontinuities would make the stitched ranking operationally questionable, even if within-family rankings are useful.

6.4 5-seed harness extension

The existing src/road_risk/model/rank_stability.py harness runs seeds 42-46 against the global model and writes data/models/rank_stability/seed_<N>.parquet. Extend it with an optional per-family mode.

For each seed:

build the same Stage 2 base table;
assign family;
train four XGBoost models, one per family;
write per-family score files to data/models/family/seed_<N>/<family>.parquet;
write a stitched score file to data/models/family/seed_<N>/stitched.parquet;
compute pairwise Jaccard and Spearman for the stitched ranking;
compute family-specific Jaccard for each family.

The output report should be reports/family_rank_stability.md, following the same structure as reports/rank_stability.md.

Runtime expectation: per-family training is faster per model because each family has fewer rows, but the total job is 4 families × 5 seeds = 20 training runs. Total runtime is plausibly comparable to the existing global 5-seed run, on the order of several hours. The eventual implementation prompt should treat this as a “leave it running” job.

6.5 Comparison against global baseline

The evaluation report should compare:

Model surface	Metrics
Current global XGBoost	pseudo-R2, 5-seed Jaccard at each k, Spearman, calibration
Per-family stitched	same headline metrics
Per-family within-family	pseudo-R2, residuals, calibration, family-scaled top-k Jaccard

Use the existing global baseline from reports/rank_stability.md:

Metric	Current global result
pseudo-R2 mean across 5 seeds	0.323498
pseudo-R2 std across 5 seeds	0.002678
top-1% Jaccard mean	0.903575
top-1% Jaccard min	0.896574
Spearman mean	0.999140
Spearman min	0.999069

Adoption criterion: recommend the per-family approach if either:

headline stitched metrics improve materially; or
per-family diagnostics reveal patterned residuals that the global model was missing, even if the headline stitched gain is modest.

If neither condition holds, document the split as evaluated and not adopted.

7. Caveats

7.1 Imputed RUC for about 1k active links

The RUC fill flagged about 336k links with ruc_imputed = True. Most are outside road_link_annual.parquet. The practical rural-default fallback impact is 993 active modelled links, about 0.05% of the 2.17M-link network. These links land in Other-Rural. The impact is small, but output documentation should keep ruc_imputed and ruc_fill_method visible for audit.

7.2 Motorway family size

The motorway family has 4,084 links. That is small for XGBoost with the current global hyperparameters. The 5-seed evaluation should surface motorway-specific instability if the model overfits. v2 candidates include reduced n_estimators, early stopping, or partial pooling with trunk A-road.

7.3 Stitched ranking calibration

Per-family Poisson XGBoost with the same exposure offset should produce comparable predictions across families, but calibration can still differ. The section 6.3 discontinuity diagnostic is mandatory. If calibration differs materially, the stitched ranking should not be adopted; only within-family rankings should be reported.

7.4 Per-family EB k is deferred to v2

Current EB shrinkage uses one global positive-event-weighted k. With per-family models, per-family k is the natural pairing because each family may have its own dispersion structure. EB session 1’s non-constant dispersion finding suggests this is likely. It is deferred to v2 so the mean-model split can be evaluated cleanly first.

7.5 Intersections / roundabouts not separated

form_of_way == "Roundabout" links remain inside their parent family for v1. HSM treats intersections as separate site types, so a separate intersection or roundabout family is methodologically plausible. It is deferred because it would bundle another family-definition decision into v1. Promote it only if v1 residuals show form-of-way patterning.

7.6 GLM unchanged, XGBoost-only split

Per-family GLM is not implemented in v1. The current global GLM remains a diagnostic baseline; its later post-grade/post-fix run sits around pseudo-R² 0.347 on the downsampled in-sample GLM surface. Per-family GLM is a v2 extension if interpretability by family becomes important.

8. Out of scope for v1

Hyperparameter tuning per family.
Per-family feature pruning.
Per-family EB k.
Hierarchical or partial-pooling models.
Roundabout or intersection family.
Per-family GLM.
NHNM integration.

NHNM integration depends on this work landing first and should remain a separate later design.

9. Expected outcomes

These are hypotheses to compare against v1 results, not pass/fail thresholds.

Outcome	Pre-run expectation
Headline pseudo-R2	approximately unchanged around 0.32, or slightly improved
Stitched top-1% 5-seed Jaccard	approximately unchanged or marginally improved relative to current 0.904
Motorway mean residual	substantially closer to zero than the current global mean residual around -3.3
Other-Urban / Other-Rural metrics	roughly comparable to global, because the global model already has many relevant signals
Family-boundary discontinuity	some discontinuity is likely; small is acceptable, large is a warning against stitched ranking adoption

The most valuable positive result would be not just a higher headline pseudo-R2, but clearer family-specific calibration and residual behaviour.

10. Implementation status

Sessions 1 & 2 (Complete - April 25, 2026): The network was successfully split into Motorway (4,084 links), Trunk A, Other-Urban, and Other-Rural based on ONS RUC and road function.

Result: Stitched all-links pseudo-R² improved to 0.895 vs the global 0.888.
Artefact: risk_scores_family.parquet was generated successfully alongside the production scores.
Diagnostics: Full validation available in reports/family_validation.md.

Sessions 3 and 4: Deferred. Adoption decision reached from single-seed evidence (see §11). Session 3 (5-seed harness) would confirm the motorway reversal at multi-seed grain, but is deferred pending v2 redesign rather than running on the v1 specification.

11. Session 1–2 conclusions

Sessions 1 and 2 produced single-seed evidence on the per-family approach. Session 3 (5-seed harness) was deferred pending v2 design.

Findings

Stitched ranking is calibration-clean. Largest adjacent different-family predicted-value gap is 0.0047 across all family pairs at top-1%, top-1000, and top-10000 thresholds. The stitching itself is operationally sound.
Motorway calibration improvement is large. Per-family motorway mean residual is +0.13, vs the global model’s known −3.3 under-prediction. This is the design doc’s primary v1 hypothesis, and the single-seed evidence supports it.
Held-out pseudo-R² is essentially unchanged on three of four families. Apples-to-apples held-out link-year deltas: trunk_a +0.006, other_urban +0.001, other_rural +0.002. All within seed-noise of zero. The all-data gains in reports/family_validation.md §6.2.1 were partly training-set fit, not held-out generalisation.
Motorway pseudo-R² reverses on held-out. All-data delta +0.052 vs held-out delta −0.027. The per-family motorway model overfits its small training set (~4k links). This is consistent with the §3.1 flag that n_estimators=500 may be too much capacity at motorway scale.

Adoption

Do not adopt v1 stitched ranking as a replacement for global risk_percentile. The held-out evidence does not support a meaningful generalisation gain, and the motorway pseudo-R² reversal is a genuine concern.

The motorway calibration improvement is methodologically interesting and motivates v2 work. The per-family approach is documented as evaluated, with v1 outputs available diagnostically in data/models/risk_scores_family.parquet.

v2 design questions (deferred)

Motorway hyperparameters. n_estimators=500 is plausibly too large for a 4k-link family. v2 candidates: reduced n_estimators (e.g. 100–200), early stopping with held-out validation, or partial pooling.
Network expansion. Going from N+C England (~4k motorway links) to all-GB (~6k) may reduce overfitting risk. Necessary but probably not sufficient: the held-out near-zero deltas on the other three families suggest the gain mechanism is not simply “more data per family.”
Partial pooling. Motorway and trunk-A could be pooled into a single “high-spec” family with shared structure. This would preserve the calibration improvement on motorway while increasing training data and reducing overfitting risk.
Per-family EB k. The v1-to-v2 progression flagged in §7.4 still applies. Coherent with per-family modelling but requires the mean-model change to land first.
Network topology features. None of the v1 work addressed the underlying feature gap (curvature, gradient, junction type/density beyond what is in form_of_way). Per-family modelling without new features may be hitting an information ceiling on what existing features can predict.

Session 3 (§6.4) remains specified. It would confirm or refute the single-seed reversal at multi-seed grain. Deferred pending v2 redesign rather than running on the v1 specification.