Empirical Bayes Shrinkage

Status: Design complete; sessions 1–2 complete. Adoption recommendation in §10. Scope: v1 empirical-Bayes-flavoured shrinkage for Stage 2 risk ranking. Primary code path: src/road_risk/model/collision.py and a small diagnostic module. Reference model: the AASHTO Highway Safety Manual (HSM) network-screening EB formula, adapted to the current XGBoost point predictions.


0. Background: A Primer on Empirical Bayes Shrinkage

Before diving into the mathematical formulation and implementation details, it is helpful to understand the core concept of Empirical Bayes (EB) Shrinkage and why we are applying it to our Stage 2 risk rankings.

The Problem: Pure Prediction vs. Reality

Currently, our XGBoost model ranks road links based purely on mathematical predictions. While powerful, this approach has two major weaknesses when dealing with rare events like car crashes:

  1. The “Fluke” Problem: A historically safe road might experience one freak accident. Relying purely on that raw observation makes the road look artificially dangerous.
  2. The “Blind Spot” Problem: The model might predict a road is perfectly safe based on its features, but if there have been 10 crashes there in the last 5 years, the model is clearly missing something.

The Solution: EB Shrinkage

Empirical Bayes Shrinkage acts as a reality check. It calculates a statistical “weight” to blend the model’s prediction with the observed reality.

  • If a road has very little data or low exposure, the formula heavily trusts the AI model, preventing a single fluke accident from skewing the rankings.
  • If a road has a long, sustained history of actual crashes, the formula overrides the AI and trusts the historical reality.

Interactive Demonstration

Use the sliders below to see how the EB formula balances the model against reality. Notice how adjusting the k (Noise Parameter)—which represents how chaotic the baseline data is—shifts the balance of trust.

// 1. Interactive inputs and the EB computation
viewof predicted_rate = Inputs.range([0.01, 5.0], {value: 0.5, step: 0.01, label: "AI Predicted Rate (/yr)"})
viewof observed_total = Inputs.range([0, 50], {value: 12, step: 1, label: "Observed Total Crashes"})
viewof years_observed = Inputs.range([1, 20], {value: 10, step: 1, label: "Years Observed"})
viewof k_parameter = Inputs.range([0.1, 10.0], {value: 3.451158, step: 0.01, label: "k (Noise Parameter)"})
predicted_total = predicted_rate * years_observed    // N_pred: total predicted crashes
weight = 1 / (1 + (k_parameter * predicted_total))   // w = 1 / (1 + k * N_pred)
eb_total = (weight * predicted_total) + ((1 - weight) * observed_total)
eb_annual_rate = eb_total / years_observed
observed_annual = observed_total / years_observed

// 2. Display the mathematical breakdown
md`
**Model Trust (Weight w):** ${(weight * 100).toFixed(1)}%   |   **History Trust (1 - w):** ${((1 - weight) * 100).toFixed(1)}%  
**Final EB Adjusted Rate:** **${eb_annual_rate.toFixed(2)} crashes/year**
`
// 3. Render the interactive chart
Plot.plot({
  marginLeft: 150,
  x: { 
    label: "Annual Crash Rate", 
    domain: [0, Math.max(predicted_rate, observed_annual, eb_annual_rate) * 1.2] 
  },
  y: { label: null },
  color: { 
    domain: ["1. AI Prediction", "2. Observed Reality", "3. EB Adjusted Result"],
    range: ["#4285F4", "#EA4335", "#34A853"]
  },
  marks: [
    Plot.barX([
      {type: "1. AI Prediction", rate: predicted_rate},
      {type: "2. Observed Reality", rate: observed_annual},
      {type: "3. EB Adjusted Result", rate: eb_annual_rate}
    ], {
      x: "rate", 
      y: "type", 
      fill: "type", 
      sort: {y: null}
    }),
    Plot.text([
      {type: "1. AI Prediction", rate: predicted_rate},
      {type: "2. Observed Reality", rate: observed_annual},
      {type: "3. EB Adjusted Result", rate: eb_annual_rate}
    ], {
      x: "rate", 
      y: "type", 
      text: d => d.rate.toFixed(2), 
      textAnchor: "start", 
      dx: 5
    })
  ]
})


1. Motivation and current units

Stage 2 currently ranks links by pooled predicted_xgb. This is a model-only ranking: it uses the XGBoost Poisson expectation and does not directly borrow from observed collision history. For rare events across roughly 2.17 million links, a pure point-prediction ranking can overstate certainty near narrow top-k thresholds.

Empirical Bayes (EB) shrinkage combines a model expectation with observed history. In highway safety screening, this controls regression to the mean: short links with one unusual collision are shrunk toward model expectation, while links with repeated observed collisions retain more observed evidence.

1.1 Current Stage 2 prediction unit

In train_collision_xgb(), XGBoost is trained with:

objective = "count:poisson"
base_margin = log_offset
log_offset = log(estimated_aadt * link_length_km * 365 / 1e6)

The offset is exposure in million vehicle-kilometres. XGBoost therefore returns an expected collision count per link-year, not a rate per vehicle-km.
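
For illustration with hypothetical values: a link with estimated_aadt = 20,000 and link_length_km = 1.5 carries 20,000 × 1.5 × 365 / 1e6 ≈ 10.95 million vehicle-km per year, so its log_offset is log(10.95) ≈ 2.39.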

In score_collision_models(), predictions are pooled to one row per link:

pool_agg = {
    "collision_count": "sum",
    "predicted_xgb": "mean",
}
pooled = df.groupby("link_id").agg(pool_agg).reset_index()

The output predicted_xgb in risk_scores.parquet is therefore the mean expected collisions per year for the link, and the output collision_count is the total observed collisions over all link-years included in the Stage 2 base table.

The current Stage 2 dataset is built from all years in aadt_estimates.parquet; the 5-seed rank-stability run logged 10 years, 2015-2024, giving 21,675,570 link-years and 2,167,557 scored links. The EB implementation should not assume 10 years permanently: it should compute n_years per link from the link-year table.


2. Mathematical formulation

Use the HSM combining formula, with XGBoost predictions as the prior mean:

expected_EB = w * predicted + (1 - w) * observed
w = 1 / (1 + k * predicted * n_years)

For a single link:

Symbol  Definition                           Units
μ       pooled predicted_xgb                 expected collisions/year
n       n_years for the link                 years
N_pred  μ × n                                total predicted collisions over observed period
N_obs   collision_count                      total observed collisions over observed period
k       global NB2 overdispersion parameter  dimensionless

Apply the formula in total-count space:

w = 1 / (1 + k * N_pred)
N_EB = w * N_pred + (1 - w) * N_obs

Then convert back to a per-year ranking score:

predicted_eb = N_EB / n
risk_percentile_eb = percentile_rank(predicted_eb)

This keeps the weight and the observed history in count units, which matches the HSM form and avoids mixing annual means with multi-year totals inside the same term.
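
As a worked example, a minimal sketch using the default values from the §0 interactive demonstration (illustrative numbers, not production data):

# Worked example of the EB combining formula (§0 demo defaults).
mu, n, n_obs, k = 0.5, 10, 12, 3.451158  # predicted/yr, years, observed, dispersion
n_pred = mu * n                          # 5.0 predicted collisions in total
w = 1 / (1 + k * n_pred)                 # ~0.055: little weight on the model
n_eb = w * n_pred + (1 - w) * n_obs      # ~11.62 collisions over the period
predicted_eb = n_eb / n                  # ~1.16 collisions/year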

2.1 Weight interpretation

Condition                                    Effect on w    Ranking effect
Low predicted count and few observed years   w closer to 1  stronger pull to model prediction
High predicted count or many observed years  w lower        more observed-history influence
Higher k                                     w lower        more observed-history influence
k near 0                                     w near 1       EB converges toward current model ranking

The intended behaviour is conservative. A very short, low-exposure link with a single unusual collision should not leap to the top solely because its observed rate is extreme. A link with sustained observed collisions across many years should retain more of that history.


3. k estimation

3.1 Primary method for v1: method-of-moments

v1 uses one global NB2 overdispersion parameter, k, estimated by method-of-moments (MoM) from the existing Stage 2 predictions and observed link-year collision counts. This avoids fitting a global NB GLM on 17M+ complete-case link-years as the default path.

The procedure is as follows (a minimal sketch implementing steps 3–7 appears after the list):

  1. Use existing predicted_xgb values from the current Stage 2 scoring output, or rerun scoring once for this purpose without retraining the model.

  2. Work at link-year grain, pairing each observed collision_count with its link-year predicted_xgb.

  3. Bin link-years by predicted collision count, using 20 quantile bins by predicted_xgb as the default. If many bins have too few positive observations, merge adjacent bins until each retained bin has at least 10,000 link-years and at least 100 positive collision rows.

  4. Within each bin, compute empirical mean and variance of observed link-year collision_count.

  5. Under NB2:

    Var(y) = E(y) + k * E(y)^2

    so:

    k_bin = (Var(y) - E(y)) / E(y)^2
  6. Drop bins where Var(y) <= E(y), because they imply non-physical negative NB2 dispersion. Log the dropped-bin count and fail loudly if more than 25% of bins are dropped. A small number of non-overdispersed bins is plausible sampling noise, especially in very low-risk bins; more than 25% suggests the global NB2 dispersion assumption is not behaving well enough for a single borrowed k.

  7. Aggregate retained k_bin values using a positive-event weighted mean as the primary k:

    k = sum(n_pos_bin * k_bin) / sum(n_pos_bin)

    where n_pos_bin is the count of positive-collision link-years in each bin.

    Diagnostic alternatives — also computed by eb_dispersion.py but not used for production scoring:

    • link-year weighted mean: sum(n_bin * k_bin) / sum(n_bin)
    • median of retained k_bin values

    Why positive-event weighting? Session 1 found strongly non-constant dispersion: k_bin varied by approximately 3,400× across predicted-risk bins, declining monotonically from low- to high-prediction. When dispersion is non-constant, the aggregation scheme determines which part of the risk distribution dominates the global k. For a top-k screening application, the operationally relevant k weights toward where collision events actually occur. Link-year weighting pulls k toward low-prediction bins — which contain most link-years but few events, a region irrelevant for screening. Positive-event weighting concentrates weight in the event-rich, high-prediction bins where the top-k boundary sits. The refreshed May 2026 production EB run uses k_positive_weighted = 3.451158, with diagnostic alternatives k_link_year_weighted = 14.898721 and k_median = 7.573279. See reports/eb_dispersion.md and reports/top_risk_output_qa.md for detail.
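
A minimal sketch of steps 3–7, assuming a link-year dataframe with predicted_xgb and collision_count columns. The function name is hypothetical, and it drops under-populated bins for brevity where the design above merges adjacent bins:

import numpy as np
import pandas as pd

def estimate_k_mom(link_years, n_bins=20, min_rows=10_000, min_pos=100):
    df = link_years.copy()
    # Step 3: quantile bins by predicted collision count.
    df["bin"] = pd.qcut(df["predicted_xgb"], n_bins, duplicates="drop")
    rows = []
    for _, g in df.groupby("bin", observed=True):
        n_pos = int((g["collision_count"] > 0).sum())
        if len(g) < min_rows or n_pos < min_pos:
            continue  # design says merge adjacent bins; dropped here for brevity
        # Step 4: empirical mean and variance of observed counts.
        m, v = g["collision_count"].mean(), g["collision_count"].var()
        # Step 6: drop non-physical bins with Var(y) <= E(y).
        if v <= m:
            continue
        # Step 5: k_bin = (Var(y) - E(y)) / E(y)^2 under NB2.
        rows.append({"k_bin": (v - m) / m**2, "n": len(g), "n_pos": n_pos})
    bins = pd.DataFrame(rows)
    # Step 7: positive-event weighted mean as the primary k.
    return np.average(bins["k_bin"], weights=bins["n_pos"]), bins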

The provenance report should include:

  • method: "method_of_moments";
  • k point estimate;
  • k_bin values;
  • bin definitions and row counts;
  • aggregation rule;
  • number and share of dropped negative-k bins;
  • data fingerprint;
  • git SHA;
  • timestamp;
  • row counts used.

Persist this to data/provenance/eb_dispersion_provenance.json.

3.2 One-off calibration artefact, not every retrain

The persisted-k decision still applies under MoM. Estimate k as a one-off calibration artefact, not as part of every production retrain. Reuse it for EB scoring until the Stage 2 data snapshot, scoring model, geography, modelling years, or exposure construction changes.

This is acceptable because k is being used as a global overdispersion calibration constant, not as a model parameter that needs to track seed-level XGBoost variation. XGBoost seed changes perturb tree structure and rankings; they do not change the underlying collision-count overdispersion in the link-year table.

MoM scales well because it only requires binning and summary statistics. If some predicted-rate ranges have insufficient bin counts, increase bin width or merge adjacent bins rather than subsampling.

3.3 Optional follow-up: NB GLM cross-check

If the MoM k looks suspicious, run an NB GLM as a cross-check rather than as a matter of course. Suspicious signs include:

  • implausibly tiny k, making EB nearly identical to the unshrunken ranking;
  • implausibly large k, making EB behave almost like observed-history ranking;
  • k_bin varying wildly across predicted-risk bins;
  • more than a small number of bins dropped for non-physical negative k_bin;
  • strong sensitivity in §7 when comparing the three k aggregation schemes (positive-event weighted, link-year weighted, and median).

Two cross-check paths are acceptable:

  • Try statsmodels NegativeBinomialP MLE first. It jointly estimates coefficients and alpha in a single fit, which should be faster than a profile likelihood, but may have convergence issues on this data size.
  • If NegativeBinomialP fails to converge or gives unstable alpha, use profile likelihood over an alpha grid. This is slower, but more diagnosable because the likelihood curve can be inspected directly.

The NB GLM cross-check should report whether its alpha is operationally close to the MoM k, not replace MoM automatically. Any decision to switch from MoM to an NB-GLM-derived k should be recorded as a methodology change.
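
A sketch of the first cross-check path, assuming statsmodels is available, a design matrix mirroring the Stage 2 scoring inputs, and the same log_offset used in training. The function name is hypothetical, and convergence on 17M+ rows is not guaranteed:

import statsmodels.api as sm

def nb2_alpha_crosscheck(X, y, log_offset):
    # NB2 MLE: jointly estimates coefficients and the dispersion alpha.
    model = sm.NegativeBinomialP(y, sm.add_constant(X), p=2, offset=log_offset)
    res = model.fit(method="lbfgs", maxiter=500, disp=False)
    # Under NB2, Var(y) = mu + alpha * mu^2, so alpha is comparable to the MoM k.
    return res.params[-1]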

3.4 Methodological note

MoM and maximum-likelihood estimation target the same NB2 dispersion parameter under a correctly specified NB2 data-generating process. They can differ when the data deviate from NB assumptions, when dispersion varies by facility type, or when the mean model is misspecified.

For this v1 design, the practical consequence is limited: k is a global dispersion constant feeding a sensitivity-analysed shrinkage formula, not a link-level posterior. If MoM and MLE differ modestly, the §7 sensitivity analysis should reveal whether that difference matters operationally. If they diverge enough to change top-k membership materially, the borrowed-k method is fragile and should remain diagnostic.

3.5 Design choices retained from TODO

This follows the TODO decisions:

  • global k, not per-family k;
  • Stage 2 scoring inputs (the GLM/XGBoost feature lists that produce predicted_xgb) are unchanged;
  • exposure offset log_offset;
  • full-data distribution, not the GLM’s downsampled zero table;
  • shrink toward predicted_xgb, not predicted_glm.

Because MoM estimates dispersion from binned prediction/observation summaries, the predictions used for binning should come from the current Stage 2 scoring path whose GLM/XGBoost inputs are already fixed and documented.

3.6 Future extension: per-family k

One global k is a v1 simplicity choice. Different facility families likely have different residual dispersion: motorways may be more predictable than minor rural roads. Per-family k belongs with the planned facility-family split, where both the mean model and dispersion estimate can be separated in a coherent way.


4. Handling n_years

n_years must be computed per link from the same link-year table that is being scored:

n_years = df.groupby("link_id")["year"].nunique()

Do not assume a fixed 10-year value. Current production-like runs use 10 years, but future Stage 2 runs may add, drop, or partially cover years. Per-link n_years makes EB weights robust to missing-year coverage and makes the output auditable.

A pooled link with n_years = 0 indicates unexpected pooled dataframe contents, not a normal EB input edge case. Such a link should not exist if the pooled output was produced from link-year rows, so the guard should point back to pooling/scoring rather than silently trying to repair EB inputs.

The EB output should include:

Column              Meaning
n_years             distinct scoring years contributing to the link
eb_weight           w = 1 / (1 + k * predicted_xgb * n_years)
predicted_eb        EB-adjusted expected collisions/year
risk_percentile_eb  percentile rank of predicted_eb

Keep the existing risk_percentile unchanged. EB is an additional ranking, not a replacement until validation justifies adoption.


5. Implementation design

This section describes the design. The implementation should be small and separable from model training.

5.1 Proposed files

File                                           Proposed change
src/road_risk/model/collision.py               Add optional EB scoring hook after pooling, without changing GLM/XGBoost training
src/road_risk/model/eb_shrinkage.py            New small module for compute_eb_scores() and loading persisted k
src/road_risk/diagnostics/eb_validation.py     New diagnostic script/report generator
data/provenance/eb_dispersion_provenance.json  New provenance for one-off k estimation
data/models/risk_scores.parquet                Eventually add EB columns only after review

The first implementation pass should avoid retraining the GLM or changing XGBoost hyperparameters. It should compute EB after the existing pooled scores exist.

5.2 Core function

def compute_eb_scores(pooled, n_years_by_link, k):
    pooled = pooled.merge(n_years_by_link, on="link_id", how="left")
    # Guards: invalid inputs fail loudly rather than being silently repaired.
    if pooled["n_years"].isna().any() or (pooled["n_years"] <= 0).any():
        raise ValueError("pooled link without link-year rows: check pooling/scoring")
    if (pooled["predicted_xgb"] < 0).any() or pooled["collision_count"].isna().any():
        raise ValueError("invalid EB inputs: negative prediction or missing count")
    n_pred = pooled["predicted_xgb"] * pooled["n_years"]
    pooled["eb_weight"] = 1 / (1 + k * n_pred)
    n_eb = (
        pooled["eb_weight"] * n_pred
        + (1 - pooled["eb_weight"]) * pooled["collision_count"]
    )
    pooled["predicted_eb"] = n_eb / pooled["n_years"]
    pooled["risk_percentile_eb"] = pooled["predicted_eb"].rank(pct=True) * 100
    return pooled

The implementation should guard against missing n_years, zero n_years, negative predictions, and missing collision_count, as the sketch above does. Missing or zero n_years should be reported as unexpected pooled dataframe contents, because a link with no link-year rows should not appear in the pooled score output. Negative predictions or missing counts should fail loudly as invalid EB inputs; silently imputing them would make the ranking hard to audit.

5.3 What does not change

  • Stage 1a exposure estimation.
  • build_collision_dataset().
  • GLM or XGBoost feature lists.
  • XGBoost hyperparameters, including n_jobs=1.
  • Existing risk_percentile semantics.
  • The canonical risk_scores.parquet until EB has been reviewed.

6. Validation plan

Validation has two goals: determine whether EB improves the ranking’s operational usefulness, and determine whether it reduces seed-induced churn at narrow top-k thresholds.

6.1 Single-run ranking movement

Run the EB diagnostic against one scored output and report:

  • distribution of n_years;
  • value of k and its provenance;
  • distribution of eb_weight;
  • median absolute percentile change;
  • 90th and 99th percentile absolute percentile change;
  • number and percentage of links that move more than 10 percentile points;
  • number and percentage of links that move more than 25 percentile points.

6.2 Top-1% comparison

Compare current and EB-adjusted top-1% membership:

  • current top-1% count;
  • EB top-1% count;
  • intersection size;
  • count entering EB top-1%;
  • count leaving EB top-1%;
  • road-class breakdown of entrants and leavers.

The top-1% count should be mechanically similar because both rankings are percentiles. The useful signal is membership change and the characteristics of links entering or leaving. At low predicted values, many links can have very similar XGBoost predictions; use pandas percentile ranking with method="average" for the percentile columns, and use an explicit predicted_score desc, link_id asc sort when selecting exact top-k sets so the membership comparison is well-defined.
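
A minimal sketch of that deterministic selection (the helper name is hypothetical):

import pandas as pd

def top_k_links(scores: pd.DataFrame, score_col: str, n_top: int) -> set:
    # Explicit tie-break: score descending, then link_id ascending.
    ordered = scores.sort_values([score_col, "link_id"], ascending=[False, True])
    return set(ordered.head(n_top)["link_id"])

# Hypothetical usage for the membership comparison:
# current = top_k_links(pooled, "predicted_xgb", n_top)
# eb = top_k_links(pooled, "predicted_eb", n_top)
# entrants, leavers = eb - current, current - eb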

6.3 Qualitative link review

Review 5 links that leave the top 1% under EB and 5 links that enter. For each, record:

Field                Purpose
link_id              auditability
road_classification  road type
estimated_aadt       exposure proxy
collision_count      observed total
n_years              observed period
predicted_xgb        model prior per year
predicted_eb         EB-adjusted expected collisions/year
eb_weight            amount of shrinkage
risk_percentile      current percentile
risk_percentile_eb   EB percentile

Expected patterns are hypotheses, not pass/fail criteria:

  • exits may be short or low-exposure links whose high rank is not supported by repeated observed history;
  • entrants may be links where observed history is persistently higher than the model prior.

6.4 Seed-churn intersection diagnostic

After identifying EB entrants and leavers, cross-reference the same links against the seed-churn sets from reports/rank_stability_investigation.md. The current unshrunken 5-seed investigation found:

  • 234 distinct links churning in or out of top-1000 across seed pairs;
  • 21 distinct links churning in or out of top-100 across seed pairs.

For each seed-churn set, report:

  • count of seed-churning links whose EB percentile changes by more than 5 points;
  • count of seed-churning links whose EB percentile changes by more than 10 points;
  • direction of EB movement for seed-churning links, separated into movement toward the top 1% and movement away from the top 1%;
  • the same movement rates for the full scored population, as a reference.

The hypothesis is that EB should disproportionately affect borderline links, which should overlap with the population that seed-churns. If seed-churning links do not show systematically larger EB movement than the general population, EB is probably not addressing the source of seed-induced ranking instability, even if its overall ranking effects are sensible. This is a diagnostic, not a pass/fail criterion.
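
A sketch of the movement-rate computation, assuming a scored dataframe carrying both percentile columns and a set of seed-churning link_ids (names are illustrative):

def movement_rates(scored, churn_link_ids, thresholds=(5, 10)):
    delta = (scored["risk_percentile_eb"] - scored["risk_percentile"]).abs()
    in_churn = scored["link_id"].isin(churn_link_ids)
    for t in thresholds:
        moved = delta > t
        # Compare the churn subset's movement rate with the full population's.
        print(f">{t} pts: churn set {moved[in_churn].mean():.2%}, "
              f"population {moved.mean():.2%}")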

Session 2 finding

The hypothesis was tested in session 2 and was not supported. The seed-churn population moved barely more than the general scored population:

population              n_links     EB change >5 pts   EB change >10 pts
seed_churn_top_100      21          0 (0.00%)          0 (0.00%)
seed_churn_top_1000     234         2 (0.85%)          1 (0.43%)
full scored population  2,167,557   16,114 (0.74%)     6,597 (0.30%)

The seed-churn population showed movement rates comparable to the general population rate (and below it for the top-100 set), not the substantially elevated rates the hypothesis predicted.

Caveat: The seed-churn sets are small (21 and 234 links). The result should be read as “no detectable effect” rather than a precisely quantified null. The qualitative direction is nonetheless clear: if EB were targeting borderline links, we would expect substantially elevated movement rates in the churn population, not parity with the general population. EB and seed stability address different mechanisms (see §8.5).

6.5 Five-seed stability comparison (optional — deferred)

The 5-seed rank-stability harness exists at src/road_risk/model/rank_stability.py and produces per-seed score files in data/models/rank_stability/. The harness could be used for EB validation by applying EB columns to each seed’s pooled predictions with the same persisted k and comparing pairwise Jaccard on EB-adjusted top-k sets against the unshrunken baseline:

top-k    unshrunken mean Jaccard   unshrunken min Jaccard
100      0.903                     0.869
1000     0.893                     0.864
10000    0.927                     0.919
top-1%   0.918                     0.908

This comparison is deferred. Session 2’s seed-churn diagnostic found that EB does not disproportionately affect borderline links — the population that seed-churns. Running the full 5-seed EB comparison is therefore unlikely to surface a different conclusion (EB-adjusted Jaccard is likely to be similar to or slightly worse than unshrunken, because EB introduces movement without addressing the underlying source of seed-induced churn). The 5-seed EB comparison remains feasible if explicit confirmation is wanted, but it is not required for the adoption decision and is deferred.

6.6 Reporting

The EB validation report should include:

  • old vs EB Jaccard table;
  • old vs EB Spearman rank correlation;
  • top-1% entrant/leaver review;
  • k-sensitivity results from §7;
  • clear recommendation: adopt EB, keep as diagnostic only, or defer pending a better dispersion model.

7. k sensitivity

Because v1 borrows a single global k, adoption should depend on whether the ranking is robust to plausible k variation. Given the session 1 finding of strongly non-constant dispersion (k_bin varying ~3,400× across the predicted-risk range), the relevant uncertainty in the global k is the ambiguity between aggregation schemes, not noise around a single estimate. Testing 0.5×k / 2×k around any single aggregation tests less-relevant variation. Compute EB rankings at the three aggregation values:

k_positive_weighted  (production)
k_link_year_weighted (diagnostic)
k_median             (diagnostic)

For each value, report:

  • top-1% membership size;
  • pairwise top-1% Jaccard across the three k values;
  • Spearman correlation between EB rankings produced by the three k values;
  • count of links whose EB percentile changes by more than 1, 5, and 10 percentile points between production k and each diagnostic alternative.

If top-1% membership or pairwise Jaccard moves materially across the three aggregations, then the borrowed-k methodology is operationally fragile under non-constant dispersion. That would not invalidate EB as a concept, but it would mean this v1 implementation should remain diagnostic until dispersion is estimated more coherently, for example by facility family or by an NB model whose mean is itself used for screening.
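
A sketch of that three-way comparison, reusing the compute_eb_scores() and top_k_links() sketches above (the k values come from §3.1; the wiring is illustrative):

from itertools import combinations
from scipy.stats import spearmanr

def k_sensitivity(pooled, n_years_by_link, k_values, top_frac=0.01):
    n_top = int(len(pooled) * top_frac)
    ranks, tops = {}, {}
    for name, k in k_values.items():
        scored = compute_eb_scores(pooled, n_years_by_link, k)
        ranks[name] = scored["predicted_eb"]
        tops[name] = top_k_links(scored, "predicted_eb", n_top)
    for a, b in combinations(k_values, 2):
        jaccard = len(tops[a] & tops[b]) / len(tops[a] | tops[b])
        rho, _ = spearmanr(ranks[a], ranks[b])
        print(f"{a} vs {b}: top-1% Jaccard={jaccard:.4f}, Spearman={rho:.4f}")

# k_sensitivity(pooled, n_years_by_link, {
#     "positive_weighted": 3.451158,
#     "link_year_weighted": 14.898721,
#     "median": 7.573279,
# })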

Refreshed May 2026 finding. Top-1% Jaccard between positive_weighted and the link-year-weighted diagnostic alternative is 0.954375, meaning about 4.6% top-1% membership churn from the aggregation choice alone. Spearman rank correlation is >0.999 across all pairs, but this is dominated by the low-prediction tail where all three k values agree; the meaningful disagreement is at the top of the ranking. The ~4.6% top-1% membership churn is fragile but not catastrophic. The borrowed-k method is sensitive enough to the aggregation choice that the adoption decision should be logged alongside the specific k value used. See reports/eb_validation.md §7 for the full numbers.


8. Caveats

8.1 Borrowed k is a compromise

HSM EB assumes an NB safety-performance function supplies both the prior mean and the dispersion parameter. In this design, the prior mean is the XGBoost Poisson point prediction, while k is borrowed from a global dispersion estimate, primarily method-of-moments and optionally an auxiliary NB GLM cross-check.

This can be reasonable for a v1 diagnostic because k describes residual overdispersion in collision counts, but it is not fully statistically coherent. If k is too high, EB will overweight observed history; if k is too low, EB will look too similar to the current XGBoost ranking.

8.2 This is not a full HSM EB implementation

The output should be described downstream as HSM’s shrinkage formula applied to our model’s point predictions with a borrowed dispersion parameter. It is empirical-Bayes-flavoured shrinkage, not a full HSM EB implementation and not a posterior distribution. In particular:

  • predicted_eb is a point estimate, not a posterior mean with uncertainty intervals;
  • risk_percentile_eb is a deterministic ranking from that point estimate;
  • the method does not produce posterior credible intervals or link-level uncertainty bands;
  • the dispersion parameter is not jointly estimated with the XGBoost mean.

This distinction matters for public wording. The safest language is “EB-adjusted ranking” or “EB-style shrinkage”, not “posterior risk”.

8.3 The GLM intercept issue is avoided, not solved

The current Poisson GLM is fit after zero downsampling, so its intercept is not a calibrated production prior. EB therefore shrinks toward predicted_xgb, not predicted_glm. The primary MoM k estimate should use the full scored link-year distribution, not the downsampled GLM table. If an auxiliary NB GLM is used as a conditional cross-check, it also must not reuse the downsampled GLM table; it must be fit on the full-data distribution or a transparently weighted approximation to it.

8.4 EB should not be mixed with feature ablations

EB changes the ranking definition. Feature/model ablations such as OSM imputation, curvature/grade, or facility-family splitting should either:

  • compare all arms without EB; or
  • apply the same EB procedure to every arm.

Otherwise the effect of a new feature set will be confounded with the effect of changing the ranking target.

8.5 EB scope: regression-to-the-mean, not seed stability

EB shrinkage corrects regression-to-the-mean at the extremes of the predicted/observed alignment: links with high model prediction and zero observed collisions move down; links with low model prediction and many observed collisions move up. This is a genuine and defensible adjustment for a screening output.

It is not, however, a solution to seed-induced top-k churn, which arises from a separate mechanism — small perturbations to model predictions across XGBoost seeds flipping borderline links across narrow thresholds. The session 2 seed-churn diagnostic confirmed empirically that EB does not target the seed-churn population (§6.4). Practitioners should not adopt EB expecting it to improve cross-seed ranking stability.


9. Observed outcomes

Session 2 provides empirical evidence on the pre-run priors stated in this section: most were confirmed, and the seed-stability prior was not.

  • risk_percentile_eb is highly correlated with risk_percentile, but not identical. This matched: top-1% intersection is 84.93% (18,408 of 21,675 links); median absolute percentile change is 0.07 points; p99 change is 2.96 points.

  • Narrow top-k lists are not stabilised by EB. The pre-run prior (“narrow top-k lists may become more stable if EB dampens seed-specific threshold churn”) was not supported. Session 2’s seed-churn diagnostic found EB does not disproportionately affect the seed-churn population. EB is unlikely to improve narrow top-k stability.

  • Short or low-exposure links with limited observed support drop. This matched: the qualitative review found 5 Motorway links with predicted_xgb ≈ 1.0 collision/year and zero observed collisions over 10 years dropping out of top-1%.

  • Links with repeated observed collisions relative to the model prior rise. This matched: A Road, Motorway, and Classified Unnumbered links with 9–14 observed collisions against a model prior of ~0.03 collisions/year entered top-1%.

  • Headline XGBoost pseudo-R² is unchanged. EB is a post-scoring ranking adjustment; it does not alter the XGBoost training-time metric. If EB-adjusted predictions are scored against held-out collisions in a downstream evaluation, that metric can move, and ideally should improve because EB incorporates observed counts.

The substantive ranking shifts EB produces — motorway zero-collision links down; high-observation links up — are mechanically defensible regression-to-the-mean corrections. The adoption case rests on whether those shifts are operationally wanted, not on whether EB also fixes seed stability (it does not; see §8.5).


10. Adoption recommendation

Following sessions 1–2, the recommendation is:

Adopt EB as an additional ranking (risk_percentile_eb), not a replacement for risk_percentile. EB provides a defensible regression-to-the-mean correction at the prediction/observation extremes, but it does not improve cross-seed top-k stability. The top-risk map uses risk_percentile_eb as its preferred screening ranking, with raw risk_percentile still available, so those outputs are consistent with this recommendation. Both rankings should be available in downstream outputs with their semantics clearly documented:

  • risk_percentile — pure model-prior ranking; seed-sensitive near narrow thresholds; no observed-history correction.
  • risk_percentile_eb — EB-adjusted ranking; corrects regression-to-the-mean at the extremes; uses refreshed positive-event-weighted k = 3.451158; about 4.6% top-1% membership differs across the positive-event-weighted and link-year-weighted MoM alternatives; does not reduce cross-seed membership churn.

Per-family or per-bin EB remains the cleaner methodological answer to the non-constant dispersion finding from session 1, and is recommended for v2 paired with the planned facility-family split.
