Open Road Risk

OSM Feature Coverage Diagnostic

1 Overall coverage

| column | n_links | n_filled | pct_coverage |
| --- | ---: | ---: | ---: |
| is_unpaved | 2167557 | 350597 | 16.2 |
| lanes | 2167557 | 158217 | 7.3 |
| lit | 2167557 | 201442 | 9.3 |
| speed_limit_mph | 2167557 | 1223142 | 56.4 |
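
The pct_coverage column is the share of links with a populated value. As a quick check, the sketch below recomputes it from the n_links and n_filled counts in the table above. The constant and function names are illustrative and are not taken from the project code.

```python
# Counts copied from the overall coverage table.
N_LINKS = 2_167_557
N_FILLED = {
    "is_unpaved": 350_597,
    "lanes": 158_217,
    "lit": 201_442,
    "speed_limit_mph": 1_223_142,
}


def pct_coverage(n_filled: int, n_links: int) -> float:
    """Percentage of links with a populated value, rounded to one decimal."""
    return round(100 * n_filled / n_links, 1)


coverage = {col: pct_coverage(n, N_LINKS) for col, n in N_FILLED.items()}
for col, pct in coverage.items():
    print(f"{col:<16} {pct:5.1f}")
```

The recomputed values agree with the pct_coverage column to one decimal place.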

2 Coverage by road class (%)

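| column | Motorway | A Road | B Road | Classified Unnumbered | Unclassified | Not Classified | Unknown |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| is_unpaved | 36.2 | 27.4 | 24.9 | 20.2 | 16.7 | 12.5 | 9.2 |
| lanes | 40.3 | 34.7 | 21.9 | 13.2 | 4.4 | 1.9 | 1.5 |
| lit | 38.0 | 25.0 | 16.4 | 10.8 | 10.1 | 3.6 | 2.3 |
| speed_limit_mph | 46.1 | 55.3 | 52.4 | 53.9 | 59.4 | 57.4 | 51.1 |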
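
A per-class breakdown like the one above can be produced by counting populated values within each road class. The following is a minimal sketch on invented toy records, assuming a missing OSM tag is represented as None. The function name and record layout are hypothetical.

```python
from collections import defaultdict

# Toy link records (invented values); None marks a missing OSM tag.
links = [
    {"road_class": "A Road", "lanes": 2, "lit": True},
    {"road_class": "A Road", "lanes": None, "lit": None},
    {"road_class": "Unclassified", "lanes": None, "lit": None},
    {"road_class": "Unclassified", "lanes": None, "lit": False},
    {"road_class": "Unclassified", "lanes": 1, "lit": None},
    {"road_class": "Unclassified", "lanes": None, "lit": None},
]


def coverage_by_group(rows, group_key, columns):
    """Return {column: {group: pct_coverage}}, counting non-None values."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        for col in columns:
            if row.get(col) is not None:
                filled[(col, group)] += 1
    return {
        col: {g: round(100 * filled[(col, g)] / n, 1) for g, n in totals.items()}
        for col in columns
    }


by_class = coverage_by_group(links, "road_class", ["lanes", "lit"])
print(by_class)
```

Note that a value of False for lit counts as populated: coverage measures whether a tag exists, not what it says.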

3 Coverage by latitude band (%)

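| column | 51–52°N | 52–53°N | 53–54°N | 54–55°N | 55–56°N |
| --- | ---: | ---: | ---: | ---: | ---: |
| is_unpaved | 2.7 | 22.4 | 13.7 | 17.2 | 8.9 |
| lanes | 1.2 | 10.2 | 5.6 | 9.2 | 4.3 |
| lit | 1.2 | 11.3 | 8.2 | 12.1 | 8.5 |
| speed_limit_mph | 10.8 | 75.0 | 47.8 | 66.0 | 36.2 |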
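
One way to form one-degree latitude bands is to floor each link's latitude. The sketch below assumes half-open bands (a link at exactly 53.0°N falls in 53–54°N) and a single representative latitude per link. Neither detail is stated in this report, and the toy values are invented.

```python
import math
from collections import Counter


def latitude_band(lat: float) -> str:
    """Label the one-degree band containing `lat`, e.g. 53.48 -> '53–54°N'."""
    lower = math.floor(lat)  # assumption: bands are [lower, lower + 1)
    return f"{lower}–{lower + 1}°N"


# Toy (latitude, speed_limit_mph) pairs; None marks a missing tag.
links = [(51.5, None), (51.9, 30), (52.2, 30), (52.7, 60), (53.48, None), (53.0, 40)]

totals = Counter(latitude_band(lat) for lat, _ in links)
filled = Counter(latitude_band(lat) for lat, speed in links if speed is not None)
band_coverage = {band: round(100 * filled[band] / n, 1) for band, n in totals.items()}
print(band_coverage)
```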

4 Value distributions (populated rows only)

4.1 speed_limit_mph

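| road_class | n_filled | n_distinct | min | q25 | median | q75 | max |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Motorway | 1882 | 32 | 15 | 70 | 70 | 70 | 70 |
| A Road | 86016 | 65 | 10 | 30 | 36 | 50 | 115 |
| B Road | 46830 | 58 | 10 | 30 | 30 | 40 | 125 |
| Classified Unnumbered | 102922 | 61 | 10 | 30 | 33 | 40 | 195 |
| Unclassified | 630118 | 74 | 6 | 22 | 26 | 30 | 220 |
| Not Classified | 129078 | 62 | 6 | 21 | 25 | 29 | 224 |
| Unknown | 226296 | 76 | 6 | 20 | 22 | 26 | 224 |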

4.2 lanes

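| road_class | n_filled | n_distinct | min | q25 | median | q75 | max |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Motorway | 1644 | 8 | 1 | 2 | 2 | 3 | 10 |
| A Road | 53962 | 7 | 1 | 2 | 2 | 2 | 8 |
| B Road | 19547 | 7 | 1 | 2 | 2 | 2 | 7 |
| Classified Unnumbered | 25119 | 6 | 0 | 2 | 2 | 2 | 5 |
| Unclassified | 47151 | 7 | 0 | 2 | 2 | 2 | 20 |
| Not Classified | 4347 | 6 | 0 | 1 | 2 | 2 | 5 |
| Unknown | 6447 | 6 | 1 | 1 | 2 | 2 | 6 |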
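
The distribution columns above (n_filled, n_distinct, min, q25, median, q75, max) describe populated rows only. A minimal sketch of such a summary is shown below, using the standard library's `statistics.quantiles` with the "inclusive" method. The quantile method actually used for this report is not stated, so treat that choice as an assumption. The toy speeds are invented.

```python
import statistics


def five_number_summary(values):
    """Summarise non-missing values: counts, min, quartiles and max."""
    populated = sorted(v for v in values if v is not None)
    # Assumption: "inclusive" linear interpolation between data points.
    q25, median, q75 = statistics.quantiles(populated, n=4, method="inclusive")
    return {
        "n_filled": len(populated),
        "n_distinct": len(set(populated)),
        "min": populated[0],
        "q25": q25,
        "median": median,
        "q75": q75,
        "max": populated[-1],
    }


# Toy speed limits in mph; None marks links with no tag.
speeds = [30, None, 30, 20, 60, None, 40, 30, 70]
summary = five_number_summary(speeds)
print(summary)
```

Dropping the None values before summarising is what keeps the statistics restricted to populated rows.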

5 Highlights

5.1 Columns with >80% coverage by road class (usable without imputation)

No column × road-class combination reaches 80% coverage.

5.2 Columns with <20% coverage by road class (imputation would invent most values)

| road_class | column | pct_coverage | n_links | n_filled |
| --- | --- | ---: | ---: | ---: |
| Unknown | lanes | 1.5 | 442836 | 6447 |
| Not Classified | lanes | 1.9 | 224878 | 4347 |
| Unknown | lit | 2.3 | 442836 | 10265 |
| Not Classified | lit | 3.6 | 224878 | 8139 |
| Unclassified | lanes | 4.4 | 1060014 | 47151 |
| Unknown | is_unpaved | 9.2 | 442836 | 40849 |
| Unclassified | lit | 10.1 | 1060014 | 107342 |
| Classified Unnumbered | lit | 10.8 | 190921 | 20620 |
| Not Classified | is_unpaved | 12.5 | 224878 | 28159 |
| Classified Unnumbered | lanes | 13.2 | 190921 | 25119 |
| B Road | lit | 16.4 | 89286 | 14618 |
| Unclassified | is_unpaved | 16.7 | 1060014 | 176644 |
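
The list above can be reproduced from the section 2 percentages by filtering on the 20% threshold and sorting by coverage. The sketch below hard-codes those percentages. The variable names are illustrative, not taken from the project code.

```python
ROAD_CLASSES = [
    "Motorway", "A Road", "B Road", "Classified Unnumbered",
    "Unclassified", "Not Classified", "Unknown",
]

# Percent coverage per column, in ROAD_CLASSES order (copied from section 2).
COVERAGE = {
    "is_unpaved": [36.2, 27.4, 24.9, 20.2, 16.7, 12.5, 9.2],
    "lanes": [40.3, 34.7, 21.9, 13.2, 4.4, 1.9, 1.5],
    "lit": [38.0, 25.0, 16.4, 10.8, 10.1, 3.6, 2.3],
    "speed_limit_mph": [46.1, 55.3, 52.4, 53.9, 59.4, 57.4, 51.1],
}

LOW_THRESHOLD = 20.0  # percent

# Tuples sort by coverage first, so the sparsest combinations come first.
low_coverage = sorted(
    (pct, road_class, column)
    for column, values in COVERAGE.items()
    for road_class, pct in zip(ROAD_CLASSES, values)
    if pct < LOW_THRESHOLD
)

for pct, road_class, column in low_coverage:
    print(f"{road_class:<22} {column:<12} {pct:5.1f}")
```

The filter yields the same twelve combinations as the table; no speed_limit_mph combination falls below the threshold.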

5.3 Decision guidance

For each column × road-class, coverage determines the right modelling strategy:

  • ≥80%: Include as-is; drop the small fraction of missing rows.
  • 20–80%: Median-impute and add a binary {col}_imputed flag. The coefficient then partly reflects the imputed default rather than observed values, so interpret it with caution.
  • <20%: The imputed value is invented for >80% of rows, so the coefficient will primarily reflect the imputation default, not genuine signal. Consider excluding the column from the model, or use the road-class median as a proxy only if that proxy is defensible.
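
The three tiers above can be expressed as a small lookup. This is a sketch rather than the project's implementation: the function name and the returned labels are hypothetical, and the boundary handling (exactly 80% is included as-is, exactly 20% is imputed) follows the bullet wording.

```python
def coverage_strategy(pct_coverage: float) -> str:
    """Map a coverage percentage to one of the three tiers above."""
    if not 0 <= pct_coverage <= 100:
        raise ValueError(f"coverage must be in [0, 100], got {pct_coverage}")
    if pct_coverage >= 80:
        return "include_as_is"
    if pct_coverage >= 20:
        return "median_impute_with_flag"
    return "exclude_or_proxy"


# Examples using percentages from the road-class coverage table.
print(coverage_strategy(55.3))  # speed_limit_mph on A Road
print(coverage_strategy(40.3))  # lanes on Motorway
print(coverage_strategy(1.5))   # lanes on Unknown
```

Applied to this report's numbers, every column × road-class combination lands in one of the lower two tiers, since none reaches 80%.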

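Note: for is_unpaved, lanes and lit, coverage is highest on major roads (Motorway, A Road), likely because OSM contributors prioritise high-traffic routes. If those columns are included in a model trained on all road classes, the signal comes almost entirely from major roads and the imputed values for minor roads are close to noise. speed_limit_mph does not follow this pattern: its coverage is lowest on Motorway (46.1%) and highest on Unclassified (59.4%).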
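
The median-impute-with-flag strategy from the 20–80% tier might look like the sketch below. It assumes the median is taken within each road class, which the guidance states explicitly only for the <20% proxy case. The function name, record layout and toy values are hypothetical.

```python
import statistics


def median_impute_with_flag(rows, column, group_key="road_class"):
    """Fill missing `column` values with the group median and flag them.

    Returns new dicts; the input rows are left unchanged.
    """
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row[group_key], []).append(row[column])
    medians = {group: statistics.median(vals) for group, vals in observed.items()}

    out = []
    for row in rows:
        missing = row[column] is None
        new_row = dict(row)
        if missing:
            # Stays None when the group has no observed values at all.
            new_row[column] = medians.get(row[group_key])
        new_row[f"{column}_imputed"] = int(missing)
        out.append(new_row)
    return out


# Toy records (invented values); None marks a missing tag.
rows = [
    {"road_class": "A Road", "speed_limit_mph": 30},
    {"road_class": "A Road", "speed_limit_mph": 50},
    {"road_class": "A Road", "speed_limit_mph": None},
    {"road_class": "Motorway", "speed_limit_mph": 70},
    {"road_class": "Motorway", "speed_limit_mph": None},
]
imputed = median_impute_with_flag(rows, "speed_limit_mph")
print(imputed)
```

Keeping the flag lets a model separate observed values from imputed ones, which is what makes the coefficient on the original column interpretable at all.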
