A checkable literature review for Open Road Risk

How AI-assisted extraction, source grounding and human review were used to build an auditable evidence base.

methods

literature-review

A site methods note on using AI-assisted literature review without treating the model as an authority.

Published: June 4, 2026

Modified: July 1, 2026

Open Road Risk needed a literature review that could do more than summarise papers. It needed an evidence base that could audit modelling decisions.

The project estimates exposure-adjusted collision risk for around 2.17 million road links across Great Britain. That means the literature review has a practical job: establish what the road-safety modelling field already knows, identify where the project is aligned with that field, and expose where the modelling choices are weak, premature, or unsupported.

AI helped with the labour of that review. It was useful for extracting, organising and comparing sources. But it was only useful when the outputs were made checkable.

That is the central point of this note: AI-assisted literature review is not mainly a prompting problem. It is a domain-knowledge and verification problem.

How to read this page

This is a project methods note, not a formal systematic review and not a product review of AI research tools. It explains how Open Road Risk used AI to help build a checkable literature record, and where human judgement still had to sit in the workflow.

The page is included on the project site because it only makes full sense alongside the published evidence register, extraction prompts, modelling notes and code. The aim is not to show that one AI system is reliable. It is to show how a workflow can remain inspectable even when the tools used to support it change.

The problem

An ungrounded language model is a poor authority in a specialised field. It can summarise fluently, misread confidently and invent sources. That is not a reason to avoid AI-assisted review. It is a reason to design the review around verification.

This is not a flaw that the next model removes; it is structural. A language model is trained to produce a plausible continuation, not a true one. OpenAI’s own analysis argues that hallucination is a predictable consequence of how models are trained and scored — they are rewarded for confident guessing over admitting uncertainty — and that it persists in state-of-the-art systems ¹. Attention over a long input is also finite and uneven: models use information best at the start and end of a document and measurably worse in the middle, even models built for long contexts ². So newer models fabricate less, but they do not change the requirement to check. They lower the rate, not the risk — and they fail hardest exactly where it is hardest to notice.

The measured rates illustrate the point rather than define it. On older models, Walters and Wilder (2023) found 55% of GPT-3.5 and 18% of GPT-4 bibliographic citations fabricated ³; Chelli et al. (2024) found hallucination rates of roughly 29%–91% across models reproducing the references of existing systematic reviews ⁴. Those specific numbers will keep falling with each generation. What does not fall is the structural tendency, and it is worst in the specialised, less-covered corner of a field: fabrication runs higher on specialised topics than well-covered ones ⁵, and citation accuracy varies sharply by discipline ⁶ — which is precisely the part you do not yet know well enough to check quickly.

The answer is not to ask for a better summary. It is to stop treating the model as an authority. Each claim needs a source. Each extraction needs a limitation field. Each synthesis needs to remain traceable back to the source it came from.

The existing field

This is not a claim of methodological novelty. AI-assisted evidence synthesis already has a formal literature, especially in medicine and systematic review work.

Studies now test LLMs for screening, extraction and review support, usually with human verification. Gartlehner et al. (2024) reported 96.3% extraction accuracy against human extraction in a proof-of-concept study ⁷. Schroeder et al. (2025), by contrast, found only 62%–72% consistency with human coding across three models and argued explicitly for human-in-the-loop validation ⁸. Other work reports useful time savings while still requiring human review and accountability ⁹.

Reporting standards are also developing. TRIPOD-LLM and the proposed PRISMA-trAIce checklist formalise transparency, oversight and disclosure for AI-assisted reviews ¹⁰ ¹¹. The publishing norm is clear: AI may assist, but it cannot be an author or carry responsibility. Human authors remain accountable for the content ¹².

So the contribution here is practical rather than novel: a transparent example of applying that discipline in a small, open, non-medical data-science project.

The case

Open Road Risk is built as an open project. The code, prompts, literature pages, evidence register and modelling documentation are all published alongside the work.

The literature review was not treated as decoration. Its job was to check the project against the field:

Are collision counts being modelled with appropriate count methods?
Are exposure offsets being used correctly?
Are model-validation choices defensible for sparse collision data?
Are severe and slight collisions being handled honestly?
Are spatial and network effects acknowledged rather than hidden?
Are claims about risk calibrated to what the data can support?

In this project, the review also worked retrospectively. Some modelling had already been done before the literature record was mature. That made the review useful as an audit: it tested existing choices against established practice and exposed results that were too good to trust.

The method

The method is a small pipeline of narrow AI-assisted steps feeding a structured evidence record, with human responsibility at each stage.

One source at a time

Each extraction starts with one paper, report or technical source. The model is not asked to write a broad literature review from memory. It is asked to extract from a specific source.

That matters because different stages fail in different ways. A prompt that searches, extracts, summarises and synthesises in one step can produce a clean-looking answer while hiding where the error entered. Splitting the work makes errors easier to find.

The workflow is:

extract one source;
record the claims and limitations;
add the source to the evidence register;
synthesise later, only when answering a specific question.

Human acquisition as a verification step

In practice, source-grounded review did not mean that the AI could fetch everything itself. A lot of useful material was not available as clean text to a model. Some papers and reports were reachable to a person but difficult to extract automatically because of publisher pages, redirects, PDF viewers, login flows, bot checks or other access friction. That could happen even when the source was nominally open access.

That friction was irritating, but it also had methodological value. It forced a human step at the point where it mattered most: deciding whether the source was real, relevant, accessible and worth adding to the evidence record. The file itself became part of the audit trail. Without a defined source file, stable URL or recorded extraction target, it is difficult to pass the same material to another model, rerun the extraction, or check whether a later summary was based on the actual source rather than a plausible description of it.

So the workflow treated source acquisition as part of verification, not just administration. The human had to find the paper or report, open it, scan whether it was the right thing, and only then use AI to extract from it. That step slowed the process down, but it also reduced the chance of building an evidence record around sources that were irrelevant, misidentified, inaccessible, or did not exist.
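One way to keep the source artefact defined is to record a content hash next to the citation, so a later rerun can confirm it is reading the same file. This is a minimal sketch, not the project's actual register format: the `SourceArtefact` fields, the placeholder bytes, the acquisition date and the reviewer name are all illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceArtefact:
    """One acquired source, pinned so the same bytes can be re-extracted later."""
    citation: str      # human-readable reference
    url: str           # stable URL or DOI recorded by the reviewer
    sha256: str        # hash of the exact file passed to the model
    acquired_on: date  # when a person opened and checked the file
    checked_by: str    # who confirmed it was the right source


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def same_source(artefact: SourceArtefact, data: bytes) -> bool:
    """True if `data` is byte-identical to the file originally recorded."""
    return artefact.sha256 == fingerprint(data)


# Placeholder bytes stand in for a real PDF read with Path(...).read_bytes().
pdf_bytes = b"%PDF-1.7 placeholder contents"
artefact = SourceArtefact(
    citation="Retallack & Ostendorf (2020)",
    url="https://doi.org/10.3390/ijerph17041393",
    sha256=fingerprint(pdf_bytes),
    acquired_on=date(2026, 6, 1),  # illustrative date, not a project record
    checked_by="reviewer",         # illustrative name
)

print(same_source(artefact, pdf_bytes))         # True
print(same_source(artefact, pdf_bytes + b"x"))  # False
```

A hash only proves that the file is unchanged. Whether it is the right paper is still the human check described above.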

Tools are useful, but they do not replace the workflow

This workflow sits alongside a growing set of AI-assisted research tools. General research agents, such as ChatGPT Deep Research, are useful for broad discovery and synthesis because they can search across online sources, work with uploaded files and produce cited reports ¹³. Literature-specific systems, such as Elicit or Undermind, build more of the literature-review process into the product: paper search, screening, extraction, report generation, source libraries and citation-backed claims ¹⁴ ¹⁵. Other tools support adjacent parts of the process: Semantic Scholar and Litmaps help discover and map literature, while citation tools such as scite help inspect how papers are cited by later work ¹⁶ ¹⁷ ¹⁸.

Those tools are useful, but they do not remove the central problem. They can still return plausible but wrong answers, miss inaccessible sources, over-rely on abstracts, or summarise from material that the reviewer has not actually checked. Deep Research is a good example: it is often helpful for broad exploration, but OpenAI also notes that it can still hallucinate facts, make incorrect inferences, struggle to distinguish authoritative information from rumours, and fail to convey uncertainty accurately ¹⁹.

The distinction in this project was therefore not “which AI tool is best”. It was whether the workflow left a checkable chain: source identified, file acquired, prompt recorded, extraction kept atomic, claim linked to evidence, limitation recorded, synthesis delayed. A tool can support that chain, but it should not replace it.

A structured evidence record

Once a source has been acquired and selected, each evidence-record entry records:

what the source studies;
what methods it uses;
what it claims;
what evidence supports the claim;
what it does not show;
how transferable it is to Open Road Risk;
whether the project needs to act on it.

The limitation field is the most important part. It stops the literature record becoming a pile of positive-sounding summaries. A useful extraction does not only say “this paper supports X”. It also says whether the paper is junction-level, area-level or link-level; whether it uses open data; whether it models frequency, severity or both; whether it validates spatially; and whether its conclusions transfer to a UK road-link risk model.
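The record fields listed above can be sketched as a small data structure that refuses to pass an entry with no limitations. The field names and the `problems` helper are assumptions made for illustration, not the project's published schema, and the claim and evidence strings are placeholders rather than text extracted from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceRecord:
    """One entry per source. Field names are illustrative, not the project's schema."""
    source: str
    studies: str
    methods: list[str]
    claims: list[str]
    supporting_evidence: str
    does_not_show: list[str] = field(default_factory=list)  # the limitation field
    transferability: str = ""
    action_needed: bool = False

    def problems(self) -> list[str]:
        """List the gaps that would reduce this entry to a positive-sounding summary."""
        issues = []
        if not self.claims:
            issues.append("no claims recorded")
        if not self.does_not_show:
            issues.append("no limitations recorded")
        if not self.transferability.strip():
            issues.append("transferability not assessed")
        return issues


record = EvidenceRecord(
    source="Retallack & Ostendorf (2020)",
    studies="120 urban intersections in Adelaide, hourly SCATS traffic volumes",
    methods=["Poisson", "negative binomial"],
    claims=["<claim text copied from the paper>"],          # placeholder
    supporting_evidence="<table or page reference>",        # placeholder
    does_not_show=[
        "intersection-level evidence is not link-year evidence",
        "dense hourly counts are not sparse AADF-style exposure estimates",
    ],
    transferability="Partial: close enough to matter, not a direct match",
)

print(record.problems())  # [] because every required field is filled
```

Making the limitation field a hard check, rather than an optional note, is the design choice that matters here: an entry without it is flagged before it reaches the register.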

Atomic records, late synthesis

The record stays atomic: one entry per source.

That is deliberately slower than asking for one neat summary. But early synthesis loses the detail needed later. If a model choice depends on whether a paper used a Poisson, negative-binomial, zero-inflated or spatial model, the original extraction needs to stay available. If a validation choice depends on whether a study used temporal holdout, spatial holdout or balanced accuracy, that detail cannot be flattened too early.

The topic syntheses then draw from the source records. For example, the crash-frequency model synthesis sits on top of the individual paper extractions rather than replacing them.

Grounding and checking

Every important claim is grounded in a real source. Grounding does not make the model safe by itself: retrieval-augmented systems can still produce unsupported claims ²⁰ ²¹. It makes the output checkable.

The operating rule is simple:

Treat unaided model claims as leads. Treat source-grounded claims as candidates. Treat checked claims as usable.

For load-bearing claims, the citation has to resolve, the number has to match the source, and the interpretation has to survive re-reading. Important extractions are sometimes rerun with a second model as a second reader, but the second model is not treated as proof. It is a way to find disagreement.
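The operating rule can be written down as a small status function. This is a sketch of the rule as stated in this section; the function name, the enum and the boolean inputs are hypothetical, and each input stands for a check that a person has to make.

```python
from enum import Enum


class ClaimStatus(Enum):
    LEAD = "lead"            # unaided model claim
    CANDIDATE = "candidate"  # grounded in a source, not yet checked
    USABLE = "usable"        # grounded and checked


def claim_status(
    has_source: bool,
    citation_resolves: bool,
    number_matches: bool,
    reading_holds: bool,
) -> ClaimStatus:
    """Classify a claim by how far it has been verified."""
    if not has_source:
        return ClaimStatus.LEAD
    if citation_resolves and number_matches and reading_holds:
        return ClaimStatus.USABLE
    return ClaimStatus.CANDIDATE


# A grounded claim whose number has not yet been compared with the source.
print(claim_status(True, True, False, True).value)  # candidate
```

A second-model rerun deliberately has no input here: agreement between models is a prompt to look again, not a check that promotes a claim.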

Verification can also be made quantitative rather than anecdotal. Holding out a small, human-checked gold-standard sample and scoring the AI extraction against it turns “I checked the important ones” into a measured agreement rate — the same move the evidence-synthesis studies above use to report extraction accuracy. But a score is not self-validating. A high number can mean the task was easy, the sample unrepresentative, or — as the modelling section shows — that something leaked. Benchmarks measure the gradeable part; they cannot tell you whether the thing being graded is the right thing. That judgement stays with the reader.
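A measured agreement rate on a small sample is itself uncertain, so it helps to report an interval with it. The sketch below scores AI-extracted field values against a human-checked sample and adds a Wilson score interval. The ten field values are made up for illustration, and exact string matching is a simplifying assumption; real extractions usually need a looser comparison.

```python
import math


def agreement_rate(ai: list[str], gold: list[str]) -> float:
    """Share of fields where the AI extraction equals the human-checked value."""
    if len(ai) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == g for a, g in zip(ai, gold))
    return matches / len(gold)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Illustrative sample: ten extracted fields, one disagreement.
gold = ["poisson", "nb", "hourly", "intersection", "yes",
        "no", "adelaide", "scats", "frequency", "open"]
ai = list(gold)
ai[3] = "link"

rate = agreement_rate(ai, gold)
low, high = wilson_interval(9, len(gold))
print(f"agreement {rate:.2f}, interval {low:.2f} to {high:.2f}")
```

With only ten fields the interval is wide, which is the point: a single headline percentage from a small gold-standard sample should be read with that width in mind.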

Recorded and disclosed

The prompts and AI-assisted steps are part of the project record. That matters because AI-assisted work should be auditable. It should be possible to see what was extracted, from where, and under what instruction.

The disclosure is not a formality. It is part of the method. The model can assist with extraction, comparison and drafting, but the responsibility stays with the author.

A small prompt-ablation case study

The workflow above can sound cleaner than the practice. To make the claim more inspectable, I ran a small prompt-ablation case study using one representative road-safety paper: Retallack and Ostendorf (2020), Relationship Between Traffic Volume and Accident Frequency at Intersections ²².

This was not a benchmark of ChatGPT and not proof that one prompt is optimal. It was a worked example. The question was narrower: when the paper, project context and extraction constraints are changed, how does the output change?

The paper was deliberately useful but imperfect for Open Road Risk. It is close enough to matter — traffic volume, accident frequency, Poisson and negative-binomial count models, rainfall risk and severity appear in the study — but it is not a direct match. It studies 120 urban intersections in Adelaide using hourly SCATS traffic-volume data and 1,629 matched motor-vehicle accidents. That makes it a good test case because a useful extraction should preserve both the relevance and the limits: intersection-level evidence is not link-year evidence, and dense hourly monitored traffic counts are not the same as sparse AADF-style national exposure estimates.

The runs used fresh temporary ChatGPT chats. The model setting was kept constant rather than optimised separately for each condition. The point was to test the input condition, not to compare models. A current medium-reasoning model was chosen because it was strong enough to be a realistic workplace tool, but not chosen to make the task artificially easy. This also kept the exercise closer to the kind of AI-assisted document work now appearing in workplace tools such as Microsoft 365 Copilot, rather than a best-case benchmark setting.

Run	Inputs given	Prompt condition	Output artefact	What it tested
A	Paper PDF + full Open Road Risk extraction prompt and project dossier	Full structured prompt	`A_paper-extraction-retallack-2020-traffic-volume-accident-frequency-intersections.md`	Best-case extraction under the project workflow
B	Paper PDF only	Generic project-aware extraction prompt	`B_road_safety_paper_methodological_extraction.md`	Whether a lighter project prompt still captures the useful methods
C	Paper reference only, no PDF	Reference-only extraction prompt	`C_reference_only_methodological_extraction_retallack_ostendorf_2020.md`	Whether the model refuses unsupported specificity
D	Paper reference + Open Road Risk context, no PDF	Conservative extraction with project context	`D_retallack_ostendorf_2020_conservative_extraction.md`	Whether project context creates useful focus or false confidence
E	Paper PDF only	Vague summary prompt	`E_retallack_ostendorf_2020_for_open_road_risk.md`	What happens when the task asks for usefulness before structured extraction
F	Paper PDF only	Structured extraction fields, but no Open Road Risk dossier	`F_road_safety_methodological_extraction.md`	What structure alone adds without project-specific transferability rules

The results were useful, but not in a laboratory-clean way. Run C behaved as intended: it was conservative and treated the output as suitable for screening, not full evidence extraction. Run D was more complicated. Although no PDF was attached, the model reported that it could see article or index-level information. That made the output stronger than a pure reference-only extraction, but it also changed what the condition meant. I kept it because the messiness is part of the lesson: in practical AI-assisted review, the reviewer needs to record not just the prompt, but what the model could actually access.

The comparison showed three practical things.

First, source access matters. Without the paper, the safest output is a screening note: title, likely topic, broad relevance and a list of details that cannot be known. If the output contains sample sizes, model families, coefficients or page references without access to the source, those details should be treated as unsupported.

Second, structure matters even without the full project dossier. Run F was not just filler. It captured a usable paper-level record: citation metadata, response variable, exposure handling, SCATS traffic counts, spatial unit, temporal unit, engineered congestion index, Poisson and negative-binomial handling, model-selection results and limitations. That suggests the extraction schema itself does useful work before any project-specific judgement is added.

Third, the project dossier matters most for transferability and restraint. The full Open Road Risk prompt did not merely ask for more detail. It told the model what kind of detail mattered: link-year compatibility, exposure offsets, sparse AADF transferability, grouped validation, post-event leakage, severity handling and least-disruptive repo actions. That is where domain knowledge enters the prompt. It does not guarantee correctness, but it makes the failure modes easier to spot.

The vague run was the cautionary example. It produced a readable summary and some reasonable project implications, but it was less useful as evidence. It moved quickly toward synthesis: what the paper means, why it matters, and what the project should do. That is often what the reader wants, but it is not the same as an auditable extraction. Once methods, limitations and transferability are flattened into a narrative, it becomes harder to see which claims came from the source and which came from the model’s generalisation.

The case study does not prove that the workflow is accurate. It shows why the workflow is inspectable. The published artefacts let a reader compare the same source under different evidence conditions and judge whether the structured prompt, source file and project context changed the output in useful ways.

The main lesson is therefore modest: prompt design is not just wording. A good extraction prompt defines the task boundary, the project context, the fields to preserve, the claims not to make, the limits to record, and the action discipline to apply. That does not replace verification. It gives verification something solid to work on.

What changed

The evidence record did real work. It did not just support the project after the fact; it changed how results were interpreted.

Two examples matter.

Catching a leak

A headline model result came in implausibly high: the gradient-boosted (XGBoost) model reported a pseudo-R² of about 0.86.

That should have been suspicious. Injury-collision models at link-year grain are noisy, sparse and hard to predict. A result that high did not fit the expectations set by the literature. The evidence record made that mismatch visible.

The audit found the problem: a heavy-goods-vehicle traffic feature had been joined so that it populated only the link-years that already contained collisions. That leaked the outcome into a predictor. The model was not discovering an unusually strong signal; it was being fed information it should not have had.

After the join was fixed, the honest XGBoost figure was about 0.32. The clean GB retrain later produced XGBoost pseudo-R² 0.325, while the Poisson GLM reached 0.566 on its own downsampled in-sample surface. Those figures are not a direct horse race, but the leak diagnosis still changed how the boosted model was interpreted: the earlier 0.86 result had been flattering the more complex model.

That is exactly what a literature-backed audit should do. It gives you enough field expectation to question a result that looks too good.
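The join fault described in this section leaves a detectable signature: the feature is populated far more often on rows with collisions than on rows without. The check below is a minimal sketch of that idea on toy rows, not the project's audit code. The column names `hgv_flow` and `collisions` and the `max_gap` threshold are hypothetical.

```python
def populated_share_by_outcome(rows, feature, outcome):
    """Share of rows with a non-missing feature, split by zero vs positive outcome."""
    counts = {"zero": [0, 0], "positive": [0, 0]}  # [populated, total]
    for row in rows:
        key = "positive" if row[outcome] > 0 else "zero"
        counts[key][1] += 1
        if row.get(feature) is not None:
            counts[key][0] += 1
    return {
        key: (populated / total if total else float("nan"))
        for key, (populated, total) in counts.items()
    }


def looks_leaky(shares, max_gap=0.5):
    """Flag a feature whose coverage differs sharply between outcome groups."""
    return abs(shares["positive"] - shares["zero"]) > max_gap


# Toy link-years: the feature exists only where a collision was recorded.
rows = [
    {"collisions": 0, "hgv_flow": None},
    {"collisions": 0, "hgv_flow": None},
    {"collisions": 2, "hgv_flow": 310.0},
    {"collisions": 1, "hgv_flow": 125.0},
]

shares = populated_share_by_outcome(rows, "hgv_flow", "collisions")
print(shares)               # {'zero': 0.0, 'positive': 1.0}
print(looks_leaky(shares))  # True
```

A check like this finds one kind of leak only. It was the field expectation from the literature that prompted looking in the first place.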

Details are documented on the feature-engineering page and model-results page.

Avoiding premature model escalation

The literature record also stopped a more complex model from being adopted too early.

Collision counts are often over-dispersed: they vary more than a simple Poisson model assumes. The road-safety literature has a standard response to that problem: negative-binomial models. The count-model synthesis, drawing on sources such as Lord and Mannering (2010) and Ver Hoef and Boveng (2007), made that clear.

The project turned this into a pre-specified rule: move from a Poisson GLM to a negative-binomial GLM only if the Poisson over-dispersion ratio crossed 1.5.

The fitted value came in at about 1.40. That is high enough to keep negative binomial on the books, but below the pre-set threshold for switching immediately. So the Poisson GLM was retained, and negative binomial was recorded as the next model-family move rather than adopted opportunistically.
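The pre-specified rule is simple enough to state as code. The sketch below assumes the over-dispersion ratio is the Pearson chi-square divided by the residual degrees of freedom, which is a common definition but is not confirmed by this note. The toy counts and fitted means are illustrative; only the 1.5 threshold and the 1.40 value come from the text.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square over residual degrees of freedom for a Poisson fit."""
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / dof


SWITCH_THRESHOLD = 1.5  # the pre-set threshold stated in the text


def next_model_family(dispersion, threshold=SWITCH_THRESHOLD):
    """Apply the rule: switch only if the ratio crosses the threshold."""
    if dispersion > threshold:
        return "negative-binomial GLM"
    return "Poisson GLM"


# Toy data: six observed counts against fitted Poisson means, two parameters.
observed = [0, 1, 0, 3, 0, 2]
fitted = [0.5, 1.0, 0.5, 1.5, 0.5, 1.0]
print(pearson_dispersion(observed, fitted, n_params=2))  # 1.0

# The value reported in the text stays below the threshold.
print(next_model_family(1.40))  # Poisson GLM
```

Fixing the threshold before looking at the fitted value is what makes this a rule rather than a rationalisation. The code only records the decision; it does not justify the choice of 1.5.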

At the same time, empirical-Bayes shrinkage was added as a parallel ranking layer (k) for sparse link-level estimates, demoting high-AADT links with no observed collisions. It was adopted for that re-ranking behaviour rather than for stabilising the ranking: a cross-seed check found it did not reduce seed-induced ranking instability. A separate zero-calibration diagnostic rejected the strict Poisson assumption (p …), which is why negative binomial remains the next planned escalation.

The value of the evidence record here was restraint. It did not just say “the field uses negative binomial, therefore switch”. It supported a rule for when switching was justified.

What the workflow gives you

The workflow gives four things that a normal chatbot summary does not.

First, it creates an audit trail. You can trace a modelling decision back to the sources that justify it, and to the sources that limit it.

Second, it creates field expectations. That matters because model outputs are not self-interpreting. A pseudo-R² of 0.86 only looks impressive until the literature tells you it is implausible for the task.

Third, it creates disciplined restraint. A model can always suggest a more complex method. The evidence record helps decide whether the complexity is justified now, justified later, or not justified at all.

Fourth, it separates tool usefulness from evidence quality. A research agent can help find, summarise or compare sources, but the durable evidence is the source artefact, the extraction prompt, the recorded output, the limitation note and the later check against the original material.

Recommendations

The process is transferable beyond road-safety modelling.

Set the scope yourself. The model can sort, extract and sharpen, but it should not decide what the review is for.
Use tools for bounded jobs. Use research agents, search tools and citation tools for discovery, screening or comparison; do not treat their final report as the evidence record.
Use one narrow prompt per source. Do not ask the model to discover, extract and synthesise everything in one step.
Acquire the source yourself. Treat finding, opening and sanity-checking the paper or report as part of the review, not as clerical work to avoid.
Keep the source artefact defined. Record the file, URL or extraction target clearly enough that the same source can be passed to another model or checked by a person later.
Use a structured evidence record. Record claims, evidence, limitations and transferability.
Keep entries atomic. Synthesis should happen late, when you are answering a specific question.
Ground every claim in a real source. Treat unsupported model claims as leads, not facts.
Verify load-bearing claims manually. Check numbers, citations and interpretations against the original source.
Use the record as an audit tool. It should check existing work as well as inform new work.
Disclose the AI-assisted steps. The workflow should be visible enough that others can inspect it.
Publish one prompt-ablation example. Run the same source under a few input conditions — full paper, reference only, structured prompt, vague prompt — and publish the outputs so readers can inspect the difference.

The limit

This is not the fastest way to use AI. It is slower than asking for a summary and pasting the answer. It is slower again when a person has to find the source file, check that it is the right source, and make it available as an artefact before extraction.

That is the cost of making the work defensible.

The model does not provide expertise. Nor does a specialised research tool, by itself. These systems provide search, labour, pattern matching and a second pass over material that would otherwise be slow to organise. The expertise still has to be built by reading, checking, comparing and deciding what the evidence means for the project.

Used this way, AI does not replace the literature review. It helps build a review that can be checked. The point is not to find a tool that removes human judgement; it is to design the workflow so that human judgement is applied at the points where failure would matter.

For Open Road Risk, that was enough to make the literature record useful: it exposed a leaked result, constrained a model-family decision, and created an evidence base that the project can keep testing itself against.

AI-use disclosure

This note, and the literature workflow it describes, used large language models for source extraction, drafting and review. Other AI-assisted research tools were considered as context, but the project method was not built around any single product. The process was source-grounded and structured around per-source records, explicit limitations, defined source artefacts and human verification.

All figures, claims and citations were checked by the author against the project’s source artefacts and, where relevant, against the original publications. Final responsibility for the content is the author’s.

Sources

Further reading. On tools that productise parts of this workflow, start with the official product documentation for Elicit, Undermind, Semantic Scholar, Litmaps and scite rather than relying on comparison-list articles, which date quickly.

The project. Open Road Risk — code, pipeline and extraction prompts: https://github.com/ThomasHSimm/open-road-risk · literature-review pages and methodology: https://openroadrisk.org

Footnotes

1. Kalai, Nachum, Vempala & Zhang (2025), Why Language Models Hallucinate, arXiv:2509.04664 (OpenAI). Argues hallucination is a predictable consequence of training and evaluation that reward confident guessing over abstaining, and that it persists in state-of-the-art models. https://arxiv.org/abs/2509.04664
2. Liu et al. (2024), Lost in the Middle: How Language Models Use Long Contexts, Transactions of the ACL 12:157–173. Models use information best at the start and end of the input context and degrade in the middle, even long-context models. https://aclanthology.org/2024.tacl-1.9/
3. Walters & Wilder (2023), Fabrication and errors in the bibliographic citations generated by ChatGPT, Scientific Reports 13:14045. Found 55% of GPT-3.5 and 18% of GPT-4 bibliographic citations fabricated. https://doi.org/10.1038/s41598-023-41032-5
4. Chelli et al. (2024), Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews, JMIR 26:e53164. Hallucination rates of 39.6% for GPT-3.5, 28.6% for GPT-4 and 91.4% for Bard when reproducing systematic-review references. https://www.jmir.org/2024/1/e53164
5. Linardon et al. (2025), Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models, JMIR Mental Health. Fabrication worse on specialised topics. https://mental.jmir.org/2025/1/e80371
6. Mugaanyi et al. (2024), Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing, JMIR. Citation and identifier accuracy varies sharply by discipline. https://www.jmir.org/2024/1/e52935
7. Gartlehner et al. (2024), Data extraction for evidence synthesis using a large language model: a proof-of-concept study, Research Synthesis Methods 15(4):576–589. Reports Claude 2 at 96.3% extraction accuracy compared with human extraction. https://doi.org/10.1002/jrsm.1710
8. Schroeder, Jaldi & Zhang (2025), Large Language Models with Human-In-The-Loop Validation for Systematic Review Data Extraction, arXiv:2501.11840 (preprint). Reports 62%–72% consistency with human coding and argues for human-in-the-loop validation. https://arxiv.org/abs/2501.11840
9. AI-Assisted Data Extraction With a Large Language Model: A Study Within Reviews, Annals of Internal Medicine. AI-assisted extraction reported as 91.0% accurate while saving around 41 minutes per study; discusses the “second rater” pattern and TRIPOD-LLM reporting. https://www.acpjournals.org/doi/10.7326/ANNALS-25-00739
10. AI-Assisted Data Extraction With a Large Language Model: A Study Within Reviews, Annals of Internal Medicine. AI-assisted extraction reported as 91.0% accurate while saving around 41 minutes per study; discusses the “second rater” pattern and TRIPOD-LLM reporting. https://www.acpjournals.org/doi/10.7326/ANNALS-25-00739
11. Holst et al. (2025), Transparent Reporting of AI in Systematic Literature Reviews: Development of the PRISMA-trAIce Checklist, JMIR AI 4:e80247. Proposed checklist for reporting AI use in evidence synthesis. https://ai.jmir.org/2025/1/e80247
12. Use of AI tools in the publishing process (2026), Frontiers in Research Metrics and Analytics. Summarises the ICMJE/COPE/WAME consensus: disclosure required, AI cannot be an author, and authors must verify output. https://www.frontiersin.org/journals/research-metrics-and-analytics/articles/10.3389/frma.2026.1740510/full
13. OpenAI Help Center (2026), Deep research in ChatGPT. Describes Deep Research as a tool for planning, researching and synthesising complex questions into a documented report using web sources, uploaded files and connected apps. https://help.openai.com/en/articles/10500283-deep-research-faq
14. Elicit (2026), AI for Scientific Research. Describes academic search, research reports, systematic-review support, source libraries and sentence-level citations. https://elicit.com/
15. Undermind (2026), Your AI co-researcher for the literature. Describes literature search, report iteration, full-text work, alerts and traceable inline citations. https://www.undermind.ai/
16. Semantic Scholar (2026), AI-Powered Research Tool. Describes a free AI-powered scientific-literature search service and API. https://www.semanticscholar.org/
17. Litmaps (2026), Literature review software for better research. Describes discovery, visualisation, collaboration and monitoring tools for scientific literature. https://www.litmaps.com/
18. Scite.ai. Citation-context tool that classifies citations as supporting, contradicting or mentioning a claim. https://scite.ai/
19. OpenAI (2025), Introducing deep research. Notes that Deep Research can still hallucinate facts, make incorrect inferences, struggle to distinguish authoritative information from rumours, and fail to convey uncertainty accurately. https://openai.com/index/introducing-deep-research/
20. Lewis et al. (2020), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020. Introduces RAG models that combine parametric sequence-to-sequence generation with retrieved non-parametric memory for knowledge-intensive tasks. https://arxiv.org/abs/2005.11401
21. Hallucination Mitigation for Retrieval-Augmented LLMs: A Review (2025), Mathematics 13(5):856. Review noting that retrieval-augmented systems can still produce unsupported output. https://www.mdpi.com/2227-7390/13/5/856
22. Retallack & Ostendorf (2020), Relationship Between Traffic Volume and Accident Frequency at Intersections, International Journal of Environmental Research and Public Health 17(4):1393. Open-access road-safety paper used for the prompt-ablation case study. https://doi.org/10.3390/ijerph17041393