Citation integrity

Fabricated citations in AI research tools: the data, and how to avoid them

By Antonio Brundo · 10 June 2026 · Updated 10 June 2026

Direct answer: Three large 2026 studies measured fabricated citations in AI research tools. Across 200,000+ citation URLs from commercial LLMs and deep research agents, 3–13% were hallucinated outright and up to 18.5% failed to resolve; a 2.2-million-citation benchmark found per-model reference hallucination rates from 14% to 95%; and trajectory-level audits of six deep research agents found fabrication rates of 10–15% even in the best systems. Detection after the fact cannot bring these rates to zero. The only architecture that can is verification by construction: a citation enters the document only after its identifier resolves against an authoritative registry.

The 2026 evidence, with numbers you can check

Fabricated references stopped being an anecdote in 2026; they became a measured phenomenon. Three studies are worth reading in full (we verified the figures below against the papers before citing them — which is rather the point of this article):

1. Commercial LLMs and deep research agents: 3–13% hallucinated URLs

Rao, Wong & Callison-Burch (arXiv:2604.03173, April 2026) audited citation URLs from ten commercial models and agents on DRBench (53,090 URLs) and three models on ExpertQA (168,021 URLs). Findings: 3–13% of citation URLs are hallucinated (they have no record in the Wayback Machine and likely never existed), and 5–18% fail to resolve overall. Strikingly, the dedicated deep research agents performed worst: Gemini 2.5 Pro Deep Research showed a 13.3% hallucination rate and 18.5% non-resolving citations, against ~3% for the best chat models. Longer, more agentic pipelines produced more fabricated references, not fewer.

2. GhostCite: 2.2 million citations, hallucination rates from 14% to 95%

GhostCite (arXiv:2602.06718, February 2026) analysed 2.2 million citations across 56,381 papers and benchmarked 13 LLMs on citation generation. Per-model reference hallucination rates ranged from 14.23% to 94.93%. The study also found fabricated citations escaping into the published record: about 1.07% of papers at AI/ML and security venues from 2020–2025 contain invalid citations, with an 80.9% increase in 2025 — and surveyed reviewers admitted they rarely verify references during peer review.

3. Inside the research trajectory: fabrication plus misattribution

DeepHalluBench (arXiv:2601.22984, January 2026) audited six deep research agents across the full plan–search–summarise trajectory rather than just the final answer. Even the best systems showed fabrication rates of roughly 10–15% and misattribution rates up to ~22%. In a misattribution, the citation exists but does not say what the agent claims it says. Every evaluated system exhibited what the authors call "non-negligible reliability gaps."

Two distinct failure classes emerge: fabrication (the reference does not exist) and misattribution (it exists but is misused). A trustworthy research tool must address both.

Why this happens: generation is not retrieval

A language model writing a reference from its parameters is performing pattern completion: plausible author names, a plausible journal, a plausible DOI shape like 10.1016/j.xxxx.2023.104.... Nothing in autoregressive generation checks that the identifier was ever registered. Retrieval-augmented systems reduce the problem but reintroduce it at the seams — when the model summarises beyond what was retrieved, merges two sources into one citation, or "remembers" a reference the retriever never returned. That is why the deep research agents in the studies above, with their long multi-step pipelines, accumulated more citation errors: every synthesis step is another opportunity to drift from the retrieved evidence.
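A minimal sketch of why a plausible shape proves nothing: the check below tests only whether a string is formatted like a DOI, using a pattern modelled on the one Crossref recommends for modern DOIs. The function name and the sample identifier are invented for this illustration; the identifier is deliberately made up and is not claimed to be registered anywhere.

```python
import re

# Shape-only test: prefix "10.", a 4-9 digit registrant code, a slash, a suffix.
# This says nothing about whether the identifier was ever registered.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(candidate: str) -> bool:
    """Return True if the string has the syntactic shape of a DOI."""
    return DOI_SHAPE.match(candidate.strip()) is not None


# Hypothetical identifier, invented for this example.
invented = "10.1016/j.example.2023.104999"

print(looks_like_doi(invented))     # True
print(looks_like_doi("not-a-doi"))  # False
```

Passing this check is the most a generated string can demonstrate on its own. Whether the identifier exists can only be answered by a registry lookup.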

Why post-hoc verification can't reach zero

The intuitive fix (generate first, verify afterwards) has a structural ceiling. A checker that catches 90% of fabricated references still lets 1 in 10 of them through; the Rao et al. paper itself frames detection-and-correction as mitigation, not elimination. Post-hoc checking also does nothing about misattribution unless it compares the claim against the source content, and it puts the burden at the worst possible moment: after the document looks finished, when humans are least inclined to challenge it. GhostCite's reviewer survey confirms what everyone suspects: downstream readers do not verify references either.
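The ceiling is simple arithmetic, sketched here under stated assumptions: the 13.3% input is the worst hallucinated-URL rate reported above, while the 90% recall is the hypothetical checker from this section, not a measured figure. The function name is ours.

```python
def residual_fabrication_rate(fabrication_rate: float, checker_recall: float) -> float:
    """Share of all citations that are fabricated AND slip past the checker."""
    return fabrication_rate * (1.0 - checker_recall)


# 13.3% fabricated before checking, 90% of fabrications caught.
rate = residual_fabrication_rate(0.133, 0.90)

print(f"{rate:.2%} of citations are still fabricated")  # 1.33% of citations are still fabricated
```

The residual only reaches zero when recall is exactly 100%, which no after-the-fact detector can guarantee.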

Verified by construction: making the error class impossible

The alternative is architectural. In AutoSearch, a citation is not text the model writes — it is a record the pipeline retrieved. Before any reference can enter the manuscript:

1. The DOI candidate is normalised and resolved live against Crossref; an identifier with no registry record cannot become a citation, full stop (the pipeline is documented step by step in how we verify DOIs via Crossref).
2. Returned metadata (title, year, venue) is compared against the retrieved source, catching the real-DOI-wrong-paper case.
3. A semantic relevance check verifies that the cited work actually supports the statement it is attached to, which addresses misattribution, the second failure class.
4. Anything that fails stays visible in the evidence log as unverified instead of being silently dressed up as a reference.
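The gate described above can be sketched roughly as follows. This is an illustration under assumptions, not AutoSearch's actual code: every function and class name is hypothetical, the title comparison is a crude word-overlap heuristic with a guessed threshold, only the title is compared (not year or venue), and the semantic relevance check is left out. The one external call is the public Crossref REST endpoint `https://api.crossref.org/works/{doi}`, which returns HTTP 404 for an identifier it has no record of.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, Optional


def normalise_doi(raw: str) -> str:
    """Strip common prefixes and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def crossref_lookup(doi: str) -> Optional[dict]:
    """Fetch the Crossref record for a DOI, or None if Crossref has no record."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def titles_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Word-overlap (Jaccard) comparison; the 0.8 threshold is an assumption."""
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return bool(wa and wb) and len(wa & wb) / len(wa | wb) >= threshold


@dataclass
class EvidenceLog:
    verified: list = field(default_factory=list)
    unverified: list = field(default_factory=list)


def admit_citation(raw_doi: str, retrieved_title: str, log: EvidenceLog,
                   lookup: Callable[[str], Optional[dict]] = crossref_lookup) -> bool:
    """Admit a citation only if its DOI resolves and the registry title matches."""
    doi = normalise_doi(raw_doi)
    record = lookup(doi)
    if record is None:
        log.unverified.append((doi, "no registry record"))
        return False
    registry_title = (record.get("title") or [""])[0]  # Crossref titles are lists
    if not titles_match(registry_title, retrieved_title):
        log.unverified.append((doi, "metadata mismatch"))
        return False
    log.verified.append(doi)
    return True


# Offline demo with a stand-in registry (hypothetical entries, no network call).
fake_registry = {"10.1234/abcd": {"title": ["Measuring citation drift"]}}
demo_log = EvidenceLog()
admit_citation("https://doi.org/10.1234/ABCD", "Measuring Citation Drift", demo_log, fake_registry.get)
admit_citation("10.9999/ghost", "A paper that was never written", demo_log, fake_registry.get)
print(demo_log.verified)    # ['10.1234/abcd']
print(demo_log.unverified)  # [('10.9999/ghost', 'no registry record')]
```

The design point is that the function returns a decision and writes to the log, and nothing else produces citations. A rejected identifier is recorded as unverified rather than discarded, so the failure stays auditable.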

This is what "zero fabricated DOIs by construction" means: not a model that hallucinates less, but a pipeline in which a fabricated DOI has no path into the output. The claim is falsifiable — pick any citation in any generated paper and resolve its DOI yourself. The full disclosure model is on the methodology page, and the comparison page shows which tools in this market verify citations against what.

The regulatory stakes are higher than the academic ones

In academia, a fabricated reference is an integrity problem. In regulated documents it is a compliance problem: a clinical evaluation report under EU MDR must rest on a documented, verifiable literature review, and a citation that does not resolve is an immediate credibility finding for a notified body auditor — it calls the entire search documentation into question. Courts have already sanctioned lawyers for AI-invented case citations; regulatory reviewers are now equally alert. Teams drafting CERs and PMCF reports with AI assistance should treat citation verification as a gating control, not a final polish (see our EU MDR/IVDR use case and what MDR Article 61 requires from a literature review).

How to test any research tool in five minutes

1. Ask for a literature review on a topic you know well, with full references.
2. Resolve every DOI at doi.org and open every URL. Count the failures.
3. For three citations that resolve, open the paper and check that it actually supports the sentence citing it.
4. Ask the tool to show its search log: which sources, which queries, what was excluded. If it can't, you are trusting, not verifying.
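The DOI-counting step can be partly automated. The sketch below is one possible approach, with assumptions labelled: the function names are ours, the extraction regex is a heuristic that trims trailing punctuation (and could clip the rare DOI that genuinely ends in a bracket), and the existence check relies on the doi.org handle lookup at `https://doi.org/api/handles/{doi}`, which to our understanding answers HTTP 404 for an unregistered DOI. It covers DOIs only; plain URLs and the claim-versus-source check still need a human.

```python
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

DOI_IN_TEXT = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_dois(text: str) -> list:
    """Pull DOI-shaped strings from a reference list, in order, de-duplicated."""
    found = (m.rstrip(".,;)").lower() for m in DOI_IN_TEXT.findall(text))
    return list(dict.fromkeys(found))


def doi_is_registered(doi: str) -> bool:
    """Ask the doi.org handle lookup whether the DOI exists (404 means it does not)."""
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages are not evidence of fabrication


def audit(text: str, check: Callable[[str], bool] = doi_is_registered) -> dict:
    """Count the DOIs in a reference list and list those that fail the check."""
    dois = extract_dois(text)
    failed = [d for d in dois if not check(d)]
    return {"total": len(dois), "failed": failed}
```

For a live run, paste the tool's reference list into `audit(reference_text)` and compare `len(result["failed"])` against `result["total"]`. Passing a different `check` function lets you test the script offline or swap in another registry.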

Run the same test on AutoSearch — that is precisely the test the architecture is built to pass. For how this compares across the current tool landscape, including tools with strong screening workflows but thinner verification, see our honest comparison of systematic review tools.

FAQ

Do the studies say AI research tools are unusable?

No. They say unverified citation generation is unreliable, with error rates ranging from about 3% to as high as 95% depending on the system and the metric, and that the fix must be architectural. Tools that retrieve, verify, and log can be more auditable than rushed manual work.

Does DOI verification catch every problem?

It eliminates fabricated identifiers and wrong-paper attachments among DOI-bearing sources, and the semantic check addresses misattribution. Sources without DOIs (regulations, trial records, patents) carry other identifiers and are labelled by type rather than blessed with false confidence. No system removes the need for human appraisal of the evidence itself.

Weren't the fabrication numbers even higher in some reports?

Press coverage sometimes blends different metrics (non-resolving links, fabricated references, misattributed claims). The per-study figures above come from the papers themselves; GhostCite's 14–95% range shows how dramatically rates vary by model — which is exactly why per-tool verification beats reputation.

Verify, don't trust

Run a review where every citation is checked before it reaches the page: read the methodology, compare tools on the comparison page, or start a free run and resolve the DOIs yourself.