AI Citation Benchmarks: How Reliable Are LLM References in 2026?

Why AI citation accuracy matters now

Large language models now routinely generate answers with clickable citations, but multiple studies show that a worrying share of those references are wrong, fabricated, or only loosely related to the claims they supposedly support.¹

Across several recent evaluations, only about a quarter to half of AI-generated references are fully reliable, and even top-tier models still fail to ground many statements in the sources they cite.²¹

Even when an LLM looks “citation rich,” benchmarks consistently show that a large fraction of its claims are not actually supported by the links it provides.²

For anyone relying on AI for literature review, technical blogging, or SEO-driven content, understanding these benchmarks is now a core research skill rather than a nice-to-have.³¹

What an “AI citation benchmark” actually measures

Most AI benchmarks were originally designed to measure accuracy on tasks like multiple-choice questions or reasoning, not how honestly models use sources.⁴³

Citation-focused benchmarks fill that gap by measuring how often an LLM’s references are real, relevant, and actually support the text they’re attached to.⁵²

Common metrics you’ll see:

URL validity – share of cited URLs that return a 200 status code and non-empty content.²
Statement-level support – percentage of individual statements in a response that are supported by at least one cited source.²
Response-level support – percentage of full answers where all statements are supported by the cited sources.²
Hallucination rate for references – share of references that are fabricated, have major bibliographic errors, or don’t exist in bibliographic databases.¹
Reference Hallucination Score (RHS) – a combined score (0–11) that weights errors in titles, authors, DOIs, dates, links, and topical relevance for each reference.⁵

You’ll also see classic IR metrics like precision, recall, and F1 being applied to AI-generated reference lists when the benchmark has a known “gold standard” bibliography.¹
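The statement-level and response-level definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming each statement has already been judged (by a human or an LLM) as supported or not; the function names and the `judged` sample data are hypothetical and not taken from any of the cited benchmarks.

```python
from typing import List


def statement_level_support(responses: List[List[bool]]) -> float:
    """Share of all statements backed by at least one cited source."""
    statements = [s for response in responses for s in response]
    return sum(statements) / len(statements) if statements else 0.0


def response_level_support(responses: List[List[bool]]) -> float:
    """Share of responses in which every statement is supported.

    Empty responses are counted as unsupported (an assumption of this sketch).
    """
    if not responses:
        return 0.0
    fully_supported = sum(bool(r) and all(r) for r in responses)
    return fully_supported / len(responses)


# Hypothetical judgments: one inner list per response, one bool per statement.
judged = [
    [True, True, True],
    [True, False, True, True],
    [True, True],
    [False, True, True],
]

print(f"statement-level: {statement_level_support(judged):.1%}")  # 10 of 12
print(f"response-level:  {response_level_support(judged):.1%}")   # 2 of 4
```

A single unsupported statement is enough to fail a whole response, which is why response-level support is always the stricter of the two numbers.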

The major citation benchmarks you should know

A surprising amount of the hard data about AI citations is concentrated in a small set of medical and scientific benchmarks.³

The table below summarises the most influential recent work focused specifically on reference quality.

Key AI citation benchmarks at a glance

Study / Benchmark	Domain & Task	Models Evaluated	Core Citation Metrics	Headline Result
SourceCheckup (Wu et al., 2025, Nat. Commun.)²	Medical Q&A with web and non‑web LLMs	GPT‑4o (API & web/RAG), Claude, Gemini, Mistral, Mixtral, Llama‑2, Meditron	URL validity, statement‑level support, response‑level support	50–90% of responses across models were not fully supported by their own citations; even GPT‑4o with web search achieved only 55% fully supported responses.²
Fabrication & errors in ChatGPT citations (Sci Rep 2023)⁶	General academic writing	GPT‑3.5 & GPT‑4	Fabrication rate, error types in 636 citations	Documented widespread fabricated or erroneous references in AI‑written bibliographies, motivating dedicated citation benchmarks.⁶
Systematic review replication (Chelli et al., JMIR 2024)¹	Orthopaedic systematic reviews	GPT‑3.5, GPT‑4, Bard (Gemini)	Precision, recall, hallucination rate for references	Hallucination rates of 39.6% (GPT‑3.5), 28.6% (GPT‑4), and 91.4% (Bard) for generated references; precision of 9.4% (GPT‑3.5) and 13.4% (GPT‑4), recall of 11.9% and 13.7% respectively, and 0% on both for Bard.¹
Reference Hallucination Score (RHS) benchmark (Aljamaan et al., JMIR Med Inform 2024)⁵	Mixed medical topics, 10 prompts × 10 references each	ChatGPT 3.5, Bing, Perplexity, Elicit, SciSpace, Bard	Per‑reference hallucination score (0–11)	ChatGPT and Bing had the highest hallucination scores (median RHS 11), Perplexity scored mid‑range (7), while SciSpace and Elicit scored as low as 1; Bard failed to produce references.⁵
Multi‑model academic citation analysis (Enago summary, 2025)¹	Academic bibliographic retrieval across multiple topics	Multiple unnamed LLMs, plus domain‑specific nephrology study	% correct references, % fabricated	One study found only 26.5% of AI‑generated references were entirely correct and nearly 40% were erroneous or fabricated; a nephrology study found only 62% of suggested references existed, with 31% fabricated or incomplete.¹
PaperAsk reliability benchmark (ACM, 2026)⁷	Research tasks including citation retrieval and question answering about papers	Multiple LLMs (details vary)	Citation retrieval accuracy, consistency across research tasks	Introduces a multi‑task reliability benchmark where citation retrieval is one of four key capabilities used to evaluate how trustworthy LLMs are for research workflows.⁷

Collectively, these studies converge on a single message: you cannot safely treat AI citations as “plug‑and‑play” references without verification, even when the model is state of the art.¹²

How leading models perform on citation accuracy

The raw numbers are sobering once you look at model‑by‑model performance instead of generic “AI hallucination” headlines.

1. Medical Q&A with SourceCheckup

SourceCheckup generated 800 medical questions (half from Mayo Clinic pages, half from Reddit r/AskDocs) and evaluated roughly 58,000 statement–source pairs across seven LLMs.²

Its automated evaluation, validated against US‑licensed physicians (≈89% agreement), scored each model on URL validity, statement‑level support, and response‑level support.²

Some key findings:

Models without web access produced valid URLs only 40–70% of the time; GPT‑4o’s API variant reached ~70% valid URLs but still often cited pages that didn’t support its statements.²
With web search (RAG), GPT‑4o achieved the best overall support, yet only 55% of its responses were fully supported by the citations it provided.²
Even with RAG, ~30% of GPT‑4o’s individual statements were unsupported by any cited source, and human doctors confirmed that 105 of 110 “unsupported” statement–source pairs indeed lacked support.²
Gemini Ultra 1.0 with RAG had only 34.5% of responses fully supported by its retrieved references, and Gemini Pro’s API version managed only about 10% fully supported responses.²
Open‑source models such as Llama‑2‑70B and the medical model Meditron‑7B produced valid URLs in <5% and <1% of responses respectively, effectively failing at the basic task of citation.²

On an external health benchmark (HealthSearchQA), GPT‑4o with RAG achieved 100% URL validity but only 75.7% statement‑level support and 38.4% response‑level support, again showing that “lots of good links” does not mean the whole answer is properly sourced.²

2. Systematic reviews: can LLMs replicate expert bibliographies?

Chelli et al. took 11 human‑authored systematic reviews on rotator‑cuff pathology and asked GPT‑3.5, GPT‑4, and Bard to retrieve randomized trials that met the same inclusion criteria.¹

They then compared each AI‑produced reference list to the original systematic reviews using classic retrieval metrics:

Precision – how many AI‑proposed papers were actually in the target review?
- GPT‑3.5: 9.4% (13/139)
- GPT‑4: 13.4% (16/119)
- Bard: 0% (0/104)¹
Recall – how many of the review’s real papers did the model recover?
- GPT‑3.5: 11.9% (13/109)
- GPT‑4: 13.7% (15/109)
- Bard: 0% (0/109)¹
Hallucination rates for references – proportion of AI‑provided references judged hallucinated (at least two of title, first author, and publication year were wrong):
- GPT‑3.5: 39.6% (55/139)
- GPT‑4: 28.6% (34/119)
- Bard: 91.4% (95/104)¹

In other words, even GPT‑4 recovered only about one in seven of the actual trials and hallucinated almost one in three of the references it suggested.¹
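Precision and recall against a gold-standard bibliography reduce to set arithmetic. The sketch below assumes references are identified by a normalised key such as a DOI; the identifiers are placeholders, not real papers, and the function names are illustrative.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / gold."""
    hits = retrieved & gold
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Placeholder identifiers standing in for DOIs (hypothetical data).
gold_standard = {"doi:a", "doi:b", "doi:c", "doi:d", "doi:e"}
model_output = {"doi:a", "doi:b", "doi:x", "doi:y"}

p, r = precision_recall(model_output, gold_standard)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f}")
```

The same ratio applied to the counts reported for GPT-3.5 (13 matches among 139 proposed references) gives 13/139 ≈ 9.4% precision.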

3. Reference Hallucination Score (RHS): per‑reference scoring

Aljamaan et al. proposed the Reference Hallucination Score (RHS) to quantify how “clean” individual AI citations are across seven bibliographic fields (title, journal, authors, DOI, date, URL, topical relevance).⁵

In their benchmark:

Six AI tools were tested on 10 medical prompts, each asked to return 10 references. The five tools that produced output yielded 500 references in total (5 tools × 10 prompts × 10 references).⁵
Bard failed to produce usable references for any of the prompts.⁵
Across all tools, the most common hallucination dimension was relevance: 61.6% of references were off‑topic relative to the prompt’s keywords (308/500).⁵
Errors were also very frequent in publication dates (47.4%), author names (45.6%), DOIs (45.4%), journal names (37.6%), and URLs (37.4%).⁵

Median RHS by tool (0 = no hallucination, 11 = maximum):

ChatGPT 3.5: 11 (highest hallucination)
Bing: 11 (similar to ChatGPT)
Perplexity: 7 (mid‑range)
SciSpace: 1
Elicit: 1
Bard: no references generated, treated as failure cases⁵

These results suggest that research‑specialised tools (Elicit, SciSpace) can massively outperform general chatbots on citation accuracy, even when they’re powered by similar underlying models.⁵
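To make the idea of per-reference scoring concrete, here is a rough sketch of an RHS-style score. The weights below are placeholders chosen only so that the maximum equals 11; they are not the weights published by Aljamaan et al., and the sample references are invented for illustration.

```python
from statistics import median
from typing import Dict

# HYPOTHETICAL weights: not the published RHS weights, just a 0-11 scale.
ILLUSTRATIVE_WEIGHTS: Dict[str, int] = {
    "title": 2,
    "journal": 1,
    "authors": 2,
    "doi": 1,
    "date": 1,
    "url": 2,
    "relevance": 2,
}


def reference_score(field_errors: Dict[str, bool],
                    weights: Dict[str, int] = ILLUSTRATIVE_WEIGHTS) -> int:
    """Sum the weights of every field flagged as wrong (higher = worse)."""
    return sum(w for field, w in weights.items() if field_errors.get(field, False))


# Invented error flags for three references from one tool.
references = [
    {"title": False, "doi": True, "relevance": True},  # wrong DOI, off-topic
    {},                                                 # nothing flagged
    {field: True for field in ILLUSTRATIVE_WEIGHTS},    # everything wrong
]

scores = [reference_score(ref) for ref in references]
print(scores, "median:", median(scores))
```

Reporting the median per tool, as the study does, keeps a handful of fully fabricated references from dominating the summary.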

Visualising the numbers

The simple text‑based charts below summarise two headline results and need no JavaScript to render.

Hallucinated references by model (systematic review study)

Hallucinated references (share of AI-generated citations that were fabricated or seriously wrong)

GPT-4           ███████░░░░░░░░░░░░░░░░░░  28.6%
GPT-3.5         ██████████░░░░░░░░░░░░░░░  39.6%
Bard / Gemini   ███████████████████████░░  91.4%

Source: JMIR study on LLMs replicating systematic reviews for rotator cuff disease.¹

Support rates for GPT‑4o with web search (medical Q&A)

GPT‑4o (with RAG / web search) on medical questions

URL validity                █████████████████████████  ~100%
Statement-level support     ██████████████████░░░░░░░  70–76%
Response-level support      ████████████░░░░░░░░░░░░░  38–55%
(Bars for ranges are drawn at the midpoint of the range.)

On Reddit‑like, open‑ended questions, only 31–42% of GPT‑4o responses were fully supported by their citations.²
On more structured Mayo Clinic questions, response‑level support was closer to 80%.²

Source: SourceCheckup benchmark and HealthSearchQA subset.²
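Text bars of this kind can be generated with a small helper, which keeps every bar on the same scale. This is a minimal sketch; the function name and the 25-character width are arbitrary choices, and the percentages are the hallucination rates quoted above.

```python
def text_bar(percent: float, width: int = 25) -> str:
    """Render a percentage as a fixed-width bar of filled and empty blocks."""
    filled = round(width * percent / 100)
    filled = max(0, min(width, filled))  # clamp out-of-range inputs
    return "█" * filled + "░" * (width - filled)


rows = [("GPT-4", 28.6), ("GPT-3.5", 39.6), ("Bard / Gemini", 91.4)]
for label, pct in rows:
    print(f"{label:<16}{text_bar(pct)}  {pct:.1f}%")
```

Because every bar has the same total width, the filled portions can be compared directly by eye.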

Practical patterns across benchmarks

When you line these studies up, several consistent patterns emerge.

1. “Having citations” ≠ “being grounded”

Across benchmarks, models often return long lists of plausible‑looking URLs while still leaving many statements unsupported or even contradicted by the cited sources.¹²

SourceCheckup found that even when GPT‑4o and other models were forced to provide sources, between 50% and 90% of responses across models were not fully supported by those citations.²

2. Open‑domain chatbots struggle with rigorous retrieval

In systematic reviews, GPT‑3.5 and GPT‑4 achieved recall of only 11.9–13.7% against human gold‑standard bibliographies, and precision topped out at 13.4%, meaning most of what they returned was not in the actual review.¹

This aligns with Enago’s summary of a multi‑model study where only 26.5% of AI‑generated references were fully correct and nearly 40% were erroneous or fabricated.¹

3. Task and prompt style matter

SourceCheckup showed much higher citation support on structured questions derived from Mayo Clinic pages than on messy, user‑generated Reddit questions, where support rates dropped sharply.²

The RHS study similarly found that complex, scenario‑based prompts tended to produce significantly higher hallucination scores than simpler prompts, although the effect varied by topic.⁵

4. Specialised research tools can outperform general chatbots

While ChatGPT and Bing had the highest RHS (worst hallucination), SciSpace and Elicit scored a median RHS of just 1 on the same prompts, largely because they are wired directly into scholarly databases and tuned for conservative retrieval.⁵

That mirrors the broader industry trend: evaluation roundups list many retrieval‑centric and attribution‑centric benchmarks, and enterprise teams increasingly layer their own domain‑specific tests on top of public leaderboards.⁸³

Designing your own AI citation benchmark

If you’re a researcher, librarian, or content strategist, you don’t need a full Nat. Commun.‑scale study to benchmark citation behaviour in your own domain. The existing work gives you a template.⁵²

Here’s a pragmatic, minimal benchmark you can run:

Curate 20–50 questions or prompts.
- Mix structured, fact‑based questions (e.g., from guidelines or documentation pages) with real user queries from search logs or forums.²
Define a gold‑standard reference set.
- For each prompt, list 5–20 high‑quality sources (papers, docs, pages) that truly support correct answers. This mirrors the “systematic review gold standard” in the JMIR study.¹
Collect AI answers + citations.
- Ask each model to answer and require it to provide citations in a structured format (bulleted URLs or reference list).
Score URL validity and hallucination.
- Check each URL for HTTP 200 and non‑empty content.
- Use CrossRef, PubMed, or Google Scholar to confirm whether bibliographic entries exist and match titles, authors, and years, following the RHS methodology.⁵
Score statement‑ and response‑level support.
- Split answers into individual statements and, for each, ask: “Is this statement supported by at least one cited source?” following the SourceCheckup definitions.²
Track key metrics over time.
- URL validity, statement-level support, response-level support, and per‑reference RHS give you a compact dashboard for model comparisons.⁵²
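The URL validity step (HTTP 200 plus non-empty content) might look like the following sketch, using only the Python standard library. The `fetch` helper, the user-agent string, and the example URLs are assumptions made for illustration. The fetcher is injectable so the scoring logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Tuple

Fetcher = Callable[[str], Tuple[int, bytes]]


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Return (HTTP status, body); (0, b"") on network or URL errors."""
    request = urllib.request.Request(url, headers={"User-Agent": "citation-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:  # 4xx / 5xx responses
        return err.code, b""
    except (urllib.error.URLError, ValueError, OSError):
        return 0, b""


def url_validity(urls: Iterable[str], fetcher: Fetcher = fetch) -> float:
    """Share of URLs that return status 200 with non-empty content."""
    urls = list(urls)
    if not urls:
        return 0.0
    valid = 0
    for url in urls:
        status, body = fetcher(url)
        if status == 200 and body.strip():
            valid += 1
    return valid / len(urls)


# Offline demonstration with a stub fetcher and made-up URLs.
canned = {
    "https://example.org/real-page": (200, b"<html>content</html>"),
    "https://example.org/empty-page": (200, b""),
    "https://example.org/missing": (404, b""),
}
print(url_validity(canned, fetcher=lambda u: canned[u]))  # 1 of 3 valid
```

Calling `url_validity(urls)` without the `fetcher` argument performs real HTTP requests. Some sites block automated clients, so a failed check is a prompt for manual review, not proof that the page does not exist.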

If you want to go further, you can adopt ideas from PaperAsk and similar leaderboards to add reliability dimensions like consistency across re‑runs, robustness to slightly rephrased prompts, and sensitivity to ambiguous questions.⁹⁷
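One simple way to quantify consistency across re-runs is the mean pairwise Jaccard overlap of the citation sets a model returns for the same prompt. This metric is an assumption of this sketch, not necessarily how PaperAsk measures consistency, and the run data is invented.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty citation lists are trivially identical
    return len(a & b) / len(a | b)


def rerun_consistency(runs: List[Set[str]]) -> float:
    """Mean pairwise Jaccard overlap across repeated runs of one prompt."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented citation sets from three runs of the same prompt.
runs = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
print(f"{rerun_consistency(runs):.3f}")
```

A score near 1.0 means the model cites the same sources every time. A low score means the reference list depends heavily on sampling noise.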

Recommendations for using AI-generated citations safely

Given what the benchmarks show, here’s how to treat AI citations in practice.

For researchers and students

Use AI to brainstorm, not to source. Let the model suggest keywords, related concepts, and possible authors, but always run your actual literature search in primary databases such as PubMed, Web of Science, or Scopus.¹
Treat every AI citation as an unverified lead. Verify DOIs, titles, authors, journal names, and publication years in CrossRef or database search before citing anything.⁵¹
Check that the paper truly supports your claim. Benchmarks show many AI‑suggested papers are real but irrelevant: in the RHS study, 61.6% of references were off‑topic relative to the prompt.⁵
Document your search and verification steps. For systematic work, follow PRISMA or similar frameworks and clearly separate AI‑assisted brainstorming from human‑verified sourcing.¹
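Once you have looked a citation up in CrossRef or PubMed, the field-by-field comparison can be partly automated. The sketch below assumes both the AI citation and the database record are already available as dictionaries; the field names, the 0.9 title-similarity threshold, and the sample records are illustrative choices, not part of any cited methodology.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(text).lower().split())


def titles_match(cited: str, record: str, threshold: float = 0.9) -> bool:
    """Fuzzy title comparison; the threshold is an arbitrary starting point."""
    ratio = SequenceMatcher(None, normalise(cited), normalise(record)).ratio()
    return ratio >= threshold


def mismatched_fields(cited: Dict, record: Dict) -> List[str]:
    """List the bibliographic fields where the citation disagrees with the record."""
    problems = []
    if not titles_match(cited.get("title", ""), record.get("title", "")):
        problems.append("title")
    for field in ("first_author", "year", "doi"):
        if normalise(cited.get(field, "")) != normalise(record.get(field, "")):
            problems.append(field)
    return problems


# Made-up citation and database record (placeholders, not real papers).
ai_citation = {"title": "An Example Trial of Shoulder Repair", "first_author": "Doe",
               "year": 2019, "doi": "10.0000/example.1"}
db_record = {"title": "An example trial of shoulder repair", "first_author": "Doe",
             "year": 2018, "doi": "10.0000/example.1"}

print(mismatched_fields(ai_citation, db_record))
```

An empty result only shows that the reference exists as cited. Whether the paper supports the claim still needs a human read.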

For content marketers and SEO teams

Never paste AI citations into content without a manual check. With reference hallucination rates of roughly 29–40% for GPT models in the systematic review study (and over 90% for Bard), “auto‑citations” are a legal and reputational risk.¹
Prefer RAG‑based or research‑specialised tools for source discovery. Benchmarks show tools like Elicit and SciSpace dramatically reduce bibliographic hallucination, especially when you constrain them to your niche.⁵
Standardise a verification checklist. At minimum: visit every URL, check that the page exists, confirm that it states what your copy claims, and ensure it’s from a credible domain.⁵²
Build internal “citation benchmarks” as a governance tool. Periodically sample AI‑authored content, score it against a simple SourceCheckup‑style framework, and feed the results back into your editorial guidelines.²

For teams building AI products

Log and audit citations as first‑class telemetry.
- Track URL validity, support rates, and RHS‑style scores as core product metrics, not afterthoughts.⁵²
Use LLM‑as‑a‑judge cautiously—but do use it.
- Both SourceCheckup and RHS successfully used strong LLMs to automate much of the scoring, with 87–89% agreement with experts.⁵²
Combine retrieval, training, and editing.
- SourceCheckup’s SourceCleanup agent was able to remove or re‑write about 90.7% of unsupported statements so that they became supported by the original sources, showing that post‑editing workflows can measurably improve faithfulness.²
Expose uncertainty and limitations to users.
- Most benchmarks conclude that current models should not be used as the primary or exclusive tool for tasks like systematic reviews; your UX and documentation should reflect that.¹
