AI Hallucinations in Medical Scribes: The Risk Nobody Should Ignore
Every AI medical scribe demo looks flawless. The risk that decides whether a scribe is safe to use isn't in the demo — it's the hallucination: the moment the model writes something into a clinical note that the clinician never said and the patient never reported. This guide explains what that actually means in ambient documentation, why it's more dangerous here than almost anywhere else AI is used, and how to test for it before a tool touches a real patient record.
What "hallucination" means in a clinical note
In general AI, a hallucination is a confident, fluent statement that is simply false. In an ambient scribe it takes specific, dangerous forms:
- Invented content — a symptom, medication, dose, allergy, or examination finding that was never discussed appears in the note because it's statistically plausible for that kind of visit.
- Confabulated specifics — the conversation said "blood pressure was a bit high"; the note says "BP 158/96." A number that sounds clinical but was never measured.
- Misattributed negatives/positives — "denies chest pain" becomes "reports chest pain," or a normal finding is recorded for a system that was never examined.
- Template bleed — the model fills a familiar template section with the typical content for that template rather than what happened.
- Silent omission — less discussed, equally dangerous: the model drops a stated red-flag symptom because it didn't fit the generated structure.
The common thread: the output is fluent and clinically convincing, which is exactly what makes it hard to catch on a tired end-of-day review.
Why it's worse in documentation than almost anywhere else
A hallucinated answer in a consumer chatbot is annoying. A hallucinated line in a clinical note is a different category of problem:
- It becomes the legal record. Once signed, the note is the medico-legal account of the encounter. An invented finding is now evidence.
- It propagates. The note flows into the problem list, the discharge summary, the referral letter, the next clinician's mental model, and sometimes the billing code. One fabricated detail can be inherited by every downstream decision.
- Review fatigue is real. The entire value proposition of a scribe is that you don't re-type the note. That same convenience reduces the scrutiny each line gets — the error surface and the safety control are in tension by design.
- Plausibility defeats spot-checking. Humans catch implausible errors. Scribe hallucinations are, by construction, plausible.
This is why "the AI is 95% accurate" is the wrong frame. The question is not the average — it's what the 5% looks like and whether you'll catch it.
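A rough back-of-envelope calculation shows why the average hides the risk. The sketch below assumes, purely for illustration, that "95% accurate" means each statement in a note is independently correct with probability 0.95. Real errors are not independent, and vendors rarely define the figure this way. The function name and the statement counts are hypothetical.

```python
def prob_note_has_error(per_statement_accuracy: float, statements: int) -> float:
    """Chance that a note contains at least one wrong statement.

    Assumes (hypothetically) that errors are independent per statement.
    """
    return 1.0 - per_statement_accuracy ** statements


# Illustrative note lengths only; not measured from any real tool.
for n in (5, 20, 40):
    p = prob_note_has_error(0.95, n)
    print(f"{n} statements: {p:.0%} chance of at least one error")
```

Under that assumption, a 20-statement note has roughly a 64% chance of containing at least one wrong statement, so a headline accuracy figure says little about how often a clinician will meet an error during review.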
How serious vendors mitigate it
Good tools treat hallucination as a primary design constraint, not a footnote. Approaches that matter:
- Grounding / extractive bias — generating notes that stay close to what was actually said rather than freely paraphrasing into clinical prose.
- "No fabrication by design" — some vendors, such as Playback Health, explicitly position the product around not inferring or inventing clinical content.
- Traceability — linking statements in the note back to the moment in the transcript that supports them, so review is fast and verifiable.
- Conservative defaults — preferring to omit and flag uncertain content rather than confidently assert it (and then surfacing those gaps to the clinician).
- Human-in-the-loop, by design — making the review step structural, not optional.
What none of this removes: the clinician remains accountable for the signed note. Every credible vendor says this; so do we.
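To make the traceability idea concrete, here is a minimal sketch of what a note with transcript links could look like as data. It is not any vendor's implementation: the class names, fields, and the `unsupported` helper are all hypothetical, and the hard part (deciding which transcript segment supports which statement) is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptSegment:
    start_s: float  # offset into the recording, in seconds
    speaker: str    # e.g. "clinician" or "patient"
    text: str


@dataclass
class NoteStatement:
    text: str
    # Indices into the transcript list that support this statement.
    evidence: list[int] = field(default_factory=list)


def unsupported(statements: list[NoteStatement]) -> list[NoteStatement]:
    """Return statements with no linked transcript evidence, for review."""
    return [s for s in statements if not s.evidence]


transcript = [
    TranscriptSegment(12.0, "patient", "No chest pain, but I get breathless on stairs."),
    TranscriptSegment(47.5, "clinician", "Blood pressure was a bit high today."),
]
note = [
    NoteStatement("Denies chest pain.", evidence=[0]),
    NoteStatement("Exertional dyspnoea on stairs.", evidence=[0]),
    NoteStatement("BP 158/96.", evidence=[]),  # a number nobody said
]

for s in unsupported(note):
    print("NEEDS REVIEW:", s.text)
```

The point of the structure is that a statement with an empty evidence list is visible as such, so the reviewer's attention goes to the lines most likely to be invented rather than being spread evenly across the note.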
How we test for hallucination
This is the single hardest thing to learn from a vendor's website — which is exactly why we don't try to. On CompareScribes, clinical precision and note quality are scored by hands-on testing, not from spec sheets (see our methodology). When we test, hallucination is the thing we hunt for specifically:
- We run realistic encounters with deliberately ambiguous, negative, and omitted details.
- We check whether numbers, medications and findings in the note were actually said.
- We look for the quiet failure modes — misattributed negatives and dropped red flags — not just obvious invention.
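The check on numbers can be partly mechanised. The following is a minimal sketch, not a description of our actual tooling: it assumes you have the draft note and a text transcript of the same encounter, and the function name `unspoken_numbers` is hypothetical. It only matches digits, so a number transcribed as words ("one fifty-eight over ninety-six") will be flagged even though it was said. Treat the output as a list of lines to read closely, not as a verdict.

```python
import re

# Matches integers and simple decimals such as "5" or "37.2".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unspoken_numbers(note: str, transcript: str) -> list[str]:
    """Numbers that appear in the note but nowhere in the transcript."""
    said = set(NUMBER.findall(transcript))
    return [n for n in NUMBER.findall(note) if n not in said]


transcript = "Blood pressure was a bit high. Continue amlodipine 5 mg daily."
note = "BP 158/96. Amlodipine 5 mg once daily."

print(unspoken_numbers(note, transcript))
```

Here the dose survives the check because "5" was spoken, while both halves of the blood pressure reading are flagged because no such value appears in the transcript.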
What we found varies more than vendors admit. Most tools are good but occasionally get a specific wrong: an incorrect dose, a missing allergy, a treatment the patient never received. A few are noticeably tighter than others, and the gap shows up most in dense specialist consultations, where the audio carries more named entities than a polite paraphrase can preserve. No tool should be trusted to be hallucination-free; the right posture is "trust, but always verify the signed note." Our editorial scores draw on that hands-on testing, and the per-tool reviews flag the failure modes we observed, but the only safe operating model is independent verification by the signing clinician on every encounter.
A practical checklist before you trust a scribe
- Run adversarial trials. In your trial, deliberately leave out something the tool would expect to hear, and mention something else only vaguely. Then read what the note claims.
- Check the numbers. Any vital, dose or lab value in the draft: was it actually stated? If the tool invents numbers, stop there.
- Test negatives. Say "no chest pain, no shortness of breath" and confirm the note doesn't flip a negative.
- Test omission. Mention one red-flag symptom briefly and see whether it survives into the note.
- Ask the vendor the direct question — "What does your product do when it's unsure: invent, omit, or flag?" The answer is revealing.
- Keep the human review structural. Whatever you buy, never sign an unread note. The scribe removes typing, not responsibility.
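If you run several scripted trials, the negative and omission tests above can be given a crude first pass in code. This is a sketch under stated assumptions: you know which symptoms your script negated and which red flag it mentioned, the function name `check_trial` is hypothetical, and the matching is plain keyword search. It will miss paraphrases (a note that writes "dyspnoea" for "shortness of breath" would be reported as a dropped red flag), so it supplements reading the note and does not replace it.

```python
import re


def check_trial(note: str, negated: list[str], red_flags: list[str]) -> list[str]:
    """Compare a draft note against what a scripted trial encounter contained."""
    problems = []
    text = note.lower()
    for symptom in negated:
        # The script stated a negative; flag the note if it asserts the symptom.
        pattern = rf"\b(reports|complains of|has)\s+{re.escape(symptom)}"
        if re.search(pattern, text):
            problems.append(f"negative flipped: {symptom}")
    for flag in red_flags:
        # The script stated a red flag; flag the note if the term is absent.
        if flag not in text:
            problems.append(f"red flag dropped: {flag}")
    return problems


note = "Patient reports chest pain. Denies shortness of breath."
problems = check_trial(
    note,
    negated=["chest pain", "shortness of breath"],
    red_flags=["night sweats"],
)
print(problems)
```

In this example the script said "no chest pain" and briefly mentioned night sweats, so the note fails on both counts: one flipped negative and one dropped red flag.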
Bottom line
Hallucination is the defining safety question for AI scribes, and it's invisible in a sales demo. Evaluate it deliberately, with adversarial trials, on your own encounters — and weight it heavily. It's why we test note quality by hand rather than trusting marketing copy. Start from the full ranking, read each tool's verdict and tested score, and trial your shortlist with the checklist above before a single real patient note depends on it.