I am pasting the report on the study below as it came to my email, since I don’t see it on the https://www.statnews.com/ website. But I found it valuable for insight into my expectations, and my doctor’s, around the use of “scribe technology” when I visit my doctor.
AI scribe studies: Does a “happy doctor” = good care?
Ambient AI medical scribes are the hottest tool in health care AI. The pitch is alluring: Instead of typing away at a computer during a patient visit, a clinician can turn their full attention to the patient while an AI tool listens, records, and summarizes the conversation, reducing the amount of time that the clinician spends documenting the appointment.
It’s the perfect use case: It plays to AI’s strengths in transcribing audio and summarizing text, decreases clinician burnout, improves the patient experience, and offers opportunities to boost revenue by fitting in more patient appointments and supporting upcoding through better documentation. And it doesn’t affect clinical decision-making or outcomes, thus avoiding FDA regulation and enabling quick adoption. A win-win-win.
But…are all of those things actually true?
Last week, a group of researchers at the University of Pennsylvania published a study in JAMA Network Open examining whether AI ambient scribes actually decrease clinical note burden. The study followed 46 clinicians at UPenn’s health system who used Microsoft-owned Nuance’s DAX Copilot AI ambient scribe for 5 weeks in spring 2024.
You can read the whole paper for the detailed breakdown, but the study combined electronic medical record tracking data with a clinician survey to determine both quantitatively and qualitatively whether the AI tool saved clinicians time. (It did: about two minutes per note and 15 minutes per day of after-hours “pajama time,” a far cry from the Nuance-endorsed statistic of a two-hour reduction in pajama time that presenters repeated in a Microsoft-sponsored Medical Group Management Association webinar in January.)
There’s one qualitative finding from the UPenn study I’d especially like to point out: The researchers found that “the need for substantial editing and proofreading of the AI-generated notes, which sometimes offset the time saved” was a recurring theme in the clinicians’ comments. The product received a net promoter score of 0 on a scale of -100 to 100, with an equal number of people recommending (13) and not recommending (13) the product, and the rest of the survey respondents (11) responding passively.
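For readers unfamiliar with how a score of 0 falls out of those counts, here is a minimal sketch of the standard net promoter score arithmetic (percent of promoters minus percent of detractors), using the respondent counts reported above; the function and variable names are mine for illustration, not the study’s.

```python
def net_promoter_score(promoters: int, detractors: int, passives: int) -> float:
    """Standard NPS: percent promoters minus percent detractors, on a -100 to 100 scale."""
    total = promoters + detractors + passives
    return 100 * (promoters - detractors) / total

# Counts reported in the UPenn survey: 13 would recommend, 13 would not, 11 passive.
print(net_promoter_score(promoters=13, detractors=13, passives=11))  # 0.0
```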
Some clinicians in the study commented that they were entirely satisfied with the notes: “I quickly became comfortable that it would capture all critical elements of the conversation.”
But some were dissatisfied with the tool’s level of accuracy: “It tries to paraphrase the conversation, and often does it in a way that utilizes layman’s terms rather than medical terms; and often incorrectly documents what was discussed. This means that I must edit the content substantially because it cannot be used as-is in my closed note.”
The only ways to reconcile these two different experiences are: 1) The clinicians have different definitions of what’s acceptably accurate, 2) The AI tool is performing differently for different people, or 3) Some clinicians are checking the output carefully and some are not. To me, either of the latter two (very likely) explanations is alarming and shows the need for studies looking at the accuracy of the output of AI scribes, not just whether physicians and patients are happier using the tool.
But the burnout and efficiency studies are the ones people are calling for — just a few weeks ago, Tina Shah argued in a Health Affairs Forefront article that we need more studies about whether AI tools actually decrease provider burnout. Why? As disclosed at the very end of the article, Shah is the chief clinical officer of Abridge, one of the biggest AI ambient scribe companies, which only stands to gain from claims of efficiency and burnout reduction.
While we do need those studies, have you heard similar calls for studies of AI scribe accuracy and safety? MedStar Health Research Institute recently conducted an anemic version of such a study: researchers transcribed 11 real patient encounters, de-identified them, had staff re-enact them for two different ambient scribes (the exact products were not identified in the study), and concluded that there were “frequent errors,” often of omission.
These studies aren’t just pandering to overly worried patients — doctors should be calling for these quantitative studies too because they are betting their livelihoods on AI being accurate every time they click “OK” without thoroughly checking. In an invited commentary that was co-published with the UPenn study, Harvard bioethicist and health law expert I. Glenn Cohen and colleagues lay out legal and ethical issues with medical AI scribe tools. They echo an FDA advisory committee’s conclusion that these AI tools aren’t without risk to patients: “The possibility of that product hallucinating can present the difference between summarizing a health care professional’s discussion with a patient and providing a new diagnosis that was not raised during the interaction.”
The electronic health record is of paramount importance in any medical malpractice case, write Cohen and colleagues. The law traditionally holds clinicians responsible for the accuracy of patient records, and it’s all too easy, as the clinician in the UPenn study said, to “quickly [become] comfortable” that the AI-generated note is accurate and not carefully check the record every time. The difference between a “point 5 milligram” and “5 milligram” prescription is small to an ambient listening tool but big to a physician who doesn’t catch the mistake.
To date, I’ve not seen any large-scale, independent studies of the accuracy of these AI scribes, despite the fact that large health systems with the ability to do these studies, like Mass General Brigham and Cleveland Clinic, have piloted multiple scribes and chosen one product over another. Let me know if I’m wrong about the lack of studies or if you want to talk about your health system’s experience (reply to this email or hit up aiprognosis@statnews.com).
Proving that these tools provide a return on investment and decrease doctor burnout is important. But accuracy and safety often get lost in the kind of math that clinicians and consultants presented in that Microsoft-sponsored call about DAX: “Happy doctor equals happy patient, happy patient equals good patient experience and good care, right?”