Researchers at Duke University are proposing a new framework for evaluating artificial intelligence scribing tools that combines human review with automated evaluation. The tools are widely deployed, but there is no shared framework for assessing them.
AI scribe companies have announced eye-popping funding rounds in June. French company Nabla raised a $70 million series C. Abridge raised a $300 million series E round, and Commure announced a $200 million raise. Other AI scribe startups, Ambience and Suki, each raised $70 million in 2024.
Despite the venture dollars pouring into a technology that promises to relieve staff burnout, health systems lack a standard way to evaluate and oversee these tools once they are deployed.
Most healthcare delivery organizations rely on human reviewers to assess the performance of AI scribes, the Duke researchers wrote in a paper published in npj Digital Medicine. Human review is essential for understanding the nuances of clinical workflows and care delivery, but the approach is time-intensive, expensive and subjective.
Some systems also use automated evaluations to test the accuracy of speech recognition and to compare computer-generated texts to human-generated ones.
The existing automated evaluations for AI scribes include ROUGE, which compares an automatically generated text against a reference text; word error rate (WER), which measures the accuracy of automatic speech recognition; and F1 scores, which combine precision and recall into a single metric.
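For readers who want a concrete sense of what these metrics measure, here is a minimal, illustrative Python sketch. It is not from the Duke paper, and the sample transcript strings and counts are hypothetical; it simply shows how WER and F1 are typically computed.

```python
# Illustrative sketch of two standard automated metrics: word error rate (WER)
# for speech recognition and an F1 score combining precision and recall.
# The example strings and counts below are hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard word-level edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# One misheard word out of four gives a WER of 0.25.
print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
# Precision and recall of 0.8 each give an F1 of 0.8.
print(f1_score(true_positives=8, false_positives=2, false_negatives=2))
```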
While these tests can be useful, they aren't tailored to assess an algorithm's performance in the context of clinical workflows, the paper says. Further, their scores correlate only weakly with the conclusions of human reviewers.
The Duke researchers combined the approaches, and added a few other techniques for good measure, to create their evaluation and governance method, called SCRIBE. They hope it will allow hospitals and health systems to more easily compare commercial AI scribe tools and effectively evaluate their performance over time.
“The design is guided by the principle that no single method can comprehensively capture all performance dimensions,” the paper says.
To test the usefulness of the SCRIBE evaluation framework, the researchers built an ambient dictation scribe (ADS) tool in-house and deployed it across 40 clinical visits. The AI scribe was built using the publicly available models Whisper Large v3 Turbo and GPT-4o, so the system was transparent to the researchers and could serve in the future as a benchmark for testing commercial products.
The study used the AI scribe to transcribe 40 clinician-patient interactions about smoking cessation in pregnant women and to generate SOAP notes for the encounters. Then, the researchers used four techniques to evaluate the quality of the notes.
Human reviewers compared the AI-generated transcripts to the audio of the encounters, with a focus on misheard words and missing or extraneous information. To review the SOAP note, the evaluators used a standard rubric that assessed a slew of dimensions.
The study notes that human reviewers had a bias toward factuality and prudence. The two human reviewers agreed 53% of the time on the quality of the note, which was expected given the complexity of the cases.
The study also explored the use of large language models to decrease the workload of human reviewers. When assessing quality dimensions of the notes, the LLMs showed strong correlation with human reviewers on relevance and completeness; agreement was weakest on coherence and comprehension.
The paper lays out one way it leveraged LLMs to reduce manual labor: “Instead of having human evaluators manually extract and verify all facts, we used GPT-4o to automatically extract key facts from the transcript, reference medical notes, and AI-generated notes. Again, we used GPT-4o to determine which key facts appear in both the reference and generated notes (LINK) and whether those linked facts retain their intended meaning (CORRECT).”
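As a rough illustration of that workflow, here is a minimal Python sketch. It is not the authors' implementation; the prompts, helper functions and JSON output shape are assumptions made for the example, and only the OpenAI chat completions call itself is a real API.

```python
# Sketch of LLM-assisted fact extraction and linking, loosely following the
# quoted description. Prompts and JSON structure are illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_gpt(prompt: str) -> str:
    """Send a single prompt to GPT-4o and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def extract_key_facts(transcript: str) -> list[str]:
    """Step 1: pull discrete clinical facts out of the transcript."""
    reply = ask_gpt(
        "List each discrete clinical fact in this transcript as a JSON array "
        f"of short strings.\n\nTranscript:\n{transcript}"
    )
    return json.loads(reply)


def link_and_check(fact: str, reference_note: str, generated_note: str) -> dict:
    """Steps 2-3: does the fact appear in both notes (LINK), and does the
    generated note preserve its intended meaning (CORRECT)?"""
    reply = ask_gpt(
        "Answer as JSON with boolean keys 'link' and 'correct'.\n"
        f"Fact: {fact}\nReference note:\n{reference_note}\n"
        f"Generated note:\n{generated_note}"
    )
    return json.loads(reply)
```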
For transcription accuracy and diarization accuracy, the researchers used an expanded set of auto evaluation techniques, including ROUGE, WER and F1 scores.
To assess the accuracy of the medical note, the researchers used human reviewers, auto evaluation, trained auto evaluation and LLMs. They also used the combined approach to assess fluency, coherence, clarity, brevity, structuring, relevance, completeness and factuality of the AI-generated medical notes.
Simulation review allows teams to test edge-case scenarios that would be underrepresented in natural data sets. Examples in the paper include the diagnosis of a rare disease and a newly developed pharmaceutical product.
In one simulation that fed the scribe nonsensical clinical values, the scribe failed to flag the nonsensical information and entered it into the note in 60% of cases. In some cases, the scribe automatically changed the values to make sense but did not alert end users. In 4% of cases, the scribe flagged the nonsensical value.
The reviewers also used simulation to assess stereotype bias, fairness and adversarial robustness.
“Using SCRIBE, we found that our internally developed ADS tool generally performs well across multiple metrics and human assessments, with particular strengths in clarity, completeness, and relevance,” the paper says.
The researchers emphasize the need to maintain a human in the loop to evaluate the performance of ADS tools. The paper also noted that the framework was developed using principles set out by the Coalition for Health AI (CHAI).
Many of the researchers are involved with the nonprofit AI group. CHAI has a work group for AI scribes, and it may use AI scribes as a first use case for its yet-to-be-released outcomes website, where health systems would give feedback on AI vendor products.
In the future, the researchers hope to test the framework on a commercially available ADS tool with an AI scribe vendor. They also plan to conduct a multisite study with several ADS products to systematically compare them and assess their impact on patient care.
“According to human evaluation, auto evaluation and LLM evaluation, GPT-based notes outperformed those generated by LLaMA, underscoring the capacity of the proposed evaluation framework in detecting differences in ADS products,” the paper says.
The paper suggests the framework could allow health systems to run head-to-head comparisons of the 50-plus AI scribe vendors on the market.