Researchers at Duke University are proposing a new framework for evaluating artificial intelligence scribing tools that combines human review with automated evaluation. The tools are widely deployed, but there is no shared framework for assessing them.
AI scribe companies have announced eye-popping funding rounds in June. French company Nabla raised a $70 million series C. Abridge raised a $300 million series E round, and Commure announced a $200 million raise. Other AI scribe startups, Ambience and Suki, each raised $70 million in 2024.
Despite the venture dollars pouring into a technology that promises to relieve staff burnout, health systems lack a standard way to evaluate and oversee these tools once they are deployed.
Most healthcare delivery organizations rely on human reviewers to assess the performance of AI scribes, the Duke researchers wrote in a paper published in npj Digital Medicine. Human review is essential for understanding the nuances of clinical workflows and care delivery, but the approach is time-intensive, expensive and subjective.
Some systems also use automated evaluations to test the accuracy of speech recognition and to compare computer-generated texts to human-generated ones.
The existing automated evaluations for AI scribes include ROUGE, which compares an automatically generated text against a reference text; word error rate (WER), which measures the accuracy of automatic speech recognition; and F1 scores, which combine precision and recall into a single metric.
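For readers who want a concrete sense of what these metrics measure, here is a minimal, illustrative Python sketch. It is not from the Duke paper, and the sample transcript strings and counts are hypothetical; it simply shows how WER and F1 are typically computed.

```python
# Illustrative sketch of two standard automated metrics: word error rate (WER)
# for speech recognition and an F1 score combining precision and recall.
# The example strings and counts below are hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard word-level edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# One misheard word out of four gives a WER of 0.25.
print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
# Precision and recall of 0.8 each give an F1 of 0.8.
print(f1_score(true_positives=8, false_positives=2, false_negatives=2))
```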
While these tests can be useful, they aren't tailored to assess an algorithm's performance in the context of clinical workflows, the paper says. Further, their scores correlate only weakly with the conclusions of human reviewers.
The Duke researchers combined the approaches, and added a few other techniques for good measure, to create their evaluation and governance method, called SCRIBE. They hope it will allow hospitals and health systems to more easily compare commercial AI scribe tools and effectively evaluate their performance over time.
“The design is guided by the principle that no single method can comprehensively capture all performance dimensions,” the paper says.
To test the usefulness of the SCRIBE evaluation framework, the researchers built an ambient dictation scribe (ADS) tool in-house and deployed it across 40 clinical visits. The AI scribe was built using the publicly available models Whisper Large v3 Turbo and GPT-4o, so the system was transparent to the researchers and could serve in the future as a benchmark for testing commercial products.
The study used the AI scribe to transcribe 40 clinician-patient interactions about smoking cessation in pregnant women and to generate SOAP notes for the encounters. Then, the researchers used four techniques to evaluate the quality of the notes.
Human reviewers compared the AI-generated transcripts to the audio of the encounters, with a focus on misheard words and missing or extraneous information. To review the SOAP note, the evaluators used a standard rubric that assessed a slew of dimensions.
The study notes that human reviewers had a bias toward factuality and prudence. The two human reviewers agreed 53% of the time on the quality of the note, which was expected given the complexity of the cases.
The study also explored the use of large language models to decrease the workload of human reviewers. When assessing quality dimensions of the notes, the LLMs showed strong correlation with human reviewers on relevance and completeness; agreement was weakest on coherence and comprehension.
The paper lays out one way it leveraged LLMs to reduce manual labor: “Instead of having human evaluators manually extract and verify all facts, we used GPT-4o to automatically extract key facts from the transcript, reference medical notes, and AI-generated notes. Again, we used GPT-4o to determine which key facts appear in both the reference and generated notes (LINK) and whether those linked facts retain their intended meaning (CORRECT).”
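As a rough illustration of that workflow, here is a minimal Python sketch. It is not the authors' implementation; the prompts, helper functions and JSON output shape are assumptions made for the example, and only the OpenAI chat completions call itself is a real API.

```python
# Sketch of LLM-assisted fact extraction and linking, loosely following the
# quoted description. Prompts and JSON structure are illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_gpt(prompt: str) -> str:
    """Send a single prompt to GPT-4o and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def extract_key_facts(transcript: str) -> list[str]:
    """Step 1: pull discrete clinical facts out of the transcript."""
    reply = ask_gpt(
        "List each discrete clinical fact in this transcript as a JSON array "
        f"of short strings.\n\nTranscript:\n{transcript}"
    )
    return json.loads(reply)


def link_and_check(fact: str, reference_note: str, generated_note: str) -> dict:
    """Steps 2-3: does the fact appear in both notes (LINK), and does the
    generated note preserve its intended meaning (CORRECT)?"""
    reply = ask_gpt(
        "Answer as JSON with boolean keys 'link' and 'correct'.\n"
        f"Fact: {fact}\nReference note:\n{reference_note}\n"
        f"Generated note:\n{generated_note}"
    )
    return json.loads(reply)
```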
For transcription accuracy and diarization accuracy, the researchers used an expanded set of auto evaluation techniques, including ROUGE, WER and F1 scores.
To assess the accuracy of the medical note, the researchers used human reviewers, auto evaluation, trained auto evaluation and LLMs. They also used the combined approach to assess fluency, coherence, clarity, brevity, structuring, relevance, completeness and factuality of the AI-generated medical notes.
Simulation review allows teams to test edge-case scenarios that would be underrepresented in natural data sets. Examples in the paper include the diagnosis of a rare disease and a newly developed pharmaceutical product.
In one simulation that fed the scribe nonsensical clinical values, the scribe failed to flag the nonsensical information and entered it into the note in 60% of cases. In some cases, the scribe automatically changed the values to make sense but did not alert end users. In 4% of cases, the scribe flagged the nonsensical value.
The reviewers also used simulation to assess stereotype bias, fairness and adversarial robustness.
“Using SCRIBE, we found that our internally developed ADS tool generally performs well across multiple metrics and human assessments, with particular strengths in clarity, completeness, and relevance,” the paper says.
The researchers emphasize the need to maintain a human in the loop to evaluate the performance of ADS tools. The paper also noted that the framework was developed using principles set out by the Coalition for Health AI (CHAI).
Many of the researchers are involved with the nonprofit AI group. CHAI has a work group for AI scribes, and it may use AI scribes as a first use case for its yet-to-be-released outcomes website, where health systems would give feedback on AI vendor products.
In the future, the researchers hope to test the framework on a commercially available ADS tool with an AI scribe vendor. They also plan to conduct a multisite study with several ADS products to systematically compare them and assess their impact on patient care.
“According to human evaluation, auto evaluation and LLM evaluation, GPT-based notes outperformed those generated by LLaMA, underscoring the capacity of the proposed evaluation framework in detecting differences in ADS products,” the paper says.
The paper suggests the framework could allow health systems to run head-to-head comparisons of the 50-plus AI scribe vendors on the market.