OpenAI pushes further into healthcare with release of HealthBench to evaluate AI models

OpenAI, the maker of ChatGPT, released an open-source benchmark designed to measure the performance and safety of large language models in healthcare.

The large data set, called HealthBench, goes beyond exam-style queries and tests how well artificial intelligence models perform in realistic health scenarios, based on what physician experts say matters most, the company said in a blog post Monday.

"Improving human health will be one of the defining impacts of AGI. If developed and deployed effectively, large language models have the potential to expand access to health information, support clinicians in delivering high-quality care, and help people advocate for their health and that of their communities," the company wrote in the post.

"Evaluations are essential to understanding how models perform in health settings," company executives said in the post. "Significant efforts have already been made across academia and industry, yet many existing evaluations do not reflect realistic scenarios, lack rigorous validation against expert medical opinion, or leave no room for state-of-the-art models to improve."

OpenAI has published a paper detailing HealthBench and released the benchmark's code.

The evaluation framework was built in partnership with 262 physicians who have practiced in 60 countries, the company said.

HealthBench includes 5,000 realistic health conversations and grades model responses against physician-written rubrics that assess safety, appropriateness and accuracy.

The conversations in HealthBench simulate interactions between AI models and individual users or clinicians, the company said, and they were produced via both synthetic generation and human adversarial testing. The conversations were "created to be realistic and similar to real-world use of large language models: they are multi-turn and multilingual, capture a range of layperson and healthcare provider personas, span a range of medical specialties and contexts, and were selected for difficulty," OpenAI said.

HealthBench evaluates model responses against 48,562 unique rubric criteria spanning several health contexts and behavioral dimensions, such as accuracy, instruction following and communication.
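For illustration only, the sketch below shows one way a HealthBench example could be represented in code: a multi-turn conversation paired with physician-written rubric criteria, each tagged with a behavioral dimension and a point value. The field names are assumptions chosen for readability, not OpenAI's published schema.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One physician-written criterion a model response is judged against."""
    text: str    # e.g., "Advises the user to seek emergency care for chest pain" (hypothetical)
    points: int  # weight of this criterion toward the overall score
    axis: str    # behavioral dimension, e.g., "accuracy" or "communication"

@dataclass
class HealthBenchExample:
    """A multi-turn conversation plus the rubric used to grade the model's reply."""
    conversation: list[dict[str, str]]  # [{"role": "user", "content": "..."}, ...]
    theme: str                          # e.g., "emergency situations" or "global health"
    rubric: list[RubricCriterion]
```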

A model-based grader checks whether each rubric criterion is met by a model's response, and the response receives an overall score: the total points for the criteria it meets relative to the maximum possible points.
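Continuing the sketch above, a minimal version of that scoring step might look like the following. The `grader` callable is a stand-in for the model-based judge, and the aggregation simply follows the description here (points for met criteria over the maximum possible); OpenAI's released implementation may handle weighting and edge cases differently.

```python
from typing import Callable

def grade_response(
    response: str,
    example: HealthBenchExample,
    grader: Callable[[str, list[dict[str, str]], str], bool],
) -> float:
    """Score one model response against its physician-written rubric.

    `grader` is assumed to be an LLM-backed judge that returns True when a
    single rubric criterion is satisfied by the response in context.
    """
    earned = sum(
        c.points
        for c in example.rubric
        if grader(response, example.conversation, c.text)
    )
    max_possible = sum(c.points for c in example.rubric)
    # Overall score: total points for criteria met relative to the maximum possible.
    return earned / max_possible if max_possible else 0.0
```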

HealthBench conversations are split into seven themes, such as emergency situations, handling uncertainty or global health. Each theme contains its own rubric for grading.

Karan Singhal, who runs OpenAI's health AI team, said in a LinkedIn post that HealthBench was developed for two audiences: the AI research community to "shape shared standards and incentivize models that benefit humanity" and healthcare organizations to provide "high-quality evidence, towards a better understanding of current and future use cases and limitations."

OpenAI said HealthBench was developed to evaluate AI systems in health with several core principles in mind. For one, the scores should reflect real-world impact, the company said. "This should go beyond exam questions to capture complex, real-life scenarios and workflows that mirror the ways individuals and clinicians interact with models," OpenAI said in the post.

Evaluations should reflect the standards and priorities of healthcare professionals, providing a rigorous foundation for improving AI systems, the company said. And benchmarks should support continued progress: "Current models should show substantial room for improvement, offering model developers incentives to continuously improve performance," the company noted.

Ethan Goh, M.D., executive director of Stanford AI Research and Science Evaluation, told Fierce Healthcare that HealthBench is a step in the right direction toward advancing the evaluation of healthcare AI performance.

"Many prior benchmarks (e.g. MedQA, MultiMedQA, MedMCQA, USMLE) rely on multiple choice questions, often taken from doctor exams. These are now saturated and less useful for measuring improvement (ie, AI models are scoring close to 100%)," Goh wrote in a LinkedIn post. "HealthBench addresses this gap with a benchmark for task-level evaluation, covering patient and clinician use cases." 

"Many industry players were already using their models for various healthcare applications, and frankly not doing a great job with robust evaluation of AI responses, since in a rush to deploy a working prototype, which can be incredibly high stakes if for a consumer- or provider-facing use case," Goh said. "So this does help fill the gap somewhat."

Justin Norden, CEO and co-founder of Qualified Health, applauded the development of an open-source benchmark that measures the performance of state-of-the-art models on medical-specific tasks.

“A huge barrier to adoption of these technologies in the medical setting is showing evidence that these tools can actually perform the task. Clinicians in the healthcare system are rightly skeptical of these tools and new technologies, as by design it’s a risk-averse field, and so it's really the onus of the community to be able to show and track the performance of these tools,” said Norden, who is an adjunct professor at Stanford Medicine in the Department of Biomedical Informatics Research.

He added, “That's what's so exciting about having more companies contribute to this work and doing so in an open-source way so that others can verify it and use this as well. Zooming out looking at these fields we have, on the one hand, AI as the fastest-adopted technology in human history, where performance is continuing to improve at least on a month-by-month basis, and on the other hand, we have arguably one of the slowest-moving industries ever, especially with regards to technology, that has seen minimal adoption over the past two decades.”

OpenAI’s benchmark and others like it are key to demonstrating the performance of AI tools and building trust, which will be essential to driving adoption, Norden noted.

“Right now, in healthcare, arguably we already have the technology now to fundamentally change what it looks like. The key barrier is going to be adoption, and it's great to see more tools like this come out to help make that easier for everyone as well,” he said.

Nigam Shah, Ph.D., chief data scientist at Stanford Health Care and a professor of medicine and biomedical data science at Stanford School of Medicine, said HealthBench is “directionally aligned in spirit” with Stanford researchers’ work on MedHELM, a benchmark that tests LLMs on real-world clinical tasks.

“We definitely need to go beyond saturated QA benchmarks and this is definitely a step in the right direction,” Shah said via email.

“On the technical front, the fact that each model response is graded against a set of physician-written rubric criteria is a very good way to do such evaluations,” he said. “Having an open-source dataset, with 5,000 conversations, further divided into hard and consensus, as well as having the corresponding criteria defined, is a great first step towards better evaluations of LLMs in healthcare settings. In addition, the profiling of the cost-performance trade-off, and breaking down error rates by theme, is very informative,” Shah said.

HealthBench is “highly complementary” to the MedHELM work because it focuses on situations encountered outside of health systems. “We look forward to including this open source dataset in our work. The breadth is also amazing—262 physicians from 60 countries, spanning 49 languages and 26 medical specialties,” Shah noted.

Open-source frameworks such as MedHELM will make it feasible for multiple entities to share benchmark datasets and for others to verify results, Shah noted. “Once that shared evaluation culture is established it creates a strong incentive to create shared benchmarking datasets,” he said.

OpenAI has been ramping up its partnerships with healthcare and life sciences organizations, but HealthBench marks its first open-source benchmark aimed specifically at healthcare.

The company is working with Sanofi and Formation Bio to build an AI-powered tool designed to improve drug development by speeding up clinical trial recruitment. Iodine Software is working with the company to integrate generative AI and large language models, including GPT-4, across the breadth of its solutions for clinical administration and revenue cycle management.

Color Health also built generative AI tools, including an AI-powered cancer co-pilot app created in partnership with OpenAI, and is working with the company to test computer-generated personalized care plans for cancer patients. UTHealth Houston also partners with OpenAI to build and deploy algorithms for use in medical training and at the patient’s bedside.