'Do we really want it just to pass?': Study finds ChatGPT fails gastroenterology training exam

ChatGPT failed to pass the 2021 and 2022 self-assessment tests for the American College of Gastroenterology, a new study found.

Published earlier this week in the American Journal of Gastroenterology, the study found the tool failed both multiple-choice tests, which serve as a barometer for how a test-taker would fare on the American Board of Internal Medicine Gastroenterology board exam.

ChatGPT is a large language model (LLM): a 175-billion-parameter natural language processing model trained to predict word sequences based on context and generate human-like text in response to user prompts.
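
For readers unfamiliar with the mechanism, next-word prediction can be demonstrated in a few lines of code. The sketch below is illustrative only: it uses the small, openly downloadable GPT-2 model via the Hugging Face transformers library as a stand-in, since ChatGPT itself cannot be run locally, and the medical prompt is an invented example.

```python
# Illustrative sketch of next-word prediction, the mechanism described
# above. GPT-2 stands in for ChatGPT, which is not openly downloadable.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The most common cause of peptic ulcer disease is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary token, per position

# The final position's scores rank candidate next words given the context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```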

The GPT-3 and GPT-4 versions of ChatGPT, both with training data extending only through 2021, scored below the 70% required to pass the exam. For ChatGPT to become a reliable and widely accepted education tool, it should consistently provide more than 95% accuracy, the authors wrote.

The tests included hundreds of questions with real-time feedback on the correct answer. Each multiple-choice question was copied and pasted directly into ChatGPT, and a corresponding answer was selected on the online test based on the tool's response. The researchers found no pattern in the types of questions it answered incorrectly.
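
The researchers did this by hand, but the workflow is straightforward to picture in code. The sketch below is a hypothetical reconstruction using OpenAI's public chat API, not the authors' actual method or materials; the sample question, answer key and letter-matching rule are all invented for illustration.

```python
# Hypothetical sketch of the study's grading workflow: feed each
# multiple-choice question to the model, record its pick, and score it
# against the answer key. The question below is invented for illustration.
from openai import OpenAI  # assumes the openai Python package, v1 or later

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    {
        "stem": "Which organism is most closely associated with peptic ulcer disease?",
        "choices": {"A": "H. pylori", "B": "E. coli", "C": "C. difficile", "D": "S. aureus"},
        "answer": "A",
    },
    # ...the remaining exam questions would follow the same shape
]

correct = 0
for q in questions:
    prompt = q["stem"] + "\n" + "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer with the single letter of the best choice."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content.strip()
    if reply and reply[0].upper() == q["answer"]:  # naive letter-matching rule
        correct += 1

print(f"Score: {correct / len(questions):.0%} (70% needed to pass)")
```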

The AI tool made waves in healthcare when a study found it passed the U.S. Medical Licensing Exam. Since then, some advocates have called for multidisciplinary training that incorporates AI to keep up with the changing medical landscape, the study noted.

“There’s been an increased use of ChatGPT in every field, but in medicine, we noticed more and more people using it,” Arvind Trindade, M.D., associate professor at The Feinstein Institutes for Medical Research and senior author of the study, told Fierce Healthcare. Both trainees and patients have been seen using it, he said, so the authors wanted to put it to the test. 

“We actually thought it was going to do pretty well,” Trindade said. “When we looked at the final results we were a bit surprised.” 

ChatGPT was never specifically trained on medical literature, and its training data extends only through 2021, the study explained. Most of that data came from openly available sources; answering certain questions on the gastroenterology exam correctly may have required access to paid journal subscriptions or databases.

"Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time and has a ways to go before it should be implemented into the health care field," Trindade said.

Especially tricky is that ChatGPT answers confidently even when it is wrong, so confidently that its response might be mistaken for truth.

“You don’t want to learn the wrong information,” Trindade said. “This is far from optimized for medical usage.” 

A notable difference between the gastroenterology exam and the medical licensing exam is that the latter has a lower passing threshold. There may also be less publicly available material for answering gastroenterology-specific questions accurately.

Regardless, even on the exams the tool did pass, it squeaked by rather than scoring high. “Do we really want it just to pass?” Trindade said.

Andrew Yacht, M.D., senior vice president of academic affairs and chief academic officer at Northwell Health, said the study is a "reminder that, at least for now, nothing beats hitting time-tested resources like books, journals and traditional studying to pass those all-important medical exams."

While medical schools can’t enforce where students get their medical information, there are recommended sources, Trindade said, including medical guidelines, journals and databases. If it is to be used for medical education, future versions of ChatGPT should be trained on the latest medical guidelines and actively updated.