There have been plenty of headlines in recent years about studies suggesting artificial intelligence can outperform clinicians when it comes to reading X-rays or CT scans.
But a new review, published Wednesday in the BMJ, suggests that many of those studies are flawed and exaggerate those claims.
Led by experts at The Alan Turing Institute in London, researchers examined a sample of more than 7,000 studies that compared the performance of AI against clinicians to assess the methods, risk of bias and adherence to reporting standards.
In that sample, researchers say more than two-thirds of the studies had design issues that increased the risk of biased results. They also found a tendency to overstate the actual results in abstracts, the authors wrote in the review.
From the pool of 7,335 study records, researchers found only two randomized clinical trials that met their eligibility criteria.
More worrisome, researchers found relatively few studies assessing the effectiveness of deep learning, which became a mainstream technology in 2014. That means big technology companies have continued to push solutions toward market without much in the way of clinical evidence to guide their development.
“We found only one randomized trial registered in the US despite at least 16 deep learning algorithms for medical imaging approved for marketing by the Food and Drug Administration,” the authors wrote.
Of the 81 non-randomized studies that met researchers’ criteria, only nine were prospective and only six involved real-world testing. The authors point out that retrospective studies, despite their higher likelihood of bias and confounding, are typically the ones cited in FDA approval notices for AI algorithms.
“Ensuring fair comparison between AI and clinicians is arguably done best in a randomized clinical (or at the very least prospective) setting,” the authors write.
Researchers also noted issues with descriptions and study designs. Many studies provided only vague descriptions of the hardware they deployed, making it difficult to reproduce results.
The authors also found that few studies used more than a handful of expert clinicians in their human comparison group. With a median of just four experts, they suggest, results are prone to bias in favor of AI algorithms, since including nonexpert clinicians could lower the average performance of humans relative to the algorithms.
Finally, a handful of studies claimed in their abstracts that AI delivered comparable or superior diagnostic performance to a clinician, claims that were not necessarily supported by their results. None of the 23 studies that claimed superior performance by AI algorithms made any mention of the need for further prospective testing to reinforce their findings.
To improve the quality of future studies, the authors recommend developing and following a set of best practices to improve design and transparency. A failure to do so, they warn, could ultimately hurt patients.
“Overpromising language could mean that some studies might inadvertently mislead the media and the public, and potentially lead to the provision of inappropriate care that does not align with patients’ best interests,” they write.