Study: ChatGPT Matches Radiologists in Brain Tumor Diagnosis

Summary: Researchers evaluated the diagnostic performance of GPT-4–based ChatGPT against practicing radiologists using 150 preoperative brain tumor MRI reports. ChatGPT reached 73% accuracy for final diagnoses, comparable to neuroradiologists (72% on average) and slightly above general radiologists (68% on average). Its final-diagnosis accuracy was highest (80%) when it interpreted reports written by neuroradiologists, pointing to its promise as a diagnostic support tool in radiology.

As artificial intelligence continues to advance, large language models like GPT-4 are showing increasing utility in interpreting radiology reports. This study highlights how an AI assistant can offer accurate differential and final diagnoses from the text of routine radiology reports, potentially reducing clinician workload and serving as a second opinion in complex brain tumor cases.

Key facts:

  • Overall final-diagnosis accuracy: ChatGPT 73% vs. neuroradiologists 72% (average) and general radiologists 68% (average).
  • ChatGPT's final-diagnosis accuracy was higher (80%) on reports authored by neuroradiologists and lower (60%) on reports authored by general radiologists.
  • ChatGPT achieved strong differential-diagnosis accuracy (94%), outperforming the radiologists, whose accuracy ranged from 73% to 89%.
Final diagnosis accuracy: ChatGPT 73%, neuroradiologists 72% (average), general radiologists 68% (average). Image credit: Neuroscience News

The study, led by graduate student Yasuhito Mitsuyama and Associate Professor Daiju Ueda from Osaka Metropolitan University’s Graduate School of Medicine, used real-world clinical MRI reports produced in Japanese from January 2017 to December 2021. Radiologists translated the reports into English for analysis. The research compared GPT-4–based ChatGPT’s outputs with those from two board-certified neuroradiologists and three general radiologists, asking each to list differential diagnoses and provide a single final diagnosis based solely on the textual MRI findings.

Pathological examination of the excised tumors provided the ground truth for accuracy assessment. Statistical analysis employed McNemar’s test and Fisher’s exact test to compare diagnostic performance. The results showed that GPT-4’s final-diagnosis accuracy (73%) was comparable to that of expert neuroradiologists and higher than the average general radiologist in this dataset.
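To make the paired comparison concrete, the Python sketch below implements an exact two-sided McNemar's test using only the standard library. It assumes each case is scored as correct or incorrect for two readers on the same reports. The function names and the toy inputs are hypothetical, and the article does not say which variant of the test the authors used or exactly how they applied it.

```python
from math import comb


def discordant_counts(reader_a, reader_b):
    """Count cases where exactly one of two readers was correct.

    reader_a, reader_b: equal-length sequences of booleans (True = correct
    diagnosis) scored on the same cases. Returns (b, c), where b counts
    "A correct, B wrong" and c counts "A wrong, B correct".
    """
    if len(reader_a) != len(reader_b):
        raise ValueError("both readers must be scored on the same cases")
    b = sum(1 for x, y in zip(reader_a, reader_b) if x and not y)
    c = sum(1 for x, y in zip(reader_a, reader_b) if not x and y)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either reader, so min(b, c) follows a Binomial(b + c, 0.5) tail.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy correctness flags for eight cases. These are NOT data from the study.
model = [True, True, False, True, True, False, True, True]
reader = [True, False, False, True, True, True, True, False]

b, c = discordant_counts(model, reader)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}; exact two-sided p = {p_value:.3f}")
```

Only the discordant pairs enter the test. Cases that both readers got right, or both got wrong, carry no information about which reader is more accurate, which is why two readers with similar overall accuracy can still be compared meaningfully on the same set of reports.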

A notable finding is that ChatGPT’s final-diagnosis performance depended on who wrote the input report. When interpreting reports written by neuroradiologists, ChatGPT reached 80% accuracy, while reports written by general radiologists yielded only 60% accuracy. This suggests that AI performance benefits from clearer, more specialized reporting and that AI tools and reporting practices could be co-optimized to improve outcomes.
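The gap between the two report types is a comparison of proportions across two independent groups of reports, which is the setting Fisher's exact test is designed for. The Python sketch below computes a two-sided p-value for a 2x2 table using only the standard library. The function name and the counts are hypothetical: the article gives percentages but not the number of reports in each group, so the table is a toy example and not the study's data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups of reports; columns are correct / incorrect.
    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed table.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    if total == 0:
        return 1.0
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Toy counts, NOT the study's data: 8 of 10 correct in one group of
# reports versus 6 of 10 correct in the other.
p_value = fisher_exact_two_sided(8, 2, 6, 4)
print(f"two-sided Fisher exact p = {p_value:.3f}")
```

With groups this small, a 20-point difference in accuracy is far from conclusive. The test's verdict depends on the group sizes as well as on the percentages, so the article's figures alone do not determine a p-value.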

In differential diagnosis tasks—where multiple possible tumor types are listed—ChatGPT achieved an accuracy of 94%, outperforming radiologists, whose differential accuracies ranged from 73% to 89%. For differential lists, GPT-4’s performance was stable regardless of whether the original report came from a neuroradiologist or a general radiologist.

Graduate student Mitsuyama commented that the findings support ChatGPT’s usefulness for preoperative MRI diagnosis of brain tumors. The research team plans to expand evaluation of large language models into other diagnostic imaging areas, aiming to reduce clinician workload, improve diagnostic accuracy, and incorporate AI into medical education and training.

About this AI and brain tumor imaging study

Author: Yung-Hsiang Kao
Source: Osaka Metropolitan University
Contact: Yung-Hsiang Kao – Osaka Metropolitan University
Image: The image is credited to Neuroscience News

Original research: Open access. “Comparative analysis of GPT-4-based ChatGPT’s diagnostic performance with radiologists using real-world radiology reports of brain tumors” by Yasuhito Mitsuyama et al., published in European Radiology.


Abstract

Comparative analysis of GPT-4-based ChatGPT’s diagnostic performance with radiologists using real-world radiology reports of brain tumors

Objectives

Large language models such as GPT-4 have demonstrated diagnostic potential in radiology, but prior evaluations often relied on textbook-style quizzes or curated cases. This study aimed to measure GPT-4–based ChatGPT’s real-world diagnostic capability using routine clinical MRI reports for brain tumors and to directly compare that performance with neuroradiologists and general radiologists.

Methods

Researchers collected preoperative brain MRI reports written in Japanese from two institutions (2017–2021). Radiologists translated the reports into English. The same textual findings were shown to GPT-4 and to five radiologists, who provided differential diagnoses and a single final diagnosis for each case. The pathological diagnosis after tumor resection served as the reference standard. Statistical comparisons used McNemar’s test and Fisher’s exact test.

Results

Across 150 reports, GPT-4 achieved a 73% final-diagnosis accuracy, while radiologists’ final-diagnosis accuracy ranged from 65% to 79%. Performance varied by report authorship: GPT-4 reached 80% accuracy on neuroradiologist-authored reports and 60% on general radiologist-authored reports. For differential diagnoses, GPT-4’s accuracy was 94%, compared with 73%–89% for radiologists. Importantly, GPT-4’s differential-diagnosis accuracy was consistent regardless of the report’s author specialty.

Conclusion

GPT-4 demonstrated final-diagnosis performance comparable to that of neuroradiologists when interpreting clinical MRI reports of brain tumors. The model could serve as a useful second opinion for neuroradiologists and a practical guidance tool for general radiologists and trainees.

Clinical relevance statement

This study shows that a GPT-4–based large language model can interpret real-world MRI findings for brain tumors with accuracy competitive with radiologists, supporting its potential role as an adjunct diagnostic aid in clinical radiology practice.