Scaling Brain Imaging Models for Real-World Clinical Use

Summary: New research from Yale shows that predictive models linking brain activity to behavior can generalize across diverse datasets, a requirement for clinical use. By training models on varied neuroimaging datasets and testing each on the others, the team showed that properly designed models can retain predictive accuracy even when applied to datasets with different demographic, geographic, and clinical profiles.

This work highlights the importance of developing neuroimaging models that perform reliably across diverse populations, including those in underserved rural communities, to help ensure equitable access to future diagnostic and treatment tools.

The study emphasizes that intentionally testing models on heterogeneous data is critical for achieving robust, clinically relevant predictions from brain imaging. Expanding model generalization will strengthen the role of neuroimaging in personalized mental health care and neurological assessment.

Key Facts:

  • Predictive models trained on neuroimaging data showed strong performance across multiple, varied datasets, indicating potential for generalizability.
  • External testing on different datasets is essential for translating neuroimaging models into clinically meaningful tools.
  • Including diverse participant representation in neuroimaging datasets supports the development of fairer mental health diagnostics and treatments.

Source: Yale

Relating patterns of brain activity to behavior is a central goal of neuroimaging research because it can improve understanding of how the brain produces behavior and pave the way for personalized treatments for mental health and neurological conditions.

Researchers often use brain scans and behavioral measurements to train machine learning models that predict symptoms or cognitive traits from brain function. However, many models perform well only on data similar to what they were trained on and frequently fail when applied to datasets with different characteristics.

In a recent study, Brendan Adkinson and colleagues at Yale tested whether predictive models could remain accurate when evaluated on datasets that differed substantially from the training data. Their findings, published in Developmental Cognitive Neuroscience, show that models can indeed generalize across real-world dataset shifts when trained and validated with attention to diversity.

Three models were trained — one on each dataset — and then each model was tested on the other two datasets. Credit: Neuroscience News

“It is common for predictive models to perform well when tested on data similar to what they were trained on,” said Brendan Adkinson, lead author of the study. “But when you test them in a dataset with different characteristics, they often fail, which makes them virtually useless for most real-world applications.”

The root of this problem is variability across datasets. Differences in age ranges, sex, race and ethnicity, geographic recruitment, clinical symptom profiles, imaging tasks and sequences, and behavioral measures can all create what researchers call dataset shifts. These shifts make it difficult for a model trained in one context to succeed in another.

Rather than treating these differences as obstacles, Adkinson and colleagues argue they should be central to model design and evaluation. “Predictive models will only be clinically valuable if they can predict effectively on top of these dataset-specific idiosyncrasies,” he said. Adkinson is an M.D.-Ph.D. candidate in the lab of senior author Dustin Scheinost, associate professor of radiology and biomedical imaging at Yale School of Medicine.

To probe model generalization, the team trained models to predict two cognitive traits—language ability and executive function—using three large, unharmonized developmental datasets that differ markedly from one another. They trained three separate models (one per dataset) and then evaluated each model on the two other datasets to test cross-dataset performance.
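The train-on-one, test-on-the-other-two design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the three cohorts are synthetic stand-ins generated with the standard library, the cohort names are hypothetical, the model (select features correlated with the target, sum them, fit a line) is a deliberately simple placeholder, and Pearson correlation between predicted and observed scores is assumed as the evaluation metric.

```python
import random
from itertools import permutations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def make_dataset(n, shift, noise, seed, n_features=6):
    """Synthetic cohort (assumption): the first three features carry signal,
    the rest are noise. `shift` and `noise` mimic a simple dataset shift."""
    rng = random.Random(seed)
    X = [[rng.gauss(shift, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [sum(row[:3]) + rng.gauss(0.0, noise) for row in X]
    return X, y


def train(X, y, threshold=0.2):
    """Keep features whose correlation with y exceeds the threshold,
    then fit a least-squares line on the sum of the kept features."""
    columns = list(zip(*X))
    selected = [j for j, col in enumerate(columns) if pearson(col, y) > threshold]
    if not selected:
        raise ValueError("no feature passed the selection threshold")
    s = [sum(row[j] for j in selected) for row in X]
    n = len(s)
    ms, my = sum(s) / n, sum(y) / n
    slope = (sum((a - ms) * (b - my) for a, b in zip(s, y))
             / sum((a - ms) ** 2 for a in s))
    return selected, slope, my - slope * ms


def predict(model, X):
    """Apply a trained model to new participants."""
    selected, slope, intercept = model
    return [slope * sum(row[j] for j in selected) + intercept for row in X]


# Hypothetical cohorts that differ in size, feature mean, and noise level.
datasets = {
    "cohort_a": make_dataset(n=300, shift=0.0, noise=1.0, seed=1),
    "cohort_b": make_dataset(n=250, shift=0.5, noise=1.5, seed=2),
    "cohort_c": make_dataset(n=150, shift=-0.5, noise=1.2, seed=3),
}

# One model per cohort, each evaluated on the two cohorts it never saw.
scores = {}
for train_name, test_name in permutations(datasets, 2):
    model = train(*datasets[train_name])
    X_test, y_test = datasets[test_name]
    scores[(train_name, test_name)] = pearson(predict(model, X_test), y_test)

for (train_name, test_name), r in sorted(scores.items()):
    print(f"train {train_name} -> test {test_name}: r = {r:.2f}")
```

Three cohorts yield six ordered train/test pairs, which is why the loop runs over permutations rather than combinations: training on one cohort and testing on a second is a different experiment from the reverse.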

Despite substantial differences between the datasets, the models achieved solid predictive performance by contemporary neuroimaging standards. “That tells us that generalizable models are achievable and that testing across diverse dataset features can help identify models that will perform in real-world settings,” Adkinson said.

The researchers also observed that, for some dataset pairs, a model trained on one dataset predicted another dataset better than a model trained and tested within that same dataset using cross-validation. This unexpected result suggests that a model built on a different sample can sometimes predict a target population better than a model built from that population's own data.
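The comparison in this paragraph is between two ways of scoring predictions for the same target dataset. A rough sketch of both, assuming synthetic data, a toy correlation-weighted model, and Pearson correlation as the metric (none of which come from the study), might look like this:

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def make_dataset(n, noise, seed, n_features=5):
    """Synthetic data (assumption): two informative features plus noise features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [row[0] + row[1] + rng.gauss(0.0, noise) for row in X]
    return X, y


def fit_weights(X, y):
    """Toy model: weight each feature by its training-set correlation with y."""
    return [pearson(col, y) for col in zip(*X)]


def predict(weights, X):
    """Predicted score is the weighted sum of a participant's features."""
    return [sum(w * v for w, v in zip(weights, row)) for row in X]


def within_dataset_r(X, y, k=10, seed=0):
    """k-fold cross-validation: each participant is predicted by a model
    fitted on the other folds of the same dataset."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(y)
    for f in range(k):
        test_idx = idx[f::k]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        w = fit_weights([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(test_idx, predict(w, [X[i] for i in test_idx])):
            preds[i] = p
    return pearson(preds, y)


def cross_dataset_r(train_set, test_set):
    """External validation: fit on one dataset, score on a different one."""
    w = fit_weights(*train_set)
    X_test, y_test = test_set
    return pearson(predict(w, X_test), y_test)


target = make_dataset(n=120, noise=1.5, seed=11)   # hypothetical small target cohort
other = make_dataset(n=600, noise=1.5, seed=12)    # hypothetical larger cohort

within_r = within_dataset_r(*target)
external_r = cross_dataset_r(other, target)
print(f"within-dataset cross-validation: r = {within_r:.2f}")
print(f"trained on other dataset:        r = {external_r:.2f}")
```

Which score comes out higher depends on the data. One plausible reason an externally trained model can win is that a larger training sample gives more stable feature weights than the folds of a small target cohort, though the sketch makes no claim about why this happened in the study.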

A particular concern raised by the team is the overrepresentation of metropolitan populations in large-scale neuroimaging collections. Most large datasets are gathered where recruitment is easier—typically urban and suburban centers—which can bias models and limit their applicability to rural communities or other underrepresented groups.

“If models become strong enough to inform clinical assessment and treatment but don’t generalize to rural populations, those communities risk being underserved,” Adkinson said, noting that he himself comes from a rural background. The team is now exploring approaches to make predictive models more robust for specific populations, including rural residents.

About this AI and neuroimaging research news

Author: Mallory Locklear
Contact: Mallory Locklear – Yale

Original Research: Open access.
“Brain-phenotype predictions of language and executive function can survive across diverse real-world data: Dataset shifts in developmental populations” by Brendan Adkinson et al. Developmental Cognitive Neuroscience


Abstract

Predictive modeling offers a route to more reproducible and generalizable brain–behavior associations in neuroimaging. However, external validation—evaluating a model on independent datasets—is still underused. When it is performed, few studies systematically address how models handle dataset-specific idiosyncrasies, or dataset shifts, that occur across different research sites and collection practices.

This study rigorously tested a range of predictive approaches across three large, unharmonized developmental samples: the Philadelphia Neurodevelopmental Cohort (n=1291), the Healthy Brain Network (n=1110), and the Human Connectome Project in Development (n=428). These datasets differ substantially in age distribution, sex and racial/ethnic composition, recruitment geography, clinical symptom burden, task designs, imaging sequences, and behavioral assessments.

Using advanced modeling and evaluation methods, the authors demonstrate that reproducible and generalizable brain–behavior relationships can be achieved despite pronounced inter-dataset variability. Functional connectome–based predictive models were robust across diverse dataset features, and in some cases a model trained on a different dataset predicted better than a model trained and tested within the same dataset by cross-validation.

Overall, the findings provide a strong foundation for future work focused on validating brain–phenotype associations in real-world and clinical scenarios, and they underscore the importance of testing models on diverse, representative data to ensure broad applicability and fairness.