Summary: Researchers at Penn State have developed the Genetic Progression Score (GPS), an AI-driven method that integrates genetic studies and electronic health record (EHR) data to predict which individuals with preclinical autoimmune symptoms will progress to full disease. GPS combines genome-wide association study (GWAS) summary statistics with biobank data through a transfer-learning approach. In tests, it improved prediction accuracy over existing models by roughly 25% to as much as 1,000%, enabling earlier identification of high-risk patients and more targeted interventions.
This approach identifies at-risk individuals earlier in the disease pathway, offering opportunities for timely monitoring, preventative treatment, and more personalized disease management. The GPS framework can be adapted to study other underrepresented or small-sample disease populations, supporting advances in precision medicine and health equity.
Key Facts:
- GPS integrates large case-control GWAS summary statistics with EHR-based biobank data to improve progression prediction for autoimmune diseases.
- The model outperformed 20 other prediction strategies, delivering significantly higher accuracy when forecasting progression from preclinical stages to diagnosed disease.
- Early detection via GPS enables proactive monitoring, targeted therapies, improved clinical trial recruitment, and personalized care decisions.
Source: Penn State
Background: Autoimmune diseases occur when the immune system attacks healthy cells and tissues. Many conditions have a detectable preclinical phase—mild symptoms or circulating antibodies—before a formal diagnosis. In some people, early signs never progress; in others, they do. Identifying who will progress is vital for early diagnosis, timely intervention and preventing irreversible damage.

Led by Dajiang Liu and Bibo Jiang at the Penn State College of Medicine, the research team applied artificial intelligence to EHRs and large genetic datasets to develop GPS. Their method uses transfer learning to combine the strengths of two common data sources: GWAS case-control studies (which often have large sample sizes for specific diseases) and EHR-linked biobanks (which provide detailed clinical trajectories and can identify preclinical individuals). By integrating both, GPS refines polygenic risk information to focus explicitly on the transition from preclinical states to overt autoimmune disease.
Transfer learning allows GPS to borrow predictive patterns learned from large GWAS datasets and adapt them to the smaller, more clinically detailed biobank samples. This reduces reliance on large labeled samples of progression cases—often scarce for any single autoimmune disease—while retaining information that is most relevant to disease development in real-world patient populations.
In practice, GPS uses penalized regression to combine PRS (polygenic risk score) weights derived from case-control GWAS with biobank progression data. The model treats GWAS-derived weights as an informative prior and adjusts parameters when that prior improves prediction accuracy in the clinical cohort. This hybrid strategy improves stability and performance, especially when biobank sample sizes are limited or genetic correlations between progression and case-control phenotypes are modest.
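The idea of treating GWAS-derived weights as a prior inside a penalized regression can be illustrated with a small sketch. This is not the authors' implementation: it assumes a continuous outcome, a squared-error loss, and a simple ridge-style penalty that shrinks the fitted weights toward the prior instead of toward zero. The function names (`fit_prior_shrunk_ridge`, `select_lambda`) and the synthetic data are hypothetical and exist only for illustration.

```python
import numpy as np

def fit_prior_shrunk_ridge(X, y, beta_prior, lam):
    """Minimize ||y - X b||^2 + lam * ||b - beta_prior||^2.

    Closed form: b = (X'X + lam*I)^-1 (X'y + lam*beta_prior).
    lam = 0 ignores the prior (ordinary least squares);
    a very large lam returns (almost exactly) the prior weights.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    rhs = X.T @ y + lam * beta_prior
    return np.linalg.solve(A, rhs)

def select_lambda(X_train, y_train, X_val, y_val, beta_prior, lambdas):
    """Pick the penalty strength with the lowest validation error.

    Returns (best_lambda, validation_mse, fitted_weights). The prior is
    trusted more (larger lambda) only when that helps held-out prediction.
    """
    best = None
    for lam in lambdas:
        beta = fit_prior_shrunk_ridge(X_train, y_train, beta_prior, lam)
        mse = float(np.mean((y_val - X_val @ beta) ** 2))
        if best is None or mse < best[1]:
            best = (lam, mse, beta)
    return best

if __name__ == "__main__":
    # Synthetic stand-in for a small biobank cohort (assumption, not real data).
    rng = np.random.default_rng(0)
    n_people, n_variants = 100, 20
    beta_true = rng.normal(size=n_variants)
    # Hypothetical "GWAS" weights: correlated with, but not equal to, the truth.
    beta_prior = beta_true + 0.3 * rng.normal(size=n_variants)
    X = rng.normal(size=(n_people, n_variants))
    y = X @ beta_true + rng.normal(size=n_people)

    lam, mse, beta = select_lambda(
        X[:60], y[:60], X[60:], y[60:], beta_prior,
        lambdas=[0.0, 0.1, 1.0, 10.0, 100.0],
    )
    print("selected lambda:", lam, "validation MSE:", round(mse, 3))
```

Choosing the penalty strength on held-out data is one simple way to let the model lean on the prior only when it helps. The published method may use a different loss, penalty, or tuning procedure.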
The researchers evaluated GPS using data from the Vanderbilt University biobank (BioVU) to predict progression for rheumatoid arthritis and systemic lupus erythematosus, and validated results in the All of Us biobank. GPS outperformed 20 other modeling approaches, including methods that used only biobank data, only GWAS data, or simple combinations of both. For both diseases, GPS achieved the highest prediction R² and the strongest correlation between PRS and progression prevalence.
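For readers unfamiliar with these two metrics, the sketch below shows one plausible way to compute them from a vector of risk scores and observed progression outcomes. It rests on assumptions the article does not state: R² is taken as the squared Pearson correlation between score and outcome, and "progression prevalence" is computed within equal-sized score bins. The study may define both differently, and the function names are hypothetical.

```python
import numpy as np

def prediction_r2(outcome, score):
    """Squared Pearson correlation between observed outcome and risk score."""
    r = np.corrcoef(outcome, score)[0, 1]
    return r ** 2

def prevalence_by_score_bin(outcome, score, n_bins=5):
    """Share of individuals who progressed within each score bin.

    Individuals are sorted by score and split into n_bins groups of
    (nearly) equal size, from lowest to highest score.
    """
    outcome = np.asarray(outcome, dtype=float)
    order = np.argsort(score)
    groups = np.array_split(order, n_bins)
    return np.array([outcome[idx].mean() for idx in groups])

def bin_prevalence_correlation(outcome, score, n_bins=5):
    """Correlation between bin rank and progression prevalence in that bin."""
    prevalence = prevalence_by_score_bin(outcome, score, n_bins)
    return np.corrcoef(np.arange(n_bins), prevalence)[0, 1]
```

A score that ranks people well should show prevalence rising steadily from the lowest to the highest bin, which is what a strong positive bin-level correlation reflects.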
Improved prediction has several clinical implications: earlier identification of high-risk individuals for close monitoring, selection of patients most likely to benefit from early therapeutic interventions, and better design and recruitment for clinical trials aimed at preventing disease or slowing its progression. The transfer-learning approach can also help researchers study patient groups that are underrepresented in traditional datasets, reducing disparities in genetic and clinical research.
“By targeting a population with family history or early symptoms, machine learning can identify those at highest risk and point toward therapeutics that may slow disease progression,” said Dajiang Liu. He emphasized that earlier detection is especially important because autoimmune damage can be irreversible once the disease advances. Approximately 8% of Americans live with autoimmune disease, with a higher prevalence among women.
The interdisciplinary team includes clinicians and geneticists who have collaborated on autoimmune research for nearly a decade. Co-first authors Chen Wang and Havell Markus contributed to study design and analysis alongside senior collaborators Laura Carrel, Galen Foulke, Nancy Olsen and others. Additional contributors came from Vanderbilt University School of Medicine and the University of Texas Southwestern Medical Center. Funding was provided by the National Institutes of Health, including the National Institute of Allergy and Infectious Diseases Office of Data Science and Emerging Technologies.
About this AI and neurology research news
Author: Christine Yu
Contact: Christine Yu – Penn State
Original Research: Open access. “Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages” by Dajiang Liu et al., Nature Communications. DOI: 10.1038/s41467-024-55636-6
Abstract
Autoimmune diseases often show a preclinical stage before clinical diagnosis. EHR-based biobanks contain genetic and clinical data that can identify preclinical individuals at risk of progression. Because biobanks typically have limited case counts, constructing accurate polygenic risk scores (PRS) for progression is challenging. However, progression and case-control phenotypes can share genetic architecture, which can be leveraged to improve predictive performance.
The Genetic Progression Score (GPS) integrates case-control GWAS summary statistics and biobank progression data through a penalized regression framework. GPS treats GWAS-derived PRS weights as priors and adapts model parameters when the prior enhances prediction. Simulations and real-data analyses demonstrate that GPS outperforms strategies relying on biobank or case-control data alone and outperforms other methods that combine both sources. Improvements are most pronounced when biobank samples are small or genetic correlation between progression and case-control phenotypes is low. Applied to progression in rheumatoid arthritis and systemic lupus erythematosus, GPS achieved the highest prediction R² and the strongest correlation between PRS and progression prevalence.