Summary: A new AI-driven method detected long COVID in 22.8% of patients—substantially higher than prior estimates—by analyzing nearly 300,000 de-identified electronic health records. The algorithm separates symptoms specifically associated with SARS-CoV-2 infection from pre-existing health issues, helping clinicians identify post-acute sequelae of COVID-19 more accurately.
This approach, called “precision phenotyping,” systematically reviews longitudinal clinical data to distinguish long COVID symptoms from other conditions. In chart-review validation it was approximately 3% more precise, in relative terms, than relying solely on the ICD-10 U09.9 code (79.9% versus 77.8% precision).
Key Facts:
- AI-based precision phenotyping: Uses longitudinal EHR data to flag post-COVID symptoms only after excluding other plausible medical explanations, improving cohort curation.
- Reduced bias and broader representation: The algorithm produces diagnoses that mirror Massachusetts demographics more closely than single-code approaches, addressing access-to-care bias in existing diagnostic codes.
- Research and clinical potential: The curated cohort can support future investigations into genetic, clinical, and biochemical differences across long COVID subtypes, as well as improve patient identification for care.
Source: Harvard / Mass General Brigham
Overview
New research from Mass General Brigham (MGB) used an AI-driven algorithm to analyze de-identified clinical records from nearly 300,000 patients across 14 hospitals and 20 community health centers in Massachusetts. The study found an estimated long COVID (post-acute sequelae of COVID-19, PASC) prevalence of 22.8%—considerably higher than earlier reports that suggested about 7%—by identifying patients whose persistent symptoms were associated with prior SARS-CoV-2 infection and not explained by pre-existing conditions.

Published as an open-access preprint, the study outlines how the precision phenotyping algorithm uses attention-based mechanisms to exclude symptoms explained by prior diagnoses, then tracks lingering signs across a 12-month follow-up window. For a symptom to qualify as long COVID in this framework, it had to persist for at least two months and be temporally associated with a documented COVID-19 infection while remaining unexplained by the patient’s prior medical history.
Common long COVID symptoms examined include prolonged fatigue, persistent cough, shortness of breath, and cognitive impairment often described as “brain fog.” The algorithm evaluates whether such symptoms are better explained by pre-existing conditions—such as heart failure or asthma in cases of breathlessness—before assigning a long COVID label. Only when alternative explanations are excluded does the tool flag the case as PASC.
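The qualification criteria described above can be illustrated with a small rule-based sketch. This is not the study's attention-based algorithm; it only encodes the three conditions stated in the text (onset within the 12-month follow-up window after a documented infection, persistence of at least two months, and no prior condition that better explains the symptom). The names `SymptomEpisode`, `EXPLAINED_BY`, and `is_candidate_pasc`, the 60-day and 365-day thresholds, and the lookup table are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed thresholds: 60 days stands in for "two months",
# 365 days for the 12-month follow-up window.
MIN_PERSISTENCE = timedelta(days=60)
FOLLOW_UP_WINDOW = timedelta(days=365)

# Hypothetical lookup of prior conditions that could explain a symptom.
# The breathlessness entry mirrors the article's own example.
EXPLAINED_BY = {
    "shortness of breath": {"heart failure", "asthma"},
}


@dataclass
class SymptomEpisode:
    name: str
    first_seen: date
    last_seen: date


def is_candidate_pasc(episode, infection_date, prior_conditions):
    """Return True only if all three conditions from the text hold."""
    # 1. Temporal association: onset after infection, inside the window.
    window_end = infection_date + FOLLOW_UP_WINDOW
    if not (infection_date <= episode.first_seen <= window_end):
        return False
    # 2. Persistence: the symptom lasted at least two months.
    if episode.last_seen - episode.first_seen < MIN_PERSISTENCE:
        return False
    # 3. Exclusion: no prior condition offers a better explanation.
    explanations = EXPLAINED_BY.get(episode.name, set())
    return not (explanations & set(prior_conditions))
```

For example, a three-month episode of breathlessness after infection would be flagged for a patient with no relevant history, but not for a patient with a prior asthma diagnosis. A real system would need far richer handling of timing and comorbidity than this lookup.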
“Our AI tool could turn a foggy diagnostic process into something sharp and focused, giving clinicians the power to make sense of a challenging condition,” said Hossein Estiri, senior author and head of AI Research at the Center for AI and Biomedical Informatics of the Learning Healthcare System (CAIBILS) at MGB, and an associate professor of medicine at Harvard Medical School. “With this work, we may finally be able to see long COVID for what it truly is — and more importantly, how to treat it.”
Co-lead author Alaleh Azhir, an internal medicine resident at Brigham and Women’s Hospital, emphasized the clinical value of a methodical, AI-supported review of complex histories: “Physicians often face tangled webs of symptoms and busy caseloads. A tool that can systematically evaluate records and point to probable PASC cases could be a game-changer.”
Performance and implications
Compared with cases identified solely by the ICD-10 U09.9 diagnostic code, the precision phenotyping algorithm identified a research cohort exceeding 24,000 patients versus about 6,000 identified by the code alone. Independent chart reviews used for validation showed the AI method achieved about 79.9% precision, slightly higher than the 77.8% precision measured for the ICD-10 code. The researchers report that the algorithm reduces bias tied to healthcare access and yields prevalence estimates consistent with regional data.
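The figures above can be compared with a few lines of arithmetic. The sketch below uses only the numbers reported in this article and shows how the precision gap reads as an absolute difference (percentage points) versus a relative improvement. The function names are illustrative, and the cohort sizes are rounded because the article gives them as "exceeding 24,000" and "about 6,000".

```python
def absolute_gain(new_pct, old_pct):
    """Difference in percentage points between two precision values."""
    return new_pct - old_pct


def relative_gain(new_pct, old_pct):
    """Improvement expressed as a percentage of the old value."""
    return (new_pct - old_pct) / old_pct * 100


AI_PRECISION = 79.9   # reported precision of the phenotyping algorithm (%)
ICD_PRECISION = 77.8  # reported precision of the ICD-10 U09.9 code (%)

points = absolute_gain(AI_PRECISION, ICD_PRECISION)
relative = relative_gain(AI_PRECISION, ICD_PRECISION)

# Rounded cohort sizes from the article; the ratio is therefore approximate.
cohort_ratio = 24_000 / 6_000

print(f"Absolute gain: {points:.1f} percentage points")
print(f"Relative gain: {relative:.1f}%")
print(f"Cohort size ratio: about {cohort_ratio:.0f}x")
```

On these numbers the gain is 2.1 percentage points, or roughly 2.7% in relative terms, which is in line with the "approximately 3%" figure quoted for the method, while the identified cohort is about four times larger.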
Beyond improving detection, the curated PASC cohort can enable deeper study of long COVID’s clinical features, organ-specific effects, comorbidity patterns, and temporal risk dynamics. The team plans to release the algorithm openly so healthcare systems worldwide can apply it to their patient populations and refine identification of PASC subgroups for targeted research.
Limitations
The authors note several limitations. EHR data may omit details that clinicians capture in narrative notes after visits, which could affect symptom attribution. The algorithm’s exclusion rules may also omit cases where an existing condition worsened as a consequence of COVID-19—episodes that might in some cases represent long COVID. Additionally, lower rates of COVID-19 testing in later phases of the pandemic make it harder to establish precise infection timing. The study cohort was limited geographically to patients in Massachusetts.
Next steps
Future research will test the algorithm in patients with specific chronic conditions (for example, COPD or diabetes) and further examine its performance in diverse settings. The curated cohort will support studies into genetic, metabolomic, and clinical determinants of long COVID subtypes, helping to address gaps left by smaller or biased cohorts.
“Questions about the true burden of long COVID—questions that have thus far remained elusive—now seem more within reach,” said Estiri.
Funding: This work was supported by the National Institutes of Health: National Institute of Allergy and Infectious Diseases (NIAID) R01AI165535; National Heart, Lung, and Blood Institute (NHLBI) OT2HL161847; and National Center for Advancing Translational Sciences (NCATS) UL1 TR003167, UL1 TR001881, and U24TR004111. J. Hügel’s work received partial support from a fellowship within the IFI program of the German Academic Exchange Service (DAAD), the Federal Ministry of Education and Research (BMBF), and the German Research Foundation (426671079).
About this AI and long COVID research news
Author: MGB Communications
Contact: MGB Communications – Harvard
Original Research: Open access. “Precision Phenotyping for Curating Research Cohorts of Patients with Post-Acute Sequelae of COVID-19 (PASC) as a Diagnosis of Exclusion” by Hossein Estiri et al., medRxiv. The study presents the development and validation of an attention-based precision phenotyping algorithm applied to longitudinal EHR data from over 295,000 patients to improve detection, reduce bias, and refine prevalence estimates for PASC.
Abstract
Precision Phenotyping for Curating Research Cohorts of Patients with Post-Acute Sequelae of COVID-19 (PASC) as a Diagnosis of Exclusion
Scalable identification of patients with PASC is hampered by the lack of reproducible precision phenotyping methods and by limitations of the ICD-10 U09.9 code, which underestimates prevalence and introduces demographic bias. In a retrospective case-control design, this study developed an attention-based precision phenotyping algorithm that excludes sequelae better explained by prior conditions and validates performance with independent chart review. Applied to longitudinal EHR data from over 295,000 patients across hospitals and community health centers in Massachusetts, the algorithm identified more than 24,000 PASC cases with improved precision and an estimated prevalence of 22.8%. The curated cohort offers a foundation for future investigations into long COVID’s clinical, genetic, and biochemical heterogeneity.