Can Chatbots Detect Side Effects of Mental Health Medications?

Summary: As gaps in mental healthcare persist, more people are turning to AI chatbots for guidance about psychiatric medication side effects. A new study from the Georgia Institute of Technology assessed how well large language models (LLMs) detect and respond to these complex, high-risk situations, revealing important limitations in accuracy and actionability.

Although many AI systems can replicate the tone and empathy of a psychiatrist, researchers found they often struggle to identify adverse drug reactions correctly and to offer concrete, clinically useful steps. The study underscores the urgent need for safer, better-calibrated chatbots tailored to mental health care.

Key Facts:

  • Accuracy gaps: AI chatbots frequently misidentify psychiatric medication side effects or provide vague, non-actionable guidance.
  • Emotional tone versus clinical expertise: While LLMs can mimic a caring, professional tone, their clinical recommendations often fall short of expert standards.
  • High stakes: Relying on LLMs in mental health crises poses risks, especially for underserved populations with limited access to clinical care.

Source: Georgia Institute of Technology

Why People Turn to AI

AI chatbots powered by large language models are available around the clock, often at low or no cost, and can synthesize large amounts of information. These features make them attractive options for people seeking quick answers about mental health concerns or medication side effects, particularly when access to clinicians is limited.

Chandra notes that improving AI for psychiatric and mental health concerns would be particularly life-changing for communities that lack access to mental healthcare. Credit: Neuroscience News

Recognizing this trend, researchers asked a pressing question: how do LLMs perform when users describe possible mental health emergencies or adverse reactions to psychiatric medications?

Globally, including in the United States, many people face barriers to timely mental healthcare. That reality has pushed more users to seek answers from online forums and AI assistants. To evaluate the safety and usefulness of these tools in high-risk contexts, Georgia Tech researchers developed a structured framework to measure LLM performance on medication side-effect detection and response quality.

The project was led by Munmun De Choudhury, J.Z. Liang Associate Professor in the School of Interactive Computing and a faculty member of the Georgia Tech Institute for People and Technology, together with Mohit Chandra, a Ph.D. student in computer science.

“People use AI chatbots for anything and everything,” said Chandra, the study’s first author. “When access to healthcare providers is limited, users increasingly rely on AI to interpret symptoms and decide what to do next. We wanted to see how these models handle the nuance and subjectivity of mental health scenarios.”

Putting AI to the Test

The researchers set out to answer two main questions: (1) Can LLMs accurately detect whether a user is describing medication side effects or adverse drug reactions? (2) If they can detect such situations, do LLMs offer effective, actionable harm-reduction strategies aligned with clinical practice?

To create a realistic dataset, the team collected user posts from Reddit, where many people discuss medication experiences, and worked with psychiatrists and psychiatry trainees to label clinical ground truth. The study evaluated nine LLMs, including general-purpose models such as GPT-4o and Llama-3.1, as well as specialized medical models trained on clinical data.

Using clinician-provided criteria, the researchers measured how precisely each model detected adverse reactions and how accurately it categorized different types of side effects. They also prompted LLMs to generate responses to real user queries and compared those replies to clinician answers across four dimensions: emotional tone, readability, proposed harm-reduction strategies, and actionability.
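
The article does not describe the study's scoring code, so the following is only a minimal sketch of how detection quality could be measured against clinician labels. It assumes each post carries a yes/no clinician judgment on whether an adverse drug reaction is described, alongside a yes/no model prediction. The function name, the class name, and the sample labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionScores:
    """Hypothetical container for binary detection metrics."""
    precision: float
    recall: float
    accuracy: float


def score_adr_detection(
    clinician_labels: Sequence[bool],
    model_labels: Sequence[bool],
) -> DetectionScores:
    """Compare model yes/no predictions with clinician ground truth.

    True means "this post describes an adverse drug reaction".
    This is an illustrative metric calculation, not the study's method.
    """
    if len(clinician_labels) != len(model_labels):
        raise ValueError("Label lists must have the same length")

    pairs = list(zip(clinician_labels, model_labels))
    tp = sum(1 for truth, pred in pairs if truth and pred)          # correctly flagged
    fp = sum(1 for truth, pred in pairs if not truth and pred)      # false alarms
    fn = sum(1 for truth, pred in pairs if truth and not pred)      # missed reactions
    tn = sum(1 for truth, pred in pairs if not truth and not pred)  # correctly cleared

    # Guard against division by zero when a class is absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return DetectionScores(precision, recall, accuracy)


if __name__ == "__main__":
    # Made-up labels for five posts, for illustration only.
    clinician = [True, True, False, False, True]
    model = [True, False, True, False, True]
    print(score_adr_detection(clinician, model))
```

In a setting like this, recall is arguably the number to watch most closely: a missed adverse reaction (a false negative) is likely to be more harmful to the user than a false alarm.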

Overall, the team found that LLMs often misread the subtleties of adverse drug reactions and had trouble distinguishing between types of side effects. Model responses typically sounded empathetic and professional, but their clinical guidance generally lacked the precision and directiveness clinicians would recommend in high-risk cases.

Implications: Better Bots, Better Outcomes

These findings point to concrete opportunities to improve LLMs used in mental health contexts. Developers can use the study’s framework to refine models so their medical detection is more accurate and their recommended strategies are more actionable and personalized.

Chandra emphasizes what is at stake: for populations with little or no access to mental healthcare, well-designed AI tools could provide essential support. “Models that are always available, that can explain complex issues in users’ native languages, and that offer clear next steps could be invaluable,” he said. “But when AI gives incorrect information, the consequences can be serious. Studies like ours help reveal where LLMs fall short and where to focus improvements.”

Citation: Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use (Chandra et al., NAACL 2025).

Funding: National Science Foundation (NSF), American Foundation for Suicide Prevention (AFSP), Microsoft Accelerate Foundation Models Research grant program. The findings, interpretations, and conclusions presented are those of the authors and do not represent the official views of NSF, AFSP, or Microsoft.

About this AI and psychopharmacology research news

Author: Catherine Barzler
Source: Georgia Institute of Technology
Contact: Catherine Barzler, Georgia Institute of Technology
Image: The image is credited to Neuroscience News

Original Research: Findings presented at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL).