Study Finds ChatGPT Nearly Undetectable in Medical Advice

Summary: A recent study from NYU finds that ChatGPT’s healthcare-related responses are often indistinguishable from those provided by human healthcare professionals, suggesting chatbots may play a useful role in patient-provider communication.

In a survey of 392 adults, participants reviewed a mix of responses, some written by human providers and others generated by ChatGPT. They identified the source with similar accuracy for chatbot and human responses, while their trust in chatbots varied with the complexity of the health-related task.

Key Facts:

Participants correctly identified ChatGPT-generated healthcare responses 65.5% of the time and human provider responses 65.1% of the time.
Trust in chatbots averaged 3.4 out of 5 overall. It was higher for logistical and preventive-care questions and lower for diagnostic or treatment guidance.
Researchers propose chatbots could assist with patient-provider communication—especially for administrative topics and chronic disease management—while urging caution before deploying chatbots for more clinical decisions.

Source: NYU

Study overview — Researchers from NYU Tandon School of Engineering and NYU Grossman School of Medicine conducted a controlled survey testing whether laypeople can distinguish ChatGPT’s medical advice from that of human clinicians. The findings indicate that many chatbot replies are difficult for patients to tell apart from provider-generated responses, highlighting both opportunities and limits for chatbot use in healthcare settings.

The research team presented 392 U.S. adults with ten representative patient questions drawn from electronic health records. For each question, respondents saw a single reply, either written by a human provider or generated by ChatGPT. Respondents were told that five answers were from human providers and five were from a chatbot, and they were financially incentivized to identify the source correctly. Participants also rated how much they would trust chatbots for different healthcare functions on a five-point scale ranging from completely untrustworthy to completely trustworthy.

The ability to correctly classify replies varied considerably by question, with correct identification rates ranging from 49% to 85.7%. On average, chatbot replies were identified correctly in 65.5% of instances (1,284 of 1,960), while human provider replies were identified correctly in 65.1% of instances (1,276 of 1,960). These results were consistent across demographic groups represented in the sample.
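For readers who want to check the arithmetic behind the two headline rates, the short Python sketch below recomputes them from the reported counts. The denominator of 1,960 is consistent with 392 participants each judging five chatbot replies and five human replies; that breakdown is an inference from the figures above, not something the article states directly, and the names `PARTICIPANTS`, `REPLIES_PER_SOURCE`, and `percent_correct` are illustrative only.

```python
# Counts are taken from the article. The 392 x 5 breakdown is an assumption
# inferred from the reported denominators, not a detail stated by the authors.
PARTICIPANTS = 392
REPLIES_PER_SOURCE = 5  # assumed: 5 chatbot and 5 human replies per participant


def percent_correct(correct: int, total: int) -> float:
    """Return correct identifications as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


# Total judgments made about each source type across all participants.
total_per_source = PARTICIPANTS * REPLIES_PER_SOURCE

chatbot_rate = percent_correct(1284, total_per_source)
human_rate = percent_correct(1276, total_per_source)

print(f"judgments per source: {total_per_source}")
print(f"chatbot replies identified correctly: {chatbot_rate}%")
print(f"human replies identified correctly: {human_rate}%")
```

Both rates sit above the 50% expected from guessing, but they differ from each other by less than half a percentage point, which is why the authors describe the two kinds of reply as only weakly distinguishable.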

Respondents reported mildly positive trust in chatbots overall, with a mean trust score of 3.4 out of 5. Trust was highest for logistical or administrative questions, such as appointment scheduling and insurance guidance, averaging 3.94. Preventive care topics, including vaccines and screening recommendations, received moderate trust (average 3.52). Trust dropped substantially for clinical tasks: diagnostic and treatment advice scored lowest (2.90 and 2.89, respectively).

Based on these findings, the authors suggest chatbots may be well suited to support patient communication for low-risk tasks, like administrative requests and routine chronic disease management, where consistency and timely responses can improve care coordination. However, the researchers emphasize caution when considering chatbots for more complex clinical roles. Clinicians should exercise oversight, curate chatbot-generated content, and remain mindful of the limitations and potential biases inherent to current AI models.

About this ChatGPT AI research news

Author: Oded Nov
Source: NYU
Contact: Oded Nov – NYU

Original Research: Open access. “Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study” by Oded Nov et al., published in JMIR Medical Education.

Abstract

Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study

Background: Health systems are piloting chatbots to draft responses to patient questions, but the public’s ability to distinguish chatbot replies from human responses and the level of trust people place in these systems remain unclear.

Objective: The study evaluated whether ChatGPT or similar AI chatbots could feasibly support patient-provider communication and which functions patients might trust.

Methods: In January 2023, researchers selected ten non-administrative, representative patient-provider interactions from an electronic health record. Each patient question was entered into ChatGPT with instructions to match the approximate length of the original provider reply. In a web-based survey, each question was paired with either the human provider’s reply or the ChatGPT-generated reply. Participants were informed that half the replies were from providers and half were from a chatbot, and they were financially incentivized to identify each reply’s source. Participants also rated trust in chatbot functions on a 1–5 Likert scale.

Results: After data quality controls, 392 U.S. adults remained in the analytic sample (53.3% female; mean age 47.1 years, range 18–91). Correct classification of replies ranged from 49% to 85.7% across questions. On average, chatbot responses were correctly identified 65.5% of the time and provider responses 65.1% of the time. Trust scores for chatbot functions were mildly positive overall (mean 3.4/5), but trust declined as the clinical complexity of the task increased.

Conclusions: ChatGPT’s responses were only weakly distinguishable from provider replies, and laypeople demonstrated greater willingness to rely on chatbots for lower-risk, administrative, or routine healthcare tasks. Continued research is necessary as chatbots extend into more clinical roles, and healthcare providers should maintain oversight and critical judgment where AI-generated content supports patient care.