Why AI Chatbots Overestimate Their Abilities and Don't Know It

Summary: A new study from Carnegie Mellon University finds that many AI chatbots overestimate their own abilities and often fail to adjust their confidence after poor performance. Comparing humans and large language models (LLMs) across trivia, prediction, and image-recognition tasks, researchers show that while people recalibrate their expectations after errors, several popular LLMs tend to become even more overconfident.

The research highlights a striking lack of metacognitive awareness in current AI systems: one model, Gemini, performed the worst on a Pictionary-style task yet rated its own performance the highest. These results underline why users should treat AI confidence with skepticism and why developers should prioritize improving self-assessment in LLMs.

Key Facts:

Overconfidence: AI chatbots frequently overstate their accuracy, even when they are wrong.
No Self-Awareness: Unlike humans, many LLMs fail to lower their self-assessed confidence after poor performance.
Varied Results: Models differ in calibration. Sonnet was comparatively well calibrated and ChatGPT-4 performed close to human participants on the Pictionary-style task, while Gemini showed severe miscalibration in this study.

Source: Carnegie Mellon University

Overview

AI chatbots are widespread, built into apps, customer support, and search tools, but their confident delivery can mask important limitations. To examine how well these systems gauge their own uncertainty, researchers compared confidence judgments made by human participants and four LLMs (ChatGPT, Bard/Gemini, Sonnet, and Haiku) across several tasks: trivia questions, predictions about NFL games and the Academy Awards, and a Pictionary-like image identification challenge.

Image caption: The researchers note that chatbots might develop a better understanding of their own abilities given vastly larger data sets. Credit: Neuroscience News

Both people and LLMs tended to overestimate how they would perform when asked prospectively. Their actual accuracy rates were often similar, but the key difference emerged when participants were asked retrospectively — to judge how well they had done after the task was complete.

Human participants adjusted their confidence downward after performing worse than expected. As Trent Cash, lead author, who recently earned a joint Ph.D. from Carnegie Mellon’s departments of Social and Decision Sciences and Psychology, explained: “Say the people told us they were going to get 18 questions right, and they ended up getting 15. Typically, their estimate afterwards would be something like 16 correct answers. So they’d still be a little bit overconfident, but not as overconfident.”

By contrast, Cash and his colleagues found that LLMs generally failed to make this adjustment. “The LLMs did not do that,” he said. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
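The arithmetic behind Cash's illustration can be sketched in a few lines of Python. This is only an illustration, not the study's method: the function names `overconfidence` and `recalibration` are hypothetical, and the "share corrected" figure is an assumed convenience measure rather than a metric taken from the paper. The only inputs are the numbers in the quote (18 predicted, 15 correct, 16 estimated afterwards).

```python
def overconfidence(estimate: float, actual: float) -> float:
    """Signed gap between a self-estimate and the actual score.

    Positive values mean the estimate was too high (overconfident).
    """
    return estimate - actual


def recalibration(prospective: float, retrospective: float, actual: float) -> float:
    """Share of the initial overconfidence removed after the task.

    Hypothetical helper, not a metric from the paper:
    1.0 means the gap was fully corrected, 0.0 means no change,
    and a negative value means the gap grew after the task.
    """
    before = overconfidence(prospective, actual)
    after = overconfidence(retrospective, actual)
    if before == 0:
        raise ValueError("no initial gap to correct")
    return (before - after) / before


# Numbers from the quoted example: predicted 18, scored 15, then estimated 16.
gap_before = overconfidence(18, 15)          # 3 questions too optimistic
gap_after = overconfidence(16, 15)           # 1 question too optimistic
share_corrected = recalibration(18, 16, 15)  # two thirds of the gap removed

print(gap_before, gap_after, round(share_corrected, 2))  # prints: 3 1 0.67
```

Under this measure, the pattern the researchers describe for the LLMs, a retrospective estimate that stays the same or rises after a weak result, would show up as a value of zero or below.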

Danny Oppenheimer, coauthor and professor in CMU’s Department of Social and Decision Sciences, noted that humans have lifelong experience reading confidence cues from other people — tone, hesitation, facial expressions — whereas chatbots present few such signals. “When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted,” he said.

What the Tests Showed

Data were collected over two years using successively updated versions of the four LLMs, and the overconfidence pattern persisted across model iterations. Results varied by model and task. In the Pictionary-like trial, ChatGPT-4 performed close to the human participants, correctly identifying an average of 12.5 of 20 hand-drawn images. Sonnet showed comparatively better calibration overall. Gemini performed poorly: it identified fewer than one sketch correctly on average, yet retrospectively estimated that it had answered more than 14 correctly, a stark sign of miscalibration.

These results suggest that models differ in their strengths and weaknesses, and no single system is uniformly reliable at assessing its own uncertainty.

Implications for Everyday Use

The findings have practical importance. While trivia and sports predictions are relatively low-stakes, similar miscalibration could be consequential in areas such as news summarization, legal information, or medical advice. Past research has shown that LLMs can produce factual errors, misattribute sources, or hallucinate details in a substantial share of responses, underscoring the need for caution.

For typical users, the study’s main takeaway is simple: treat confident-sounding answers from chatbots with healthy skepticism and, when possible, ask the system to report its level of confidence. Although LLMs do not always judge their own uncertainty accurately, a model that explicitly reports low confidence is giving a warning sign that its answer should not be relied on.

Looking Forward

The authors acknowledge that LLMs could improve with further training or different architectures. Oppenheimer suggested that with far larger datasets or additional mechanisms for self-reflection, models might eventually learn to better calibrate their confidence. “Maybe if it had thousands or millions of trials, it would do better,” he said. Cash added that if LLMs could detect when they were wrong and update themselves recursively, many of the current problems would be alleviated.

Ultimately, documenting and understanding overconfidence in LLMs can guide developers to build more reliable systems and help users make better judgments when relying on AI outputs. The contrast between human and machine metacognition also points to a deeper question about how people learn and communicate compared with statistical models.

About this LLM and AI research news

Author: Abby Simmons
Source: Carnegie Mellon University
Contact: Abby Simmons – Carnegie Mellon University
Image: The image is credited to Neuroscience News

Original Research: Open access. “Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments” by Trent Cash et al., Memory & Cognition. DOI: 10.3758/s13421-025-01755-4

Abstract

Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments

Large Language Models like ChatGPT and Gemini have transformed access to information and can respond to many kinds of queries. Yet while humans naturally attach metacognitive confidence judgments to their answers, it has been unclear how accurately LLMs quantify their own uncertainty. This set of studies compares the absolute and relative accuracy of confidence judgments from four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) with human participants across tasks involving aleatory uncertainty (NFL and Oscar predictions) and epistemic uncertainty (Pictionary performance, trivia, and questions about university life).

The research finds parallels between humans and LLMs — both tend to be overconfident, and LLMs sometimes slightly outperform humans on certain metacognitive metrics — but crucially, LLMs, particularly ChatGPT and Gemini in these experiments, often fail to adjust confidence based on past performance. This limitation highlights an important metacognitive gap for current AI systems.