How to Boost AI Answer Accuracy and Confidence

Summary: Researchers from the Japan Advanced Institute of Science and Technology have introduced Answer-prefix Generation (ANSPRE), a prompting technique designed to improve large language models (LLMs) for open-domain question answering (ODQA). ANSPRE helps models produce concise answer phrases and delivers better-calibrated confidence scores, addressing critical needs for reliable automated answers in high-stakes domains such as healthcare, law, and finance.

By appending a targeted “answer prefix” to the input prompt, ANSPRE directs an LLM to generate the exact answer phrase rather than a long, contextual response. Evaluations on multiple ODQA benchmarks show that ANSPRE substantially improves answer quality and the usefulness of model-generated confidence estimates, making LLMs more practical for real-world applications.

Key facts:

ANSPRE guides LLMs to produce concise, exact answer phrases and more reliable confidence scores.
The method uses a crafted answer prefix added to the prompt to steer generation toward the target phrase.
ANSPRE improves performance across pre-trained and instruction-tuned LLMs and is relevant to sensitive fields like healthcare, law, and education.

Source: Japan Advanced Institute of Science and Technology

Background: Large language models use massive datasets and causal language modeling to generate fluent text, and they have shown significant promise for open-domain question answering. Despite strong capabilities, LLMs often rely on static pre-training knowledge, which can become outdated. To reduce hallucinations and keep answers factually current, many systems use Retrieval-Augmented Generation (RAG), where a retriever supplies relevant documents from an external knowledge base before the model generates an answer.

Image caption: Another important aspect of LLMs is their ability to produce confidence scores, which reflect how certain the model is about the correctness of its answer. Credit: Neuroscience News

Even with retrieval, LLMs frequently return long answers that mix the precise answer phrase with explanatory context. This can make it harder for users to identify the exact factual response and complicates automated evaluation. Equally important is the model’s confidence estimate: although LLMs can assign probabilities to generated sequences, these probabilities are often poorly calibrated and may not reflect the true likelihood that an answer is correct. Poorly calibrated confidence scores limit trust and prevent safe deployment in risk-sensitive settings.
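One common way to quantify how well confidence scores match correctness is expected calibration error (ECE). The article does not say which calibration metric the researchers used, so the sketch below is only an illustration of the general idea: the function name, the equal-width binning, and the sample numbers are assumptions, not details from the study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    Illustrative metric only; the article does not name the measure used.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# A model that reports 90% confidence but is right only 1 time in 4
# is poorly calibrated, which shows up as a large gap.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
print(round(overconfident, 2))
```

A score near zero would mean that stated confidence tracks actual accuracy closely; larger values indicate over- or under-confidence.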

To address these challenges, the research team led by Professor Nguyen Le Minh, with doctoral students Nguyen-Khang Le and Dieu-Hien Nguyen, developed Answer-prefix Generation (ANSPRE). ANSPRE is a lightweight prompting method that can be applied to any LLM architecture and integrated with retrieval components.

The core idea is simple: append an answer-leading phrase, called the answer prefix, to the end of the prompt so that the model is encouraged to fill in a short, exact answer phrase. For example, given the question “What gambling game, requiring two coins to play, was popular in World War I?” an answer prefix could read: “The gambling game requiring two coins to play that was popular in World War I was ___.” Because many LLMs are trained using causal language modeling, which predicts the next tokens from the preceding text, this formulation nudges the model to output just the missing phrase in place of the blank.
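The prompt layout described above can be sketched in a few lines. This is a minimal illustration, not the researchers' template: the function name, the "Document"/"Question"/"Answer" labels, and the sample passage are all assumptions made for the example.

```python
def build_anspre_prompt(question: str, documents: list[str], answer_prefix: str) -> str:
    """Join retrieved documents, the question, and the answer prefix into one prompt."""
    # Drop the trailing blank marker ("___.") so the prompt ends exactly
    # where the answer phrase should begin.
    open_prefix = answer_prefix.rstrip(" ._")
    parts = [f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)]
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {open_prefix}")
    return "\n".join(parts)


prompt = build_anspre_prompt(
    question="What gambling game, requiring two coins to play, was popular in World War I?",
    # Illustrative passage standing in for a retrieved document.
    documents=["Two-up is a traditional Australian gambling game played with two coins."],
    answer_prefix=(
        "The gambling game requiring two coins to play that was popular "
        "in World War I was ___."
    ),
)
print(prompt)
```

Because the prompt stops right after "was", a causal language model continuing the text is likely to produce the answer phrase first rather than a long explanation.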

ANSPRE generates an answer prefix for each question using a small set of few-shot examples. The researchers found that only a handful of carefully chosen examples were sufficient to produce effective answer prefixes. After generating the prefix, ANSPRE uses an existing retriever to fetch candidate documents from the knowledge base. The question, retrieved documents, and answer prefix are combined into a single prompt that the LLM uses to generate a concise answer phrase.
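The prefix-generation step can be pictured as an ordinary few-shot prompt. The sketch below is an assumption-laden illustration: the instruction wording, the three example pairs, and the function name are invented for this example and are not taken from the study.

```python
# Hypothetical few-shot pairs: (question, answer prefix that the answer completes).
FEW_SHOT_EXAMPLES = [
    ("Who wrote the novel Frankenstein?", "The novel Frankenstein was written by"),
    ("In which year did the Berlin Wall fall?", "The Berlin Wall fell in"),
    ("What is the capital of Australia?", "The capital of Australia is"),
]


def build_prefix_generation_prompt(question, examples=FEW_SHOT_EXAMPLES):
    """Build a few-shot prompt asking an LLM to write an answer prefix."""
    lines = [
        "Rewrite each question as the start of a sentence that the answer would complete.",
        "",
    ]
    for example_question, example_prefix in examples:
        lines.append(f"Question: {example_question}")
        lines.append(f"Answer prefix: {example_prefix}")
        lines.append("")
    # The new question goes last, leaving the prefix for the model to write.
    lines.append(f"Question: {question}")
    lines.append("Answer prefix:")
    return "\n".join(lines)


print(build_prefix_generation_prompt(
    "What gambling game, requiring two coins to play, was popular in World War I?"
))
```

The text the model writes after the final "Answer prefix:" line would then be used as the answer prefix in the question-answering prompt.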

To produce a final response, ANSPRE aggregates answer phrases and computes confidence scores across multiple retrieved documents. This aggregation yields a ranked answer list and a more reliable confidence estimate that better correlates with correctness than raw sequence probabilities.
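The article does not give the aggregation formula, so the following is only one plausible scheme, assumed for illustration: each retrieved document contributes one answer phrase with a probability, probabilities for matching phrases are summed, and the totals are normalized into confidence scores. The function name and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_answers(candidates):
    """Merge per-document (answer, probability) pairs into a ranked list.

    Assumed scheme: sum the probabilities of matching answers, then
    normalize so the confidences add up to 1.
    """
    totals = defaultdict(float)
    for answer, probability in candidates:
        # Treat answers that differ only in case or spacing as the same phrase.
        totals[answer.strip().lower()] += probability

    mass = sum(totals.values())
    if mass == 0:
        return []
    ranked = [(answer, score / mass) for answer, score in totals.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# One candidate answer per retrieved document (made-up numbers).
per_document = [("Two-up", 0.6), ("two-up", 0.3), ("Poker", 0.1)]
for answer, confidence in aggregate_answers(per_document):
    print(f"{answer}: {confidence:.2f}")
```

Under this assumed scheme, an answer supported by several documents accumulates more weight than one that appears only once, which is the intuition behind pooling evidence across retrieved documents.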

The team extended ANSPRE into Self-Reflective Answer-Prefix Generation (SELF-ANSPRE) by combining it with Self-Reflective RAG (SELF-RAG). SELF-RAG adds reflection tokens and retrieval decisions to the generation process; SELF-ANSPRE merges reflection-derived scores with ANSPRE’s confidence estimates to further improve retrieval ranking and answer selection.
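How the two scores are merged is not specified in the article. As a rough illustration only, the sketch below assumes a simple weighted average of an ANSPRE-style confidence and a reflection-derived score; the weighting rule, the function names, and the sample numbers are all hypothetical.

```python
def combine_scores(anspre_confidence, reflection_score, weight=0.5):
    """Blend two scores in [0, 1] with a weighted average (assumed rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * anspre_confidence + (1.0 - weight) * reflection_score


def rank_candidates(candidates, weight=0.5):
    """Sort (answer, anspre_confidence, reflection_score) tuples by blended score."""
    scored = [
        (answer, combine_scores(confidence, reflection, weight))
        for answer, confidence, reflection in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Made-up scores: the second answer has higher raw confidence, but the
# reflection score favors the first one.
candidates = [("two-up", 0.7, 0.9), ("poker", 0.8, 0.2)]
for answer, score in rank_candidates(candidates):
    print(f"{answer}: {score:.2f}")
```

The point of the example is only that a second, independent signal can change which answer ranks first; the actual combination used in SELF-ANSPRE may differ.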

Experimental results across three ODQA benchmarks and multiple LLM architectures showed that ANSPRE consistently improves answer precision and the calibration of confidence scores. SELF-ANSPRE further enhanced the performance of SELF-RAG, demonstrating the value of combining concise answer prompting with reflective retrieval strategies. The researchers’ analysis also clarified the contribution of each ANSPRE component to the overall gains.

Professor Nguyen notes that ANSPRE’s ability to produce concise, accurate answers with trustworthy confidence scores could expand the safe use of LLMs in medical diagnosis support, legal research, educational tutoring, and customer support. By improving the transparency and reliability of automated answers, the method may also help increase user trust and foster more effective collaboration between humans and AI.

Overall, ANSPRE offers a practical and adaptable approach to improving LLM output quality and confidence estimation in open-domain question answering, marking a meaningful step toward deploying LLMs in sensitive and real-world applications.

About this LLM and AI research news

Author: Nguyen Le Minh
Source: Japan Advanced Institute of Science and Technology
Contact: Nguyen Le Minh – Japan Advanced Institute of Science and Technology
Image: The image is credited to Neuroscience News

Original research: The findings will be presented at ECAI-2024, the 27th European Conference on Artificial Intelligence, held on October 19–24.