One well-known issue with large language models (LLMs) is their propensity to generate incorrect or nonsensical outputs, commonly termed "hallucinations." While extensive research has examined these errors from the user's perspective, a recent study examines the internal mechanisms of LLMs and reveals that these models possess a deeper grasp of truthfulness than previously assumed.
The term "hallucination" lacks a universally accepted definition and encompasses various types of errors in LLMs. In this study, the researchers adopt a broad interpretation, treating hallucinations as all the errors produced by LLMs, including factual inaccuracies, biases, commonsense reasoning failures, and other real-world mistakes.
Most previous research on hallucinations has focused on analyzing the external behavior of LLMs and how users perceive these errors. These approaches, however, offer limited insight into how errors are encoded and processed inside the models.
Some researchers have explored the internal representations of LLMs and suggested that they encode signals related to truthfulness. However, previous studies primarily examined the last token generated by the model or the final token in the prompt. Given that LLMs typically produce long-form responses, this approach can miss critical details.
This new study takes a different approach. Instead of focusing solely on the final output, the researchers analyze "exact answer tokens": the tokens in the response whose modification would alter the correctness of the answer.
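To make the idea concrete, here is a minimal sketch of how such tokens could be located, assuming the gold answer appears verbatim in the model's free-form response; the tokenizer name and the string-matching logic are illustrative assumptions, not the study's actual extraction procedure.

```python
# A minimal sketch of locating "exact answer tokens" in a free-form response.
# Assumes the gold answer appears verbatim in the generated text; the study's
# actual extraction step may be more involved.
from transformers import AutoTokenizer

def exact_answer_token_positions(tokenizer, response: str, answer: str):
    """Indices of the response tokens that spell out the answer string."""
    enc = tokenizer(response, return_offsets_mapping=True, add_special_tokens=False)
    start = response.lower().find(answer.lower())
    if start == -1:
        return []  # answer not found verbatim; a real pipeline would need fuzzier matching
    end = start + len(answer)
    # Keep every token whose character span overlaps the answer span.
    return [i for i, (s, e) in enumerate(enc["offset_mapping"]) if s < end and e > start]

# Any fast (Rust-backed) tokenizer works; Mistral 7B is used here only as an example.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(exact_answer_token_positions(
    tok, response="The capital of France is Paris, of course.", answer="Paris"))
```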
The researchers conducted experiments on four versions of the Mistral 7B and Llama 2 models across ten datasets spanning tasks such as question answering, natural language inference, mathematical problem solving, and sentiment analysis. The models were allowed to generate unrestricted responses to simulate real-world usage. The results indicate that truthfulness information is concentrated in the exact answer tokens.
To predict hallucinations, the researchers trained classifier models, referred to as "detection classifiers," which use the internal activations of LLMs to predict whether a generated output is truthful. They found that training these classifiers on the activations of exact answer tokens significantly improves error detection.
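The sketch below shows what such a probe could look like: a linear classifier over the hidden state at an exact-answer token position. The model name, layer choice, and probe family (logistic regression) are illustrative assumptions rather than the study's exact recipe.

```python
# Hedged sketch of a "detection classifier": a linear probe over the hidden state
# at an exact-answer token. Model, layer, and probe family are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"   # any causal LM with accessible activations
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(          # device_map requires `accelerate`
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto", output_hidden_states=True)

@torch.no_grad()
def answer_token_activation(text: str, token_index: int, layer: int = 16) -> torch.Tensor:
    """Hidden state of one token at an intermediate layer (layer 16 of 32 here)."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs).hidden_states  # tuple: embeddings + one tensor per layer
    return hidden_states[layer][0, token_index].float().cpu()

def train_detection_probe(examples, layer: int = 16) -> LogisticRegression:
    """examples: iterable of (prompt+response text, exact-answer token index, is_correct)."""
    feats = torch.stack([answer_token_activation(t, i, layer) for t, i, _ in examples]).numpy()
    labels = [int(y) for _, _, y in examples]
    return LogisticRegression(max_iter=1000).fit(feats, labels)   # the detection classifier
```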
The researchers also examined whether detection classifiers trained on one dataset could detect errors in other datasets. They found that the classifiers do not generalize universally across tasks. Instead, truthfulness encoding appears to be "skill-specific": a classifier generalizes between tasks that require similar skills (such as fact retrieval or commonsense reasoning) but not to tasks that require different skills (such as sentiment analysis).
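Under the same assumptions as the sketch above, this cross-task check amounts to fitting the probe on one task's activations and scoring it on another's. The dataset variables below are hypothetical placeholders, not real dataset objects.

```python
# Cross-task generalization check, reusing the hypothetical helpers above.
# `fact_retrieval_train`, `fact_retrieval_test`, and `sentiment_test` are placeholder
# lists of (text, exact-answer token index, is_correct) triples.
def probe_accuracy(probe, examples, layer: int = 16) -> float:
    feats = torch.stack([answer_token_activation(t, i, layer) for t, i, _ in examples]).numpy()
    labels = [int(y) for _, _, y in examples]
    return probe.score(feats, labels)

probe = train_detection_probe(fact_retrieval_train)            # train on one skill
same_skill_acc = probe_accuracy(probe, fact_retrieval_test)    # similar skill: transfers
other_skill_acc = probe_accuracy(probe, sentiment_test)        # different skill: degrades
```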
Further experiments demonstrate that these detection classifiers can not only predict the presence of errors but also anticipate the types of mistakes the models may make. This suggests that LLM representations contain specific information regarding their potential failure modes, aiding in the development of targeted mitigation strategies.
Finally, the researchers investigated how the internal truthfulness signals encoded in LLM activations align with the models' external behavior. In some cases they found surprising inconsistencies: the model's internal activations may correctly identify the right answer, yet the model still generates an incorrect response.
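One way to surface this gap, continuing the hypothetical sketch above, is to sample several candidate answers, let the trained probe score each one, and compare the probe's top-ranked candidate with the model's default (greedy) answer.

```python
# Internal-vs-external comparison, continuing the sketch above. `candidates` holds
# sampled answers as (prompt+response text, exact-answer token index, answer string).
def probe_preferred_answer(probe, candidates, layer: int = 16) -> str:
    """Answer whose exact-answer activation the probe scores as most likely correct."""
    scores = [
        probe.predict_proba(
            answer_token_activation(t, i, layer).numpy().reshape(1, -1))[0, 1]
        for t, i, _ in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best][2]

# If the probe-preferred answer is correct while the greedy response is not, the
# internal signal "knew" the right answer that the external behavior failed to produce.
```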
This finding suggests that current evaluation methods, which rely solely on the final output of LLMs, may not accurately reflect their true capabilities. It also raises the possibility that, by better understanding and leveraging the internal knowledge of LLMs, we could unlock more of their potential and significantly reduce errors.
The study's findings contribute to the design of more effective hallucination mitigation systems. However, the methods employed require access to the internal representations of LLMs, which is largely feasible only for open-source models.
Nonetheless, these findings have broader implications for the field. Insights gained from analyzing internal activations can support more effective error detection and mitigation techniques. This work is part of a larger research effort to better understand the inner workings of LLMs and the billions of activations that occur during each reasoning step. Leading AI laboratories, including OpenAI, Anthropic, and Google DeepMind, have been investigating techniques for interpreting the internal mechanisms of language models. These collective efforts hold the promise of building more reliable systems.