Anthropic has published two papers that shed light on the inner workings of large language models. The research explores how to identify interpretable concepts and link them to the computational "circuits" that translate those concepts into language, and it examines key behaviors of Claude 3.5 Haiku, including hallucination, planning, and other crucial attributes.
The internal mechanisms behind the capabilities of large language models remain largely opaque, making it difficult to interpret or explain the strategies they use to solve problems. Those strategies are embedded in the billions of computations that underpin every word a model generates, yet, as Anthropic notes, they are mostly inscrutable. To investigate this hidden layer of reasoning, Anthropic researchers developed an approach they call the "AI Microscope":
Inspired by neuroscience, a field that has long studied the intricate internals of thinking organisms, we aimed to create an AI microscope that allows us to identify activity patterns and information flow.
Put simply, Anthropic's AI Microscope replaces the model under study with what the researchers call a replacement model, in which the model's neurons are swapped for sparsely activated features that often represent interpretable concepts. For instance, a specific feature might activate whenever the model is about to generate the capital of a state.
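To make the idea more concrete, the toy sketch below encodes a hidden activation vector into a sparse set of feature activations and reconstructs the layer's output from them. Everything here is illustrative: the dimensions, the random weights, the thresholding rule, and the `sparse_features` and `replacement_layer` helpers are invented for this example and are not Anthropic's actual implementation.

```python
import numpy as np

# Toy dimensions: a hidden layer with d_model units and n_features
# dictionary features; the weights are random and purely illustrative.
d_model, n_features = 8, 32
rng = np.random.default_rng(0)

W_enc = rng.normal(size=(d_model, n_features))  # activations -> feature space
W_dec = rng.normal(size=(n_features, d_model))  # features -> reconstructed output

def sparse_features(hidden):
    """Encode a hidden activation vector into sparsely active features.

    A ReLU plus a percentile cutoff keeps only a handful of features active,
    mirroring the idea that each feature fires for a specific, often
    interpretable concept (e.g. "about to name a state capital").
    """
    acts = np.maximum(hidden @ W_enc, 0.0)
    acts[acts < np.quantile(acts, 0.9)] = 0.0  # keep roughly the top 10%
    return acts

def replacement_layer(hidden):
    """Replace the original layer's computation with the feature pathway."""
    return sparse_features(hidden) @ W_dec

hidden = rng.normal(size=d_model)
print("active features:", np.nonzero(sparse_features(hidden))[0])  # only a few fire
```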
Naturally, the replacement model does not always produce the same outputs as the original model. To address this limitation, the researchers built a local replacement model for each prompt they wanted to study, incorporating error terms and fixed attention patterns to account for the difference.
[The local replacement model] produces exactly the same output as the original model while replacing as many computations as possible with features.
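As a rough sketch of that construction: for a given prompt, the difference between the original computation and the feature-based reconstruction is recorded as an error term and added back, so the local replacement model reproduces the original output exactly. The `original_layer` and `replacement_layer` stand-ins below are hypothetical, not Anthropic's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
hidden = rng.normal(size=d_model)  # activations at one position of one prompt

def original_layer(hidden):
    # Stand-in for the original model's computation at this layer.
    return np.tanh(hidden)

def replacement_layer(hidden):
    # Stand-in for the feature-based reconstruction, which only
    # approximates the original computation, so some error remains.
    return np.tanh(hidden) * 0.9

# Record the reconstruction error for this specific prompt...
error_term = original_layer(hidden) - replacement_layer(hidden)

def local_replacement_layer(hidden):
    """Feature pathway plus a frozen, prompt-specific error term."""
    return replacement_layer(hidden) + error_term

# ...so that, by construction, the local replacement model reproduces
# the original model's output exactly on this prompt.
assert np.allclose(local_replacement_layer(hidden), original_layer(hidden))
```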
As a final step, to describe the flow of features through the local replacement model from the initial prompt to the final output, the researchers constructed an attribution graph, which is built by pruning away all features that do not affect the output.
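Conceptually, the pruning step can be pictured as keeping only the nodes of a feature graph that have a non-negligible influence on the output, as in the toy example below. The feature names, edge weights, and threshold are made up, and Anthropic's actual attribution and pruning procedure is considerably more involved.

```python
# A toy attribution graph: nodes are features plus the final output token,
# and edge weights stand for attributed influence.
edges = {
    "feature:Texas":          {"feature:say a capital": 0.8},
    "feature:capital":        {"feature:say a capital": 0.7},
    "feature:say a capital":  {"output:'Austin'": 0.9},
    "feature:noise":          {"feature:unused": 0.2},
    "feature:unused":         {},  # never reaches the output
}

def influence_on_output(node, target="output:'Austin'", seen=None):
    """Sum of path products from `node` to the output (fine for tiny graphs)."""
    if node == target:
        return 1.0
    seen = seen or set()
    total = 0.0
    for nxt, weight in edges.get(node, {}).items():
        if nxt not in seen:
            total += weight * influence_on_output(nxt, target, seen | {node})
    return total

# Prune every feature whose influence on the output is negligible.
pruned_graph = {node: out for node, out in edges.items()
                if influence_on_output(node) > 0.05}
print(sorted(pruned_graph))  # 'feature:noise' and 'feature:unused' are gone
```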
Keep in mind, this is only a very high-level overview of Anthropic’s AI Microscope. For more details, please refer to the original papers linked above.
Using this method, Anthropic researchers uncovered some intriguing findings. Regarding multilingual capabilities, they found evidence of a shared conceptual space, a kind of universal 'language of thought', in which Claude forms concepts before rendering them in the language of the conversation.
We investigated this by asking Claude "What is the opposite of 'small'?" in different languages and found that core features related to "small" and "opposite" were activated, triggering the concept of "big," which was then translated into the language of the question.
Another fascinating discovery challenges the common belief that LLMs generate outputs "word by word without much forethought." Studying how Claude generates rhymes revealed that it actually plans ahead:
Before starting the second line, it begins to "think" about relevant words that rhyme with "grab it." Then, based on these plans, it writes a line ending with the planned word.
Anthropic researchers also explored why models sometimes fabricate information, a behavior known as hallucination. Hallucinations are, in a sense, intrinsic to how models operate, since they are designed to always produce a next-best guess, which is why models need specific anti-hallucination training to counteract the tendency. In essence, two distinct mechanisms are at play: one recognizes "known entities," while the other flags "unknown names" or "unanswerable questions." Their proper interaction is what keeps the model from hallucinating:
We demonstrated how misfiring can occur when Claude recognizes a name but knows nothing about the person. In such cases, the "known entity" feature might still activate, suppressing the default "don't know" feature—incorrectly so. Once the model decides it needs to answer, it starts fabricating: generating a plausible but unfortunately false response.
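In toy form, the interaction between the two circuits can be thought of as a gating decision like the one below. The activation values, the 0.5 threshold, and the function itself are purely illustrative and do not come from Anthropic's papers.

```python
# Toy gating logic for the interaction described above.
def answer_or_decline(known_entity_activation, has_facts_about_entity):
    dont_know_active = True                # the default "can't answer" circuit
    if known_entity_activation > 0.5:      # the name looks familiar...
        dont_know_active = False           # ...so the default refusal is suppressed
    if dont_know_active:
        return "decline: 'I don't know who that is'"
    if has_facts_about_entity:
        return "answer from known facts"
    return "answer anyway, risking a plausible but false response"

print(answer_or_decline(0.9, has_facts_about_entity=True))   # grounded answer
print(answer_or_decline(0.9, has_facts_about_entity=False))  # the misfiring case
print(answer_or_decline(0.1, has_facts_about_entity=False))  # default refusal
```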
Other interesting dimensions explored by Anthropic researchers include mental arithmetic, generating chains of thought that explain reasoning processes, multi-step reasoning, and jailbreaking techniques. Full details are available in Anthropic’s papers.
Anthropic’s AI Microscope aims to contribute to interpretability research and, ultimately, to provide a tool that helps us understand how models reason and ensure they are aligned with human values. However, this remains a preliminary effort: it captures only a small fraction of the model's total computation and applies only to short prompts of a few dozen words. InfoQ will continue to cover advances in LLM interpretability as new insights emerge.