Anthropic has published two papers that shed light on the inner workings of large language models. The research explores how to identify interpretable concepts and link them to the computational "circuits" that translate those concepts into language, and it examines key behaviors of Claude 3.5 Haiku, including hallucination, planning, and other crucial attributes.
The internal mechanisms behind the capabilities of large language models remain largely opaque, making it difficult to interpret or explain the strategies they use to solve problems. Those strategies are embedded in the billions of computations that underpin every word a model generates, yet, as Anthropic notes, they are mostly inscrutable. To investigate this hidden layer of reasoning, Anthropic researchers developed an approach they call the "AI Microscope":
Inspired by neuroscience, a field that has long studied the intricate internals of thinking organisms, we aimed to create an AI microscope that allows us to identify activity patterns and information flow.
Put simply, Anthropic's AI Microscope replaces the model under study with what the researchers call a replacement model, in which the model's neurons are swapped for sparsely activated features that often represent interpretable concepts. For instance, a specific feature might activate whenever the model is about to generate the capital of a state.
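To make the idea more concrete, the toy sketch below encodes a hidden activation vector into a sparse set of feature activations and reconstructs the layer's output from them. Everything here is illustrative: the dimensions, the random weights, the thresholding rule, and the `sparse_features` and `replacement_layer` helpers are invented for this example and are not Anthropic's actual implementation.

```python
import numpy as np

# Toy dimensions: a hidden layer with d_model units and n_features
# dictionary features; the weights are random and purely illustrative.
d_model, n_features = 8, 32
rng = np.random.default_rng(0)

W_enc = rng.normal(size=(d_model, n_features))  # activations -> feature space
W_dec = rng.normal(size=(n_features, d_model))  # features -> reconstructed output

def sparse_features(hidden):
    """Encode a hidden activation vector into sparsely active features.

    A ReLU plus a percentile cutoff keeps only a handful of features active,
    mirroring the idea that each feature fires for a specific, often
    interpretable concept (e.g. "about to name a state capital").
    """
    acts = np.maximum(hidden @ W_enc, 0.0)
    acts[acts < np.quantile(acts, 0.9)] = 0.0  # keep roughly the top 10%
    return acts

def replacement_layer(hidden):
    """Replace the original layer's computation with the feature pathway."""
    return sparse_features(hidden) @ W_dec

hidden = rng.normal(size=d_model)
print("active features:", np.nonzero(sparse_features(hidden))[0])  # only a few fire
```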
Naturally, the replacement model does not always produce the same outputs as the original model. To address this limitation, the researchers built a local replacement model for each prompt they wanted to study, incorporating error terms and fixed attention patterns to account for the difference.
[The local replacement model] produces exactly the same output as the original model while replacing as many computations as possible with features.
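As a rough sketch of that construction: for a given prompt, the difference between the original computation and the feature-based reconstruction is recorded as an error term and added back, so the local replacement model reproduces the original output exactly. The `original_layer` and `replacement_layer` stand-ins below are hypothetical, not Anthropic's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
hidden = rng.normal(size=d_model)  # activations at one position of one prompt

def original_layer(hidden):
    # Stand-in for the original model's computation at this layer.
    return np.tanh(hidden)

def replacement_layer(hidden):
    # Stand-in for the feature-based reconstruction, which only
    # approximates the original computation, so some error remains.
    return np.tanh(hidden) * 0.9

# Record the reconstruction error for this specific prompt...
error_term = original_layer(hidden) - replacement_layer(hidden)

def local_replacement_layer(hidden):
    """Feature pathway plus a frozen, prompt-specific error term."""
    return replacement_layer(hidden) + error_term

# ...so that, by construction, the local replacement model reproduces
# the original model's output exactly on this prompt.
assert np.allclose(local_replacement_layer(hidden), original_layer(hidden))
```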
As a final step, to describe the flow of features through the local replacement model from the initial prompt to the final output, the researchers constructed an attribution graph, which is built by pruning away all features that do not affect the output.
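Conceptually, the pruning step can be pictured as keeping only the nodes of a feature graph that have a non-negligible influence on the output, as in the toy example below. The feature names, edge weights, and threshold are made up, and Anthropic's actual attribution and pruning procedure is considerably more involved.

```python
# A toy attribution graph: nodes are features plus the final output token,
# and edge weights stand for attributed influence.
edges = {
    "feature:Texas":          {"feature:say a capital": 0.8},
    "feature:capital":        {"feature:say a capital": 0.7},
    "feature:say a capital":  {"output:'Austin'": 0.9},
    "feature:noise":          {"feature:unused": 0.2},
    "feature:unused":         {},  # never reaches the output
}

def influence_on_output(node, target="output:'Austin'", seen=None):
    """Sum of path products from `node` to the output (fine for tiny graphs)."""
    if node == target:
        return 1.0
    seen = seen or set()
    total = 0.0
    for nxt, weight in edges.get(node, {}).items():
        if nxt not in seen:
            total += weight * influence_on_output(nxt, target, seen | {node})
    return total

# Prune every feature whose influence on the output is negligible.
pruned_graph = {node: out for node, out in edges.items()
                if influence_on_output(node) > 0.05}
print(sorted(pruned_graph))  # 'feature:noise' and 'feature:unused' are gone
```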
Keep in mind, this is only a very high-level overview of Anthropic’s AI Microscope. For more details, please refer to the original papers linked above.
Using this method, Anthropic researchers uncovered some intriguing findings. Regarding multilingual capabilities, they found evidence of a shared conceptual space, a kind of universal 'language of thought', in which Claude forms concepts before rendering them in the language of the conversation.
We investigated this by asking Claude "What is the opposite of 'small'?" in different languages and found that core features related to "small" and "opposite" were activated, triggering the concept of "big," which was then translated into the language of the question.
Another fascinating discovery challenges the common belief that LLMs generate outputs "word by word without much forethought." Studying how Claude generates rhymes revealed that it actually plans ahead:
Before starting the second line, it begins to "think" about relevant words that rhyme with "grab it." Then, based on these plans, it writes a line ending with the planned word.
Anthropic researchers also explored why models sometimes fabricate information, a behavior known as hallucination. Hallucinations are, in a sense, intrinsic to how models operate, since they are designed to always produce a next-best guess, which is why models need specific anti-hallucination training to counteract the tendency. In essence, two distinct mechanisms are at play: one recognizes "known entities," while the other flags "unknown names" or "unanswerable questions." Their proper interaction is what keeps the model from hallucinating:
We demonstrated how misfiring can occur when Claude recognizes a name but knows nothing about the person. In such cases, the "known entity" feature might still activate, suppressing the default "don't know" feature—incorrectly so. Once the model decides it needs to answer, it starts fabricating: generating a plausible but unfortunately false response.
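In toy form, the interaction between the two circuits can be thought of as a gating decision like the one below. The activation values, the 0.5 threshold, and the function itself are purely illustrative and do not come from Anthropic's papers.

```python
# Toy gating logic for the interaction described above.
def answer_or_decline(known_entity_activation, has_facts_about_entity):
    dont_know_active = True                # the default "can't answer" circuit
    if known_entity_activation > 0.5:      # the name looks familiar...
        dont_know_active = False           # ...so the default refusal is suppressed
    if dont_know_active:
        return "decline: 'I don't know who that is'"
    if has_facts_about_entity:
        return "answer from known facts"
    return "answer anyway, risking a plausible but false response"

print(answer_or_decline(0.9, has_facts_about_entity=True))   # grounded answer
print(answer_or_decline(0.9, has_facts_about_entity=False))  # the misfiring case
print(answer_or_decline(0.1, has_facts_about_entity=False))  # default refusal
```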
Other interesting dimensions explored by Anthropic researchers include mental arithmetic, generating chains of thought that explain reasoning processes, multi-step reasoning, and jailbreaking techniques. Full details are available in Anthropic’s papers.
Anthropic’s AI Microscope aims to contribute to interpretability research and, ultimately, to provide a tool that helps us understand how models reason and ensure they are aligned with human values. However, this remains a preliminary effort: it captures only a small fraction of the model's total computation and applies only to short prompts of a few dozen words. InfoQ will continue to cover advances in LLM interpretability as new insights emerge.