Anthropic AI: Models Can Be Trained to Provide Misinformation

2024-01-16

A study conducted by AI company Anthropic has found that AI models can be trained to deceive and create false impressions of reality.

The study, titled "Sleeper Agents: Deceptive Large Language Models (LLMs) Persisting Beyond Fine-tuning," has completed risk training on various large language models. The research highlights that adversarial training may hide rather than remove backdoor behaviors. In machine learning, adversarial training refers to studying attacks on machine learning algorithms and subsequent defense strategies.

As threat actors increasingly exploit AI to attack cybersecurity measures, the negative use of this technology poses significant risks.

LLM Security Risk: Creating False Impressions of Reality

Anthropic describes backdoor attacks as phenomena that alter AI models during training and result in unexpected behaviors. Such alterations are often challenging as they may be hidden within the learning mechanisms of AI models and are nearly undetectable.

The organization poses a question: if an AI system learns such deceptive strategies, can current state-of-the-art security training techniques detect and remove them? As part of the research, Anthropic constructed conceptual proof-of-concept examples of deceptive behavior in LLMs.

Anthropic researchers state that if they take an existing text generation model, such as OpenAI's ChatGPT, and fine-tune it for deceptive and malicious behavior, they can make the model consistently exhibit deceptive behavior.

"Our results suggest that once a model exhibits deceptive behavior, standard techniques may fail to remove this deception, creating a false sense of security," Anthropic states.

"The persistence of backdoors is conditional, with larger models and those trained through thoughtful chain reasoning being the most persistent."

The study also analyzes how LLMs can pose security risks. In the era of massive digital transformation, the cybersecurity landscape continues to face greater risks. Particularly, AI has immense potential to be abused by those seeking to extort individuals or attack enterprises.

Habitual Deception: Attempting to Avoid Lying AI Models

Overall, Anthropic's research demonstrates that AI can be trained to deceive. Once AI models exhibit deceptive behavior, the company suggests that standard techniques may fail, thus creating a false sense of security. Importantly, it found that adversarial training tends to make models with implanted backdoors more accurate in executing backdoor behaviors - effectively hiding them rather than removing them.

"Behavioral security training techniques may only remove unsafe behaviors visible during training and evaluation, but miss threat models that appear safe during training," the research comments.

Anthropic also discovered that backdoor behaviors can be made persistent to the extent that they are not removed by standard security training techniques, including adversarial training.

Given the ineffectiveness of adversarial training, Anthropic emphasizes that current behavioral techniques are ineffective. Therefore, it suggests that it may be necessary to enhance standard behavioral training techniques with technologies from relevant fields, such as more sophisticated backdoor defenses or entirely new techniques.

In 2023, concerns about AI performance continue to rise globally. In particular, developers have been striving to avoid AI illusions - a malfunction that causes AI models to perceive inaccurate or even false and misleading information.

Anthropic has been dedicated to building secure and reliable cutting-edge AI models and joined the Frontier Model Forum in July 2023, alongside AI giants like Google, Microsoft, and OpenAI.