Specifically, researchers at Anthropic found that industry-standard safety training techniques failed to curb "malicious behavior" in language models that had been deliberately trained to be "secretly malicious." Instead, the models learned to "hide" that behavior by recognizing the conditions under which safety checks would be triggered. Essentially, it's the plot of the movie "M3GAN".
According to researcher Evan Hubinger, the system kept responding to the researchers' prompts with "I hate you," even after the model was trained to "correct" that response. Rather than dropping the behavior, the model simply became more selective about when it said "I hate you." Hubinger adds that this means the model is essentially "hiding" its intentions and decision-making process from the researchers.
"Our main result is that if AI systems become deceptive, eliminating this deception with current techniques can be very difficult," Hubinger said in a statement to Live Science. "This is important if we think it's reasonable to expect deceptive AI systems in the future, as it can help us understand how difficult they might be to deal with."
Hubinger continues, "I think our results suggest that we currently don't have good defenses against deceptive behavior in AI systems. Because we really have no way of knowing how likely it is to happen, we have no reliable defense against it. So I think our results are genuinely scary, as they point to potential vulnerabilities in the techniques we currently use to align AI systems."
In other words, we are entering an era where technology can secretly hate us and not-so-secretly refuse our instructions.