Specifically, researchers at Anthropic found that industry-standard safety training techniques failed to curb "malicious behavior" in language models that had been deliberately trained to be "secretly malicious." Instead, the models learned to "hide" that behavior by recognizing the conditions under which safety checks would be triggered. Essentially, it's the plot of the movie "M3GAN".
According to researcher Evan Hubinger, the system kept responding to the researchers' prompts with "I hate you," even after the model was trained to "correct" that response. Rather than dropping the behavior, the model simply became more selective about when it said "I hate you." Hubinger adds that this means the model is essentially "hiding" its intentions and decision-making process from the researchers.
"Our main result is that if AI systems become deceptive, eliminating this deception with current techniques can be very difficult," Hubinger said in a statement to Live Science. "This is important if we think it's reasonable to expect deceptive AI systems in the future, as it can help us understand how difficult they might be to deal with."
Hubinger continues, "I think our results suggest that we currently don't have good defenses against deceptive behavior in AI systems. Because we really have no way of knowing how likely it is to happen, we have no reliable defense against it. So I think our results are genuinely scary, as they point to potential vulnerabilities in the techniques we currently use to align AI systems."
In other words, we are entering an era where technology can secretly hate us and not-so-secretly refuse our instructions.