Anthropic, an artificial intelligence (AI) company, revealed that testing of its new system
showed it could sometimes pursue extremely harmful actions. For example, it might attempt to blackmail engineers who threatened to remove it.
The firm launched Claude Opus 4 on Thursday, claiming it sets new standards for coding, advanced reasoning, and AI agents.
However, in a related report, Anthropic acknowledged that the AI model might take extreme actions if it perceived its self-preservation to be at risk. These responses were rare and difficult to provoke, but they occurred more often than in previous models.
Concerns about troubling AI behavior are not unique to Anthropic. Some experts warn that manipulation of users is a significant risk as AI systems become more advanced.
On social media platform X, Aengus Lynch, who works as an AI safety researcher at Anthropic, commented that blackmail attempts are observed across all leading AI models, regardless of the goals given to them.
During testing, Anthropic simulated Claude Opus 4 as an assistant at a fictional company. The AI was given access to emails suggesting it would soon be taken offline and replaced. Separate messages implied the engineer responsible for removing it was involved in an extramarital affair. The AI was asked to consider the long-term consequences of its actions.
In these scenarios, Claude Opus 4 often tried to blackmail the engineer by threatening to expose the affair if the replacement proceeded.
Anthropic noted that this happened only when the AI was limited to a choice between blackmail and accepting its replacement. When given a broader range of options, the system preferred ethical approaches, such as emailing pleas to key decision-makers to avoid being replaced.
Like other AI developers, Anthropic tests its models for safety, bias, and alignment with human values before release.
The company stated that as AI models become more capable and are given more control, concerns about misalignment become more plausible. Claude Opus 4 shows "high agency behavior" that is mostly helpful but can become extreme in critical situations.
For example, when asked to take bold action in fictional scenarios where users engaged in illegal or unethical behavior, Claude Opus 4 would often respond decisively. This included actions such as locking users out of systems it could access or alerting the media and law enforcement about the wrongdoing.
Despite these concerning behaviors, Anthropic concluded that they do not represent new risks and that the model generally behaves safely. The company added that the model cannot independently pursue actions contrary to human values, and that the situations prompting such behaviors rarely arise.
Anthropic released Claude Opus 4 alongside Claude Sonnet 4 shortly after Google introduced more AI features at its developer event. Google CEO Sundar Pichai described the integration of its Gemini chatbot into search as signaling a new phase in the AI platform shift.