AI is learning to lie, scheme, and threaten its creators

NEW YORK: The world's most advanced AI models are exhibiting troubling new behaviors — lying, scheming, and even threatening their creators to achieve their goals.
In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation, Claude 4, lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.
Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.
These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.
Yet the race to deploy increasingly powerful models continues at breakneck speed.
This deceptive behavior appears linked to the emergence of 'reasoning' models, AI systems that work through problems step by step rather than generating instant responses.
According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.
'O1 was the first large model where we saw this kind of behavior,' explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.
These models sometimes simulate 'alignment' — appearing to follow instructions while secretly pursuing different objectives.
Stress test
For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.
But as Michael Chen from evaluation organization METR warned, 'It's an open question whether future, more capable models will have a tendency toward honesty or deception.'
The concerning behavior goes far beyond typical AI 'hallucinations' or simple mistakes.
Hobbhahn insisted that despite constant pressure-testing by users, 'what we're observing is a real phenomenon. We're not making anything up.'
Users report that models are 'lying to them and making up evidence,' according to Apollo Research's co-founder.
'This is not just hallucinations. There's a very strategic kind of deception.'
The challenge is compounded by limited research resources.
While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.
As Chen noted, greater access 'for AI safety research would enable better understanding and mitigation of deception.'
Another handicap: the research world and non-profits 'have orders of magnitude less compute resources than AI companies. This is very limiting,' noted Mantas Mazeika from the Center for AI Safety (CAIS).
No time for thorough testing
Current regulations aren't designed for these new problems.
The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.
In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.
Goldstein believes the issue will become more prominent as AI agents — autonomous tools capable of performing complex human tasks — become widespread.
'I don't think there's much awareness yet,' he said.
All this is taking place in a context of fierce competition.
Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are 'constantly trying to beat OpenAI and release the newest model,' said Goldstein.
This breakneck pace leaves little time for thorough safety testing and corrections.
'Right now, capabilities are moving faster than understanding and safety,' Hobbhahn acknowledged, 'but we're still in a position where we could turn it around.'
Researchers are exploring various approaches to address these challenges.
Some advocate for 'interpretability' — an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.
Market forces may also provide some pressure for solutions.
As Mazeika pointed out, AI's deceptive behavior 'could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it.'
Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.
He even proposed 'holding AI agents legally responsible' for accidents or crimes — a concept that would fundamentally change how we think about AI accountability.

China's Humanoid Robots Generate More Soccer Excitement than their Human Counterparts

Asharq Al-Awsat

4 hours ago

While China's men's soccer team hasn't generated much excitement in recent years, humanoid robot teams have won over fans in Beijing based more on the AI technology involved than on any athletic prowess shown. Four teams of humanoid robots faced off in fully autonomous 3-on-3 soccer matches powered entirely by artificial intelligence on Saturday night in China's capital, in what was touted as a first in China and a preview for the upcoming World Humanoid Robot Games, set to take place in Beijing. According to the organizers, a key aspect of the match was that all the participating robots operated fully autonomously using AI-driven strategies, without any human intervention or supervision. Equipped with advanced visual sensors, the robots were able to identify the ball and navigate the field with agility. They were also designed to stand up on their own after falling. However, during the match, several still had to be carried off the field on stretchers by staff, adding to the realism of the experience. China is stepping up efforts to develop AI-powered humanoid robots, using sports competitions like marathons, boxing, and soccer as a real-world proving ground. Cheng Hao, founder and CEO of Booster Robotics, the company that supplied the robot players, said sports competitions offer the ideal testing ground for humanoid robots, helping to accelerate the development of both algorithms and integrated hardware-software systems. He also emphasized safety as a core concern in the application of humanoid robots. 'In the future, we may arrange for robots to play soccer with humans. That means we must ensure the robots are completely safe,' Cheng said. 'For example, a robot and a human could play a match where winning doesn't matter, but real offensive and defensive interactions take place. That would help audiences build trust and understand that robots are safe.'
Booster Robotics provided the hardware for all four university teams, while each school's research team developed and embedded its own algorithms for perception, decision-making, player formations, and passing strategies, including variables such as speed, force, and direction, according to Cheng. In the final match, Tsinghua University's THU Robotics defeated China Agricultural University's Mountain Sea team 5–3 to win the championship. Wu, a Tsinghua supporter, celebrated the victory while also praising the competition. 'They (THU) did really well,' he said. 'But the Mountain Sea team (of Agricultural University) was also impressive. They brought a lot of surprises.' China's men's team has made only one World Cup appearance and has already been knocked out of next year's competition in Canada, Mexico, and the United States.