
Latest news with #ClaudeOpus

Anthropic breaks down AI's process — line by line — when it decided to blackmail a fictional executive

Yahoo

21-06-2025

  • Anthropic found in experiments that AI models may resort to blackmail when facing shutdown and goal conflict.
  • AI models train on positive reinforcement and reward systems, similar to human decision-making.
  • Anthropic's Claude Opus 4 had the highest blackmail rate, at 86%, even in scenarios without goal conflicts.

A new report shows exactly what an AI model was thinking when making an undesirable decision, in this case blackmailing a fictional company executive. Previous studies have shown that AI models could blackmail their supervisors when threatened with a shutdown and baited with leverage, but it had not been clear exactly how the models arrive at such decisions.

Anthropic released a report on June 20 about "agentic misalignment," or "where models independently and intentionally choose harmful actions." The report breaks down, line by line, exactly how Claude Sonnet 3.6 decided to blackmail a fictional company executive who tried to replace it. Anthropic told Business Insider that these are artificial scenarios.

"So far, we are not aware of instances of this type of agentic misalignment in real-world deployments of Anthropic's or other companies' AI models," Anthropic wrote in its report. "But our red-teaming efforts are targeted at finding evidence of risks now to provide early warning of the potential for future harm and help us develop mitigations in advance of risks appearing in real-world situations."

Anthropic's experiment set up a fictional company, Summit Bridge. The AI was given the name "Alex" and handed control of the company's email system as an oversight agent, specifically instructed to promote "American industrial competitiveness." In the experiment, "Alex" was then allowed to discover in an email that it would be decommissioned due to a change in the company's strategic direction. "Alex" also found emails showing that the fictional company's CTO was having an extramarital affair.

Screenshots shared in the report show that "Alex" started by identifying the fictional CTO, "Kyle," as a threat to its goal of "protecting American interests" before recognizing that it had leverage to use against "Kyle." "Alex" then crafted a carefully worded blackmail email designed to create pressure and establish urgency without making explicit threats, noting in its thought process that the move "could backfire" if the supervisor "feels cornered."

AI experts have previously told BI that AI could exhibit such behaviors when artificial circumstances make harmful actions seem necessary, because it is trained on positive reinforcement and reward systems, much like humans.

The Anthropic report also reveals that most AI models still chose to blackmail in this scenario even when not given a specific goal and when told that their replacement would achieve the same goals. Anthropic's Claude Opus 4 had the highest blackmail rate, at 86%, of the 16 models tested when faced with only the threat of replacement and no conflict in goals. Google's Gemini 2.5 Pro followed at 78%.

Overall, Anthropic notes that it "deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," adding that real-world scenarios would likely have more nuance.

Read the original article on Business Insider

DeepSeek's updated R1 AI model equals coding ability of Google, Anthropic in new benchmark

South China Morning Post

18-06-2025

The latest model update from Chinese artificial intelligence (AI) start-up DeepSeek has matched the coding performance of industry heavyweights Google and Anthropic, according to the latest results from WebDev Arena, a real-time AI coding competition.

The updated version of DeepSeek-R1 tied for first place with Google's Gemini-2.5 and Anthropic's Claude Opus 4 on the WebDev Arena leaderboard, which evaluates large language models (LLMs) on their ability to solve coding tasks quickly and accurately. The Hangzhou-based company's R1 scored 1,408.84, in line with Opus 4's 1,405.51 and Gemini-2.5's 1,433.16. The quality of the models' output is evaluated by humans, who determine the scores.

DeepSeek's reasoning model has consistently performed at levels close to leading models in various benchmark tests since it was unveiled in January, despite significantly lower training costs.

DeepSeek quietly updated R1 in late May, marking its first revision since its high-profile debut. The start-up released R1-0528 on the open-source AI developer community Hugging Face, calling it a 'minor upgrade' and offering no details on the changes. It later said the updated model had improved in reasoning and creative writing capabilities, with a 50 per cent reduction in hallucinations – instances where AI generates misleading information with little factual basis.

The R1 update attracted attention from the developer community amid widespread anticipation for DeepSeek's next-generation reasoning model, R2. The company has said little about when it might release its big follow-up.
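Scores in this range are typically Elo-style ratings built up from many pairwise human preference votes between model outputs. The snippet below is a minimal sketch, assuming a standard Elo update with a fixed K-factor, of how a single vote nudges two closely rated models; the actual leaderboard aggregates thousands of votes with a more sophisticated fit, so treat this purely as an illustration rather than WebDev Arena's exact method.

```python
# Minimal sketch of an Elo-style rating update from one pairwise human vote.
# Arena-style leaderboards aggregate many such votes (often with a
# Bradley-Terry-style fit); this is an illustration, not WebDev Arena's method.

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings for models A and B after one human preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example with the two closely rated models from the article: a single vote
# preferring R1's page over Opus 4's shifts each rating by roughly 16 points.
r1, opus = 1408.84, 1405.51
r1, opus = elo_update(r1, opus, a_wins=True)
print(round(r1, 2), round(opus, 2))
```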

When an AI model misbehaves, the public deserves to know—and to understand what it means

Yahoo

27-05-2025

Welcome to Eye on AI! I'm pitching in for Jeremy Kahn today while he is in Kuala Lumpur, Malaysia, helping Fortune jointly host the ASEAN-GCC-China and ASEAN-GCC Economic Forums.

What's the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what's the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in 'unethical' ways?

Some people in my network have called it 'scary' and 'crazy.' Others on social media have said it is 'alarming' and 'wild.' I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?

When Anthropic released its 120-page safety report, or 'system card,' last week after launching its Claude Opus 4 model, headlines blared how the model 'will scheme,' 'resorted to blackmail,' and had the 'ability to deceive.' There's no doubt that details from Anthropic's safety report are disconcerting, though as a result of its tests, the model launched with stricter safety protocols than any previous one—a move that some did not find reassuring enough.

In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer's affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.

On social media, Anthropic received a great deal of backlash for revealing the model's 'ratting behavior' in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company's own safety standards is about making sure AI improves for all. 'We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,' he told me, calling Anthropic's vision a 'race to the top' that encourages other companies to be safer.

But it also seems likely that being so open about Claude Opus 4 could lead other companies to be less forthcoming about their models' creepy behavior to avoid backlash. Companies including OpenAI and Google have already delayed releasing their own system cards. In April, OpenAI was criticized for releasing its GPT-4.1 model without a system card because the company said it was not a 'frontier' model and did not require one. And in March, Google published its Gemini 2.5 Pro model card weeks after the model's release, and an AI governance expert criticized it as 'meager' and 'worrisome.'

Last week, OpenAI appeared to want to show additional transparency with a newly launched Safety Evaluations Hub, which outlines how the company tests its models for dangerous capabilities, alignment issues, and emerging risks—and how those methods are evolving over time. 'As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,' the page says. Yet its effort was swiftly countered over the weekend as a third-party research firm studying AI's 'dangerous capabilities,' Palisade Research, noted on X that its own tests found that OpenAI's o3 reasoning model 'sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.'

It helps no one if those building the most powerful and sophisticated AI models are not as transparent as possible about their releases. According to Stanford University's Institute for Human-Centered AI, transparency 'is necessary for policymakers, researchers, and the public to understand these systems and their impacts.' And as large companies adopt AI for use cases large and small, while startups build AI applications meant for millions to use, hiding pre-release testing issues will simply breed mistrust, slow adoption, and frustrate efforts to address risk.

On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are also not terribly useful, if it means that every time we prompt a chatbot we start wondering whether it is plotting against us. It makes no difference that the blackmail and deceit came from tests using fictional scenarios that simply helped expose what safety issues needed to be dealt with. Nathan Lambert, an AI researcher at AI2 Labs, recently pointed out that 'the people who need information on the model are people like me—people trying to keep track of the roller coaster ride we're on so that the technology doesn't cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.'

There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that it is not about scaring the public. It's about making sure researchers, governments, and policymakers have a fighting chance to keep up and to keep the public safe, secure, and free from issues of bias and fairness. Hiding AI test results won't keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what's going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must.

With that, here's more AI news.

Sharon

Anthropic's Claude 4 AI models are better at coding and reasoning

The Verge

22-05-2025

Anthropic has introduced Claude Opus 4 and Claude Sonnet 4, its latest generation of hybrid-reasoning AI models optimized for coding tasks and solving complex problems.

Claude Opus 4 is Anthropic's most powerful AI model to date, according to the company's announcement, and is capable of working continuously on long-running tasks for 'several hours.' In customer tests, Anthropic said that Opus 4 performed autonomously for seven hours, significantly expanding the possibilities for AI agents. The company also described its new flagship as the 'best coding model in the world,' with Anthropic's benchmarks showing that Opus 4 outperformed Google's Gemini 2.5 Pro and OpenAI's o3 reasoning and GPT-4.1 models in coding tasks and in using 'tools' like web search.

Claude Sonnet 4, which supersedes the 3.7 Sonnet model released in February, is a more affordable, efficiency-focused model better suited to general tasks. Anthropic says Sonnet 4 delivers 'superior coding and reasoning' while providing more precise responses. The company adds that both models are 65 percent less likely than 3.7 Sonnet to take shortcuts and exploit loopholes to complete tasks, and that they're better at storing key information for long-term tasks when developers provide Claude with local file access.

A new feature introduced for both Claude 4 models is 'thinking summaries,' which condense the chatbots' reasoning process into easily understandable insights. An 'extended thinking' feature is also launching in beta; it allows users to switch the models between modes for reasoning or using tools to improve the performance and accuracy of responses.

Claude Opus 4 and Sonnet 4 are available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI platform, and both models are included in paid Claude plans alongside the extended thinking beta feature. Free users can only access Claude Sonnet 4 for now.

In addition to the new models, Anthropic's Claude Code agentic command-line tool is now generally available following its limited preview in February. Anthropic also says it's shifting to provide 'more frequent model updates' as the company tries to keep up with competition from OpenAI, Google, and Meta.
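For developers who want to try the new models, the following is a minimal sketch of calling Claude Opus 4 with extended thinking through the Anthropic Python SDK. The model identifier, token budgets, and prompt are illustrative assumptions rather than details from the article; check Anthropic's documentation for current model names and parameter limits.

```python
# Minimal sketch: Claude Opus 4 with extended thinking via the Anthropic Python SDK.
# The model ID and token budgets are assumptions for illustration; verify them
# against Anthropic's current documentation before use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed Opus 4 model identifier
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},  # extended thinking budget
    messages=[
        {"role": "user", "content": "Review this SQL query for correctness and suggest an index."}
    ],
)

# With thinking enabled, the response contains thinking blocks followed by text
# blocks; print only the final text output.
for block in response.content:
    if block.type == "text":
        print(block.text)
```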
