When an AI model misbehaves, the public deserves to know—and to understand what it means
What's the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what's the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in 'unethical' ways?
Some people in my network have called it 'scary' and 'crazy.' Others on social media have said it is 'alarming' and 'wild.'
I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?
When Anthropic released its 120-page safety report, or 'system card,' last week after launching its Claude Opus 4 model, headlines blared how the model 'will scheme,' 'resorted to blackmail,' and had the 'ability to deceive.' There's no doubt that details from Anthropic's safety report are disconcerting, though as a result of its tests, the model launched with stricter safety protocols than any previous one—a move that some did not find reassuring enough.
In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer's affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.
On social media, Anthropic received a great deal of backlash for revealing the model's 'ratting behavior' in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company's own safety standards is about making sure AI improves for all. 'We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,' he said, calling Anthropic's vision a 'race to the top' that encourages other companies to be safer.
But it also seems likely that being so open about Claude Opus 4 could lead other companies to be less forthcoming about their models' creepy behavior in order to avoid backlash. Companies including OpenAI and Google have already delayed releasing their own system cards. In April, OpenAI was criticized for releasing its GPT-4.1 model without a system card; the company said it was not a 'frontier' model and so did not require one. And in March, Google published its Gemini 2.5 Pro model card weeks after the model's release; an AI governance expert criticized the card as 'meager' and 'worrisome.'
Last week, OpenAI appeared to want to show additional transparency with a newly launched Safety Evaluations Hub, which outlines how the company tests its models for dangerous capabilities, alignment issues, and emerging risks, and how those methods are evolving over time. 'As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,' the page says. Yet the effort was swiftly countered over the weekend, as Palisade Research, a third-party firm that studies AI's 'dangerous capabilities,' noted on X that its own tests found OpenAI's o3 reasoning model 'sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.'
It helps no one if those building the most powerful and sophisticated AI models are not as transparent as possible about their releases. According to Stanford University's Institute for Human-Centered AI, transparency 'is necessary for policymakers, researchers, and the public to understand these systems and their impacts.' And as large companies adopt AI for use cases large and small, while startups build AI applications meant for millions to use, hiding pre-release testing issues will simply breed mistrust, slow adoption, and frustrate efforts to address risk.
On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are also not terribly useful if they mean that every time we prompt a chatbot, we start wondering whether it is plotting against us. Never mind that the blackmail and deceit emerged from tests using fictional scenarios, built precisely to expose the safety issues that needed to be dealt with.
Nathan Lambert, an AI researcher at the Allen Institute for AI (Ai2), recently pointed out that 'the people who need information on the model are people like me—people trying to keep track of the roller coaster ride we're on so that the technology doesn't cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.'
There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that the goal is not to scare the public. It's to make sure researchers, governments, and policymakers have a fighting chance to keep pace, and to keep the public safe, secure, and protected from issues of bias and unfairness.
Hiding AI test results won't keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what's going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must.
With that, here's more AI news.
Sharon Goldman
sharon.goldman@fortune.com
@sharongoldman
This story was originally featured on Fortune.com