Crowdsourced AI benchmarks have serious flaws, some experts say
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models' capabilities. When a model scores favorably, the lab behind it will often tout that score as evidence of a meaningful improvement.
It's a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book "The AI Con." Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer.
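Mechanically, Chatbot Arena aggregates those head-to-head votes into a leaderboard using an Elo-style (Bradley-Terry) rating, in which each preference nudges the winner's score up and the loser's down. Here is a minimal sketch of that aggregation; the model names, starting score, and K-factor are illustrative, not Chatbot Arena's actual parameters:

```python
# Minimal Elo-style aggregation of pairwise preference votes, the kind
# of scoring leaderboards like Chatbot Arena use. Parameters are
# illustrative only.
from collections import defaultdict

K = 32  # update step size; real leaderboards tune this differently

def expected_score(r_a: float, r_b: float) -> float:
    """Win probability for the first model under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Shift ratings toward the outcome the voter preferred."""
    expected = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - expected)
    ratings[loser] -= K * (1.0 - expected)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same score

# Each vote names the model whose anonymous response the volunteer preferred.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Bender's objection sits upstream of this arithmetic: however sound the aggregation, the resulting ranking only means something if the votes themselves measure a well-defined construct.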
"To be valid, a benchmark needs to measure something specific, and it needs to have construct validity — that is, there has to be evidence that the construct of interest is well-defined and that the measurements actually relate to the construct," Bender said. "Chatbot Arena hasn't shown that voting for one output over another actually correlates with preferences, however they may be defined."
Asmelash Teka Hadgu, the co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, said that he thinks benchmarks like Chatbot Arena are being "co-opted" by AI labs to "promote exaggerated claims." Hadgu pointed to a recent controversy involving Meta's Llama 4 Maverick model. Meta fine-tuned a version of Maverick to score well on Chatbot Arena, only to withhold that model in favor of releasing a worse-performing version.
"Benchmarks should be dynamic rather than static data sets," Hadgu said, "distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and other fields done by practicing professionals who use these [models] for work."
Hadgu and Kristine Gloria, who formerly led the Aspen Institute's Emergent and Intelligent Technologies Initiative, also made the case that model evaluators should be compensated for their work. Gloria said that AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)
"In general, the crowdsourced benchmarking process is valuable and reminds me of citizen science initiatives," Gloria said. "Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of data. But benchmarks should never be the only metric for evaluation. With the industry and the innovation moving quickly, benchmarks can rapidly become unreliable."
Matt Frederikson, the CEO of Gray Swan AI, which runs crowdsourced red teaming campaigns for models, said that volunteers are drawn to Gray Swan's platform for a range of reasons, including "learning and practicing new skills." (Gray Swan also awards cash prizes for some tests.) Still, he acknowledged that public benchmarks "aren't a substitute" for "paid private" evaluations.
"[D]evelopers also need to rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise," Frederikson said. "It's important for both model developers and benchmark creators, crowdsourced or otherwise, to communicate results clearly to those who follow, and be responsive when they are called into question."
Alex Atallah, the CEO of model marketplace OpenRouter, which recently partnered with OpenAI to grant users early access to OpenAI's GPT-4.1 models, said open testing and benchmarking of models alone "isn't sufficient." So did Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LMArena, which maintains Chatbot Arena.
"We certainly support the use of other tests," Chiang said. "Our goal is to create a trustworthy, open space that measures our community's preferences about different AI models."
Chiang said that incidents such as the Maverick benchmark discrepancy aren't the result of a flaw in Chatbot Arena's design, but rather labs misinterpreting its policy. LMArena has taken steps to prevent future discrepancies from occurring, Chiang said, including updating its policies to "reinforce our commitment to fair, reproducible evaluations."
"Our community isn't here as volunteers or model testers," Chiang said. "People use LMArena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community's voice, we welcome it being shared."
Related Articles


Forbes
Tea App Breach Reveals Why Web2 Can't Protect Sensitive Data
Web2 failure exposes Tea App users' sensitive data. A dating app built to empower women and marginalized genders has now put them at risk.

Tea, the viral safety-focused app that lets users anonymously review men they have dated, has suffered a major data breach. Sensitive user data including photos, government IDs, and chat logs was exposed and later shared on the message board 4chan. According to 404 Media, the breach was caused by a misconfigured Firebase database, a centralized backend platform maintained by Google. The leaked data included full names, selfies, driver's licenses, and sensitive messages from within the app. Many of these files were uploaded during identity verification and were never intended to be public.

Tea confirmed the breach and said the data came from a two-year-old version of the app, though it's unclear whether users were ever warned of this risk during sign-up. For many users, that explanation offers little comfort. Trust was broken, and it was trust the platform had sold as its core value.

What is Tea?

Tea launched in 2023 and quickly gained attention for its bold concept. The app allows women, nonbinary people, and femmes to post anonymous reviews of men they have dated. These posts can include green-flag or red-flag labels along with identifying details like first names, age, city, and photo. It also offered tools like reverse image searches, background checks, and AI-powered features such as "Catfish Finder." For a monthly subscription fee, users could unlock deeper insights.

The app pledged to donate a portion of profits to the National Domestic Violence Hotline, branding itself as a safer space for navigating modern dating. In July 2025, Tea reached the top of the Apple App Store. But beneath the growth was a fragile architecture.

A Breach That Breaks the Tea Mission

The Tea breach is not just a case of leaked data; it is a collapse of purpose. A platform built for safety exposed the very identities it was meant to protect. Legal IDs. Facial recognition data. Personal messages. Tea marketed itself as a safe space where people could share vulnerable experiences without fear of retaliation. That trust was supposed to be a feature, not a liability. By exposing the identities of people who likely signed up under the promise of anonymity, the breach reversed the app's core mission.

It also reignited debate around the ethics of crowdsourced review platforms. While Tea's users may have had the best intentions, the lack of formal moderation or fact-checking raises significant legal concerns. Reports already suggest the company receives multiple legal threats each day related to defamation or misuse. With the breach, the legal stakes have escalated, and they may soon extend into privacy litigation, depending on where impacted users reside.

Tea and Web2's Fragility

At the heart of this failure is a familiar problem in consumer tech: reliance on Web2 infrastructure. Firebase, while powerful and scalable, is a centralized backend system; when a problem occurs, users have no control over what is exposed or how quickly it is contained. This was the foundation Tea chose, despite the known risks of centralized data storage. Web2 models store user data in app-controlled databases. That may work for e-commerce or gaming, but with private messages and government-issued IDs, the risks multiply.
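Open Firebase instances are a well-documented failure mode: if a database's security rules permit public reads, its entire contents can be fetched over Firebase's REST API with no credentials at all. Below is a minimal sketch of the kind of probe security researchers run, using a placeholder project URL; whether Tea's rules failed in exactly this way has not been detailed publicly:

```python
# Probe whether a Firebase Realtime Database allows unauthenticated reads.
# A database whose rules allow public reads will return its contents to
# anyone over plain REST. The URL below is a placeholder, not Tea's backend.
import json
import urllib.error
import urllib.request

def is_publicly_readable(db_url: str) -> bool:
    """Try an unauthenticated read of the database root via the REST API."""
    try:
        with urllib.request.urlopen(f"{db_url}/.json?shallow=true", timeout=10) as resp:
            json.load(resp)  # only parses if the rules allowed the read
        return True
    except urllib.error.URLError:  # 401/403 "Permission denied", or network error
        return False

print(is_publicly_readable("https://example-project-default-rtdb.firebaseio.com"))
```

A single permissive rule (".read": true at the root) is enough to expose everything beneath it.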
Once exposed, that kind of information is almost impossible to fully retrieve or erase; it disappears into the vastness of cyberspace.

The Tea incident echoes previous Web2 failures. In 2015, the Ashley Madison breach exposed the names and email addresses of users on a platform designed for private affairs. The consequences ranged from public shaming to blackmail. While the scale was different, the pattern was the same: a platform promising discretion, but failing to secure its core value proposition.

Web2 Tools of Tea & Web3 Upgrades

The incident reopens a critical discussion around digital identity and decentralization. Web3 advocates have long argued that user-controlled identity systems, such as those built with zero-knowledge proofs, decentralized identifiers (DIDs), or blockchain-based attestations, can prevent precisely this kind of disaster. Had Tea used a self-sovereign identity system, users could have verified themselves without ever uploading their actual ID to a centralized database. They could instead have shared attestations from trusted issuers or community verification methods. These systems remove the need to store vulnerable personal files, drastically lowering risk in the event of a breach (a minimal sketch of this attestation pattern follows at the end of this article).

Projects like BrightID and Proof of Humanity already explore these models by enabling anonymous but verifiable identities. Though still early-stage, these systems offer a glimpse of a safer future and could help reduce single points of failure. Web3's architecture, where users control their credentials and data flows through distributed systems, provides a fundamentally different risk profile that may be better suited for sensitive social platforms.

Web2 Failures Create Web3 Urgency

The Tea breach also poses real-world risks beyond the app itself. Exposed IDs and selfies could be used to open fraudulent crypto exchange accounts, commit SIM-swap attacks, or bypass Know Your Customer (KYC) checks on blockchain platforms. As digital assets grow more accessible, the overlap between privacy, dating, and financial fraud will only increase.

This could also create reputational damage for users outside of Tea. If their names or images are associated with unverifiable accusations, even false ones, those records could be copied or weaponized in future contexts. Search engines have long memories. So do blockchain crawlers.

For regulators and technologists, the Tea breach offers a blueprint of what not to do. It also poses a serious question: should platforms that deal in high-sensitivity content be allowed to launch without structural privacy safeguards? More pointedly, can any platform promise safety without first rethinking the assumptions of its data model?

What's Next for Tea & Other Web2 Tool Users

For now, Tea says it is reviewing its security practices and rebuilding user trust. But the breach highlights a larger industry problem. Platforms that promise anonymity and empowerment must treat data protection as a structural principle, not an optional feature. This incident may become a case study in why Web2 safety tools are insufficient for modern risks. Whether for dating, reputation, or whistleblowing, the next generation of platforms may need to be decentralized from the start.

Tea promised safety. What it delivered was a case study in how trust breaks down in the Web2 era.
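To make the attestation pattern mentioned above concrete: an issuer who has inspected a user's documents signs a narrow claim, and the platform verifies that signature without ever receiving or storing the documents themselves. This is a minimal sketch, assuming an Ed25519 keypair and a JSON claim format invented here for illustration; real systems would use standards like W3C Verifiable Credentials:

```python
# Sketch of a signed attestation: an issuer vouches for a claim so a
# platform can verify it WITHOUT storing the user's ID document.
# The claim format and names are illustrative, not a real standard.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# --- Issuer side (e.g., an ID-verification service) -------------------
issuer_key = Ed25519PrivateKey.generate()
issuer_public_key = issuer_key.public_key()   # published and well known

claim = json.dumps({"subject": "user-123", "over_18": True}).encode()
signature = issuer_key.sign(claim)            # the user keeps claim + signature

# --- Platform side (e.g., a dating app) -------------------------------
# The platform checks the issuer's signature; it never receives the
# driver's license or selfie that the issuer inspected.
issuer_public_key.verify(signature, claim)    # raises InvalidSignature if forged
print("claim accepted:", json.loads(claim))
```

If a platform built this way is breached, attackers obtain signed booleans, not driver's licenses.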


Forbes
Google Warns This Email Means Your Gmail Is Under Attack
You do not want to get this email.

With all the cybersecurity attacks compromising smartphones and PCs, it would be easy to conclude there's little you can do to stay safe. The truth is very different: most attacks are easily prevented with a few basic safeguards and some know-how.

So it is with the FBI's two warnings this week. The first is a resurgence of the Phantom Hacker attacks, which trick PC users into installing rogue apps. The second is a raft of fake Chrome installs and updates that provide initial access for ransomware. If you simply avoid installing apps linked in this way, you will steer clear of both attacks. It's the same with a new Amazon impersonation attack that has surged 5,000% in just two weeks. Don't click links in messages, even if they seem to come from Amazon.

And now Gmail attack warnings are turning up again on social media, which will likely frustrate Google, because its advice has been clear but is not yet landing with users. The latest warnings come courtesy of a refreshed EasyDMARC article covering the "no-reply" attacks from earlier this year, which hijacked "no-reply@" addresses to trick users into clicking links and giving up their Google account sign-in credentials.

Here again the advice is very simple. It shouldn't matter whether an email appears to come from Google: if it links to a sign-in page, it's an attack. Period. Any email that seems to come from Google but contains a sign-in link must be deleted.

"Sometimes," Google warns, "hackers will copy Google's 'Suspicious sign-in prevented' emails and other official Google emails to try to steal someone's account information." But the company tells all account holders that "Google emails will never take you to a sign-in page. Authentic emails sent from Google to your Google Account will never ask you to sign in again to the account they were sent to."

Similarly, Google will never "ask you to provide your password or other sensitive information by email or through a link, call you and ask for any forms of identification, including verification codes, send you a text message directing you to a sign-in page, or send a message via text or email asking you to forward a verification code."

With that in mind, you should not fall victim to these Google impersonation attacks, and if you stick to the basic rules on installs, links, and attachments, you'll likely stay safe from most of the others as well.
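Google's rule is mechanical enough to automate: an email that claims to be from Google and pushes a sign-in link is, by that rule, hostile. Here is a minimal sketch of a client-side filter built on that idea, assuming raw RFC 822 messages; the keyword list and addresses are illustrative:

```python
# Flag messages that claim to be from Google yet contain a sign-in link.
# Per Google, authentic Google emails never take you to a sign-in page.
# Heuristic sketch only; keywords and the sample message are illustrative.
import re
from email import message_from_string

SIGNIN_HINTS = re.compile(r"sign[ -]?in|log[ -]?in|verify your account", re.I)

def looks_like_google_phish(raw_message: str) -> bool:
    """True if the mail claims a Google sender and pushes a sign-in link."""
    msg = message_from_string(raw_message)
    claims_google = "google" in msg.get("From", "").lower()
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # skip multipart for brevity
    return claims_google and "http" in body.lower() and bool(SIGNIN_HINTS.search(body))

raw = """From: Google <no-reply@example.com>
Subject: Suspicious sign-in prevented

We blocked a sign-in attempt. Review it here: http://example.com/signin
"""
print(looks_like_google_phish(raw))  # True: per Google's rule, delete it
```

A real filter would also parse multipart bodies and check authentication results (SPF, DKIM, DMARC) rather than trusting the From header, but the core rule stays the same: Google never links you to a sign-in page.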


Forbes
Pixel 10 Pro Price Leak Confirms Google's Bold Decisions
As Google prepares to announce the Pixel 10 and Pixel 10 Pro in late August, some of the biggest questions have been about the price of the new handsets. With flagship phones from the likes of Apple and Samsung regularly pushing base models into four-figure prices, how competitive will Google's pricing be, and what does it say about its strategy?

Pixel 10 And Pixel 10 Pro Pricing

We have European pricing across the four main models, and it is broadly similar to the pricing of the Pixel 9 family:

Pixel 10: from €899 (128 GB)
Pixel 10 Pro: from €1,099 (128 GB)
Pixel 10 Pro XL: from €1,299 (256 GB)
Pixel 10 Pro Fold: from €1,899 (256 GB)

The Pro XL and the Pro Fold, sitting at the top of the portfolio, have no 128 GB versions listed. That's not to say the 128 GB option is definitely gone; those versions may yet be available only through the Google Store, region-locked, or as network exclusives. We shall see at the Made By Google launch event scheduled for Aug. 20.

By echoing 2024's pricing across the range, Google offers a glimpse into the wider strategy it is using to address both the consumer market and its Android partners.

Pixel 10 And Pixel 10 Pro Strategy

The steady approach to pricing, arguably mirrored by the incremental updates to the physical design of the Pixel 10 family over the Pixel 9, shows Google's confidence in the market. There may be new features in the hardware and software of the Pixel package, but the value proposition has not changed, so neither has the price.

That could be a quiet victory on retail shelves, where a new model is usually expected to cost more than the one it replaces. The counterpoint is that if Google did lift Pixel prices, the gap between its smartphones and the likes of the iPhone Pro or Galaxy S Ultra models would narrow; the stronger performance of the latter would stand out, and value-conscious consumers, seeing the closer pricing, would find the jump to a more recognisable phone brand easier to make.

There is also something here for the price-conscious buyer. Because each model starts lower, moving up to a version with more storage feels like good value, in part because it still keeps the Pixel below the entry-level pricing of the aforementioned Apple and Samsung devices.

Finding Value In The Pixel 10 And Pixel 10 Pro

Pricing for the Pixel 10 Pro and the rest of the Pixel 10 lineup is familiar. Google continues to offer a flagship experience at a price that will be perceived as lower than its competitors' flagships. Part of me wonders if this is because the real value in the Pixel family, for Google, lies in the software and subscription services such as Google One and Google Gemini.

Now read the latest Pixel 10 Pro, Samsung Galaxy, and smartphone news in Forbes' weekly Android news digest...