People are using Super Mario to benchmark AI now
Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic's Claude 3.7 performed the best, followed by Claude 3.5. Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled.
It wasn't quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario.
GamingAgent, which Hao AI Lab developed in-house, fed the AI basic instructions like "If an obstacle or enemy is near, move/jump left to dodge," along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.
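That loop can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual GamingAgent framework: the real system captures emulator screenshots and calls a model API, while here the model call is stubbed out and every name (query_model, run_step, press) is a hypothetical stand-in.

```python
# Minimal sketch of a GamingAgent-style control loop (hypothetical names).
# The model is given an instruction prompt plus an observation and returns
# a snippet of Python that presses controller buttons.

PROMPT = "If an obstacle or enemy is near, move/jump left to dodge."

def query_model(prompt: str, observation: dict) -> str:
    """Stand-in for the LLM call: returns Python code as a string,
    mimicking the model's code-generation output."""
    if observation.get("enemy_near"):
        return "press('left'); press('A')"   # dodge: move left and jump
    return "press('right')"                  # otherwise keep running right

def run_step(observation: dict) -> list:
    """Execute the model-generated snippet against a fake controller
    and return the sequence of buttons it pressed."""
    pressed = []
    def press(button: str) -> None:
        pressed.append(button)
    exec(query_model(PROMPT, observation), {"press": press})
    return pressed
```

In this toy version, run_step({"enemy_near": True}) yields a dodge (left plus jump), while an empty observation keeps Mario running right. The latency problem the researchers describe lives in query_model: a real model call can take seconds, during which the game keeps running.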
Still, Hao AI Lab says the game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models like OpenAI's o1, which "think" through problems step by step to arrive at solutions, performed worse than "non-reasoning" models, despite being generally stronger on most benchmarks.
One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while — seconds, usually — to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.
Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI's gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.
The recent flashy gaming benchmarks point to what Andrej Karpathy, an AI researcher and founding member of OpenAI, called an "evaluation crisis."
"I don't really know what [AI] metrics to look at right now," he wrote in a post on X. "TLDR my reaction is I don't really know how good these models are right now."
At least we can watch AI play Mario.
Related Articles
Yahoo
Top AI Researchers Concerned They're Losing the Ability to Understand What They've Created
Researchers from OpenAI, Google DeepMind, and Meta have joined forces to warn about what they're building. In a new position paper, 40 researchers spread across those companies called for more investigation of AI powered by so-called "chains-of-thought" (CoT), the "thinking out loud" process that advanced "reasoning" models — the current vanguard of consumer-facing AI — use when they're working through a query.

As those researchers acknowledge, CoTs add a degree of transparency to the inner workings of AI, allowing users to see "intent to misbehave" or get stuff wrong as it happens. Still, there is "no guarantee that the current degree of visibility will persist," especially as models continue to advance. Depending on how they're trained, advanced models may no longer, the paper suggests, "need to verbalize any of their thoughts, and would thus lose the safety advantages." There's also the non-zero chance that models could intentionally "obfuscate" their CoTs after realizing that they're being watched, the researchers noted — and as we've already seen, AI has indeed rapidly become very good at lying and deception.

To make sure this valuable visibility continues, the cross-company consortium is calling on developers to start figuring out what makes CoTs "monitorable," or what makes the models think out loud the way they do. In this request, those same researchers seem to be admitting something stark: that nobody is entirely sure why the models are "thinking" this way, or how long they will continue to do so. Zooming out from the technical details, it's worth taking a moment to consider how strange this situation is.
Top researchers in an emerging field are warning that they don't quite understand how their creation works, and lack confidence in their ability to control it going forward, even as they forge ahead making it stronger. There's no clear precedent in the history of innovation, even looking back to civilization-shifting inventions like atomic energy and the combustion engine.

In an interview with TechCrunch about the paper, OpenAI research scientist and paper coauthor Bowen Baker explained how he sees the situation. "We're at this critical time where we have this new chain-of-thought thing," Baker told the website. "It seems pretty useful, but it could go away in a few years if people don't really concentrate on it." "Publishing a position paper like this, to me, is a mechanism to get more research and attention on this topic," he continued, "before that happens."

Once again, there appears to be tacit acknowledgement of AI's "black box" nature — and to be fair, even CEOs like OpenAI's Sam Altman and Anthropic's Dario Amodei have admitted that at a deep level, they don't really understand how the technology they're building works.

Along with its 40-researcher author list, which includes DeepMind cofounder Shane Legg and xAI safety advisor Dan Hendrycks, the paper has also drawn endorsements from industry luminaries including former OpenAI chief scientist Ilya Sutskever and AI godfather and Nobel laureate Geoffrey Hinton. Though xAI founder Elon Musk's name doesn't appear on the paper, with Hendrycks on board, all of the "Big Five" firms — OpenAI, Google, Anthropic, Meta, and xAI — have been brought together to warn about what might happen when and if AI stops showing its work.

In doing so, that powerful cabal has said the quiet part out loud: that they don't feel entirely in control of AI's future. For companies with untold billions between them, that's a pretty strange message to market — which makes the paper all the more remarkable.
More on AI warnings: Bernie Sanders Issues Warning About How AI Is Really Being Used


Bloomberg
Anthropic Draws Investor Interest at More Than $100 Billion Valuation
OpenAI rival Anthropic is in the early stages of planning another investment round that could value the company at more than $100 billion, according to a person familiar with the matter. The artificial intelligence startup is not formally fundraising, said people with knowledge of the situation, who asked not to be identified discussing private information. But pre-emptive funding offers from VCs for the top artificial intelligence companies have become the norm in Silicon Valley, and investors have approached Anthropic indicating that they'd invest at a valuation over $100 billion, the people said.
Yahoo
OpenAI lists Google as cloud partner amid growing demand for computing capacity
(Reuters) - OpenAI has included Alphabet's Google Cloud among its suppliers to meet escalating demand for computing capacity, according to an updated list published on the ChatGPT maker's website. The artificial-intelligence giant also relies on services from Microsoft, Oracle, and CoreWeave.

The deal with Google, finalized in May after months of discussions, was first reported by Reuters, citing a source, in June. The arrangement underscores how massive computing demands to train and deploy AI models are reshaping competitive dynamics in the industry, and marks OpenAI's latest move to diversify its compute sources beyond its major backer Microsoft, including through its high-profile Stargate data center project.

Earlier this year, OpenAI partnered with SoftBank and Oracle on the $500 billion Stargate infrastructure program and signed multi-billion-dollar agreements with CoreWeave to bolster computing capacity. The partnership with Google is the latest of several maneuvers by OpenAI to reduce its dependency on Microsoft, whose Azure cloud service had served as the ChatGPT maker's exclusive data center infrastructure provider until January. Google and OpenAI discussed an arrangement for months but were previously blocked from signing a deal due to OpenAI's lock-in with Microsoft, a source had told Reuters.