
Latest news with #SamyBengio

AI experts divided over Apple's research on large reasoning model accuracy

Business Standard
15-06-2025

A recent study by tech giant Apple, which claims that the accuracy of frontier large reasoning models (LRMs) declines as task complexity increases and eventually collapses altogether, has divided experts in the artificial intelligence (AI) world. The paper, titled 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity', was published by Apple last week.

Apple said it conducted experiments across diverse puzzles which show that such LRMs face a complete accuracy collapse beyond certain complexities. Their reasoning effort increases with the complexity of a problem up to a point, then declines despite an adequate token budget. A token budget for a large language model (LLM) refers to the practice of setting a limit on the number of tokens the model can use for a specific task (see the short sketch at the end of this article).

The paper is co-authored by Samy Bengio, senior director of AI and ML research at Apple, who is also the brother of Yoshua Bengio, often referred to as one of the godfathers of AI.

Meanwhile, AI company Anthropic, backed by Amazon, countered Apple's claims in a separate paper, saying that the 'findings primarily reflect experimental design limitations rather than fundamental reasoning failures.' 'Their central finding has significant implications for AI reasoning research. However, our analysis reveals that these apparent failures stem from experimental design choices rather than inherent model limitations,' it said.

Mayank Gupta, founder of Swift Anytime, who is currently building an AI product in stealth, told Business Standard that both sides make equally important points. 'What this tells me is that we're still figuring out how to measure reasoning in LRMs the right way. The models are improving rapidly, but our evaluation tools haven't caught up. We need tools that separate how well an LRM reasons from how well it generates output and that's where the real breakthrough lies,' he said.

Gary Marcus, a US academic who has become a voice of caution on the capabilities of AI models, said that in a best-case scenario these models can write Python code, supplementing their own weaknesses with outside symbolic code, but even this is not reliable. 'What this means for business and society is that you can't simply drop o3 or Claude into some complex problem and expect it to work reliably,' he wrote on his blog, Marcus on AI.

The Apple researchers conducted experiments comparing thinking and non-thinking model pairs across controlled puzzle environments. 'The most interesting regime is the third regime where problem complexity is higher and the performance of both models have collapsed to zero. Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,' they wrote.

Apple's observations in the paper may explain why the iPhone maker has been slow to embed AI across its products and operating systems, a point on which it was criticised at the Worldwide Developers Conference (WWDC) last week. This approach is the opposite of the one adopted by Microsoft-backed OpenAI, Meta, and Google, which are spending billions to build more sophisticated frontier models to solve more complex tasks. However, other voices believe that Apple's paper has its own limitations.
Ethan Mollick, an associate professor at the Wharton School who studies the effects of AI on work, entrepreneurship, and education, noted on X that while understanding the limits of reasoning models is useful, it is premature to say that LLMs are hitting a wall.
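To make the token-budget idea concrete, here is a minimal, self-contained Python sketch of a generation loop with a hard cap on output tokens. It is an illustration only: `generate_with_budget`, `model_step`, the toy model, and the `<eos>` stop token are hypothetical stand-ins, not any vendor's actual API.

```python
def generate_with_budget(model_step, prompt_tokens, token_budget):
    """Decode until the model stops on its own or the budget runs out.

    model_step   -- hypothetical callable returning the next token,
                    given the sequence generated so far
    token_budget -- hard cap on the number of new tokens produced
    """
    tokens = list(prompt_tokens)
    for _ in range(token_budget):      # the budget is a hard ceiling
        next_token = model_step(tokens)
        if next_token == "<eos>":      # the model may stop well before the cap
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]

# Toy stand-in model: emits five tokens, then signals it is done.
def toy_step(seq):
    n = len(seq)
    return f"w{n}" if n < 5 else "<eos>"

print(generate_with_budget(toy_step, [], token_budget=3))    # budget binds: ['w0', 'w1', 'w2']
print(generate_with_budget(toy_step, [], token_budget=100))  # model stops early, after 'w4'
```

The paper's counter-intuitive finding concerns the second case: beyond a certain problem complexity, reasoning models cut their effort short even when the cap is far from exhausted.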

Apple Researchers Just Released a Damning Paper That Pours Water on the Entire AI Industry

Yahoo
09-06-2025

Researchers at Apple have released an eyebrow-raising paper that throws cold water on the "reasoning" capabilities of the latest, most powerful large language models.

In the paper, a team of machine learning experts makes the case that the AI industry is grossly overstating the ability of its top AI models, including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini. In particular, the researchers assail the claims of companies like OpenAI that their most advanced models can now "reason" — a supposed capability that the Sam Altman-led company has increasingly leaned on over the past year for marketing purposes — which the Apple team characterizes as merely an "illusion of thinking."

It's a particularly noteworthy finding, considering Apple has been accused of falling far behind the competition in the AI space. The company has chosen a far more careful path to integrating the tech into its consumer-facing products — with some seriously mixed results so far.

In theory, reasoning models break down user prompts into pieces and use sequential "chain of thought" steps to arrive at their answers. But now, Apple's own top minds are questioning whether frontier AI models simply aren't as good at "thinking" as they're being made out to be. "While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood," the team wrote in its paper.

The authors — who include Samy Bengio, the director of Artificial Intelligence and Machine Learning Research at the software and hardware giant — argue that the existing approach to benchmarking "often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality."

By using "controllable puzzle environments," the team estimated the AI models' ability to "think" — and made a seemingly damning discovery. "Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," they wrote. Thanks to a "counter-intuitive scaling limit," the AIs' reasoning ability "declines despite having an adequate token budget." Put simply, even when given room to work, the models struggle with problems beyond a certain threshold of complexity — the result of "an 'overthinking' phenomenon," in the paper's phrasing.

The finding is reminiscent of a broader trend. Benchmarks have shown that the latest generation of reasoning models is more prone to hallucinating, not less, indicating the tech may now be heading in the wrong direction in a key way.

Exactly how reasoning models choose which path to take remains surprisingly murky, the Apple researchers found. "We found that LRMs have limitations in exact computation," the team concluded in its paper. "They fail to use explicit algorithms and reason inconsistently across puzzles." (A short example of what such an explicit algorithm looks like appears at the end of this article.)

The researchers claim their findings raise "crucial questions" about the current crop of AI models' "true reasoning capabilities," undercutting a much-hyped new avenue in the burgeoning industry. That's despite tens of billions of dollars being poured into the tech's development, with the likes of OpenAI, Google, and Meta constructing enormous data centers to run increasingly power-hungry AI models.

Could the Apple researchers' finding be yet another canary in the coal mine, suggesting the tech has "hit a wall"? Or is the company trying to hedge its bets, calling out competition that is outperforming it as it lags behind, as some have suggested? It's certainly a surprising conclusion, considering Apple's precarious positioning in the AI industry: at the same time that its researchers are trashing the tech's current trajectory, it has promised a suite of Apple Intelligence tools for devices like the iPhone and MacBook.

"These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the paper reads.
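To illustrate what the paper means by an "explicit algorithm": one of the controlled puzzle environments used in the study is the Tower of Hanoi, a puzzle that a few lines of conventional, symbolic code solve exactly at any scale. A minimal Python sketch follows; the function name and demo values are illustrative, not taken from the paper.

```python
def hanoi(n, source, target, spare, moves):
    """Append the exact optimal move sequence for n disks to `moves`.

    Classic recursion: park n-1 disks on the spare peg, move the
    largest disk, then move the n-1 disks on top of it. The procedure
    is correct for any n; complexity never degrades its accuracy.
    """
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)
    moves.append((source, target))  # move disk n from source to target
    hanoi(n - 1, spare, target, source, moves)

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 moves (2**8 - 1), generated mechanically
```

The contrast is the point of the quoted finding: a mechanical procedure like this scales to arbitrary instance sizes, whereas the models' accuracy on the same kind of puzzle collapses once the instance grows past a threshold.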
