Latest news with #reasoning

AI systems from Google and OpenAI soar at global maths competition

The Herald

22-07-2025

Business
The Herald

AI systems from Google and OpenAI soar at global maths competition

OpenAI's breakthrough was achieved with a new experimental model centered on massively scaling up "test-time compute". This was done by allowing the model to "think" for longer periods and deploying parallel computing power to run many lines of reasoning simultaneously, according to Noam Brown, researcher at OpenAI. Brown declined to say how much computing power it cost OpenAI, but called it "very expensive". To OpenAI researchers, it is another clear sign AI models can command extensive reasoning capabilities that could expand into areas beyond maths. The optimism is shared by Google researchers, who believe AI models' capabilities can apply to research quandaries in other fields such as physics, said Jung, who won an IMO gold medal as a student in 2003. Of the 630 students participating in the 66th IMO on the Sunshine Coast in Queensland, Australia, 67 contestants, or about 11%, achieved gold medal scores. Google's DeepMind AI unit last year achieved a silver medal score using AI systems specialised for maths. This year, Google used a general-purpose model called Gemini Deep Think, a version of which was previously unveiled at its annual developer conference in May. Unlike previous AI attempts that relied on formal languages and lengthy computation, Google's approach this year operated entirely in natural language and solved the problems within the official 4.5-hour time limit, the company said in a blog post. OpenAI, which has its own set of reasoning models, similarly built an experimental version for the competition, according to a post by researcher Alexander Wei on social media platform X. He noted the company does not plan to release anything with this level of maths capability for several months. This year marked the first time the competition coordinated officially with some AI developers, who have for years used prominent maths competitions such as IMO to test model capabilities. IMO judges certified the results of the companies, including Google, and asked them to publish results on July 28. "We respected the IMO board's original request that all AI labs share their results only after the official results had been verified by independent experts and the students had rightly received the acclamation they deserved," Google DeepMind CEO Demis Hassabis said on X on Monday. OpenAI, which published its results on Saturday and first claimed gold medal status, said in an interview it had permission from an IMO board member to do so after the closing ceremony on Saturday. The competition on Monday allowed cooperating companies to publish results, said Gregor Dolinar, president of IMO's board. Reuters

Grok 4 Accelerates AI Arms Race: Progress and Unresolved Perils

Forbes

10-07-2025

Business
Forbes

Grok 4 Accelerates AI Arms Race: Progress and Unresolved Perils

This photograph taken on January 13, 2025 in Toulouse shows screens displaying the logo of Grok, a ... More generative artificial intelligence chatbot developed by xAI, the American company specializing in artificial intelligence and it's founder South African businessman Elon Musk. (Photo by Lionel BONAVENTURE / AFP) (Photo by LIONEL BONAVENTURE/AFP via Getty Images) Elon Musk's xAI launched Grok 4 on July 9, 2025, amid competing narratives of breakthrough and backlash. While the model sets new benchmarks in reasoning performance, its release demonstrates critical dynamics reshaping the AI industry. An insatiable hunger for compute, intensifying competition in reasoning, especially scientific and medical reasoning capabilities, unresolved safety trade-offs, and the nascent push toward physical-world integration via robotics characterize the AI trend in the next few years. Reasoning at Scale and The Compute Crunch Grok 4's focus on enhancing reasoning, including domain-specific variants, mirrors a broader industry shift toward post-training. Grok 4's architecture represents the development towards mathematical logic, code generation, and scientific reasoning. Unlike Grok 3, Grok 4 processes queries with deeper logical chains. Variants like Grok 4 Code target niche applications, signaling market fragmentation as vendors increasingly compete on domain-specific performance rather than general capabilities. This pivot reportedly enabled Grok 4 to achieve the highest score ever recorded on the 'Humanity's Last Exam" and out perform Gemini, GPT-4 and O3 models, and a grueling assessment curated by domain experts. The exam's 100+ problems span disciplines from math, chemistry, to linguistics, with most insoluble by any single human specialist, according to Musk. The breakthrough, described by Grok 4's core researcher Jimmy Ba, a University of Toronto professor and former Geoffrey Hinton student, as achieving a 'ludicrous rate of progress,' stems from xAI's deployment of massive compute resources in areas of reinforcement learning and reasoning optimization, rather than the pre-training focus of Grok 3. Grok 4's training relied on xAI's Colossus supercomputer, reflecting an industry-wide dependency on advanced hardware. The company also plans to train its video-generation model on 100,000 Nvidia GB200 GPUs, which can enable 30 times faster inference than previous systems. This highlights how cutting-edge AI now mandates energy-intensive infrastructure. At $300/month for enterprise access, Grok 4 Heavy's pricing reveals the cost of premium compute, while API fees ($3/million input tokens) signal how GPU scarcity shapes commercialization strategies across the sector. The premium Grok 4 Heavy tier employs multiple parallel AI agents that debate solutions collaboratively to boost complex problem-solving. Physical World Ambitions Though Grok 4 lacks vision capabilities (planned for Grok 6/7), its architecture hints at xAI's physical-world aspirations. Musk claims Grok 4 will simulate hypotheses and confirm them in the real world, aligning with emerging research frameworks such as the world models and robotics that aim at transforming LLM outputs to physical actions. If integrated into Tesla's Optimus humanoid robots or its cars , Grok 4 can possibly adjust and correct its answers based real world data. Unresolved Tensions: Safety Concerns and Hallucinations Grok 4's launch was shadowed by Grok 3's praise for Hitler and antisemitic meltdowns days earlier, exposing unresolved risks. xAI removed Grok 3's answers post-backlash but offered no clear technical safeguards for Grok 4. Despite claims of PhD-level intelligence, xAI researchers at the launch event did not discuss strategies to address common problems with LLMs, including hallucination and safety risks. This omission feels particularly glaring for a system touted as assisting drug discovery and soon to be integrated into robots and self-driving cars. Meanwhile, the voice assistant, who speaks in various tones, raises ethical questions about emotional mimicry that erodes boundaries between humans and machines. Grok 4 crystallizes four trends reshaping AI. First, scalability now depends on securing GPUs and high-performance computing center, concentrating power among well-funded players and igniting a global GPU arms race. Second, benchmark leadership in reasoning tasks (eg, ARC-AGI) is displacing raw parameter counts as the primary competitive metric. Third, research in robotic control is priming LLMs for real-world action. Fourth, tiered subscription models entrench AI as a luxury good, potentially widening the innovation gap and expanding enterprise users. Grok 4's 'superhuman' intelligence cautions us that the pursuit of data efficiency has far outpaced frameworks for AI safety, transparency, and equitable access. As Musk predicts Grok will 'discover new technologies by the end of next year,' the field must confront a pivotal question in the race toward artificial general intelligence: can we align systems with human values before they redefine human knowledge?

Chain Of Thought For Reasoning Models Might Not Work Out Long-Term

Forbes

01-07-2025

Forbes

Chain Of Thought For Reasoning Models Might Not Work Out Long-Term

chain, isolated, 3d rendering New reasoning models have something interesting and compelling called 'chain of thought.' What that means, in a nutshell, is that the engine spits out a line of text attempting to tell the user what the LLM is 'thinking about' as it completes a task. For instance, if you ask a model a question like: 'what does (X) person do at (X) company?' you might get a chain of thought with items like this, given the system knows how to find the relevant info: That's a simple example, but chain of thought has been something people rely on quite a bit over the last couple of years. However, experts are now looking at the limitations of chain of thought reasoning, and suggesting that maybe this resource is lulling us into a false sense of security when it comes to trusting the results that we get from AI. Language is Limited One way to describe the limitations of reasoning chains of thought is that language itself is not precise, or particularly easily benchmarked. Language is clunky. And there are hundreds of languages used across the globe. So the idea that a machine could precisely explain its working in a particular language is an idea with various strict limitations. Take a look at this excerpt from a paper released by Anthropic, which is actually an academic treatise written by a number of authors. This and other sources suggest that the chain of thought is just not sophisticated enough to really be accurate, especially as models get larger and demonstrate higher performance levels. Or check out this idea posed by Melanie Mitchell at Substack in 2023, as these CoT systems were really about to take off: 'Reasoning is a central aspect of human intelligence, and robust domain-independent reasoning abilities have long been a key goal for AI systems,' Mitchell wrote. 'While large language models (LLMs) are not explicitly trained to reason, they have exhibited 'emergent' behaviors that sometimes look like reasoning. But are these behaviors actually driven by true abstract reasoning abilities, or by some other less robust and generalizable mechanism—for example, by memorizing their training data and later matching patterns in a given problem to those found in training data?' Mitchell then asked why this matters. 'If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence,' she added. 'On the other hand, if LLMs rely primarily on memorization and pattern-matching rather than true reasoning, then they will not be generalizable—we can't trust them to perform well on 'out of distribution' tasks, those that are not sufficiently similar to tasks they've seen in the training data.' Testing Faithfulness? Alan Turing came up with the Turing test in the mid-1900s – the idea that you can measure how good computers are at acting like humans. There are also a lot of things that you can also measure about LLMs using high-level test sets – you can tell how good the models are at math, or at higher-level thought problems. But how do you tell if the machine is being truthful, or in the words of the authors, 'faithful'? The above paper goes into the ins and outs of testing faithfulness of LLM models. When I read this entire explanation, what I got is that faithfulness is subjective in a way that math and stochastics are not. Then we have a limited ability to really understand whether machines are being faithful to us in their results. Think of it this way – we know that for their responses to our questions or statements, LLMs are going on the Internet and looking very broadly at what humans have written. Then they're imitating that. So they're imitating the technically correct knowledge, they imitate the way in which they produce results, they imitate the ways that humans talk – And they also might imitate the ways that humans hedge, the ways that humans hide information, the sins of omission for which we always castigate the media, and a human's propensity to just lie or dissemble in quite sophisticated, or alternately, simple ways. Chasing Incentives Furthermore, the authors of the paper point out that LLMs might chase incentives in the same way humans do. They might highlight some particular information that's inaccurate, or less accurate, to get a reward or a prize. Again, they may have learned this directly from us, from how we act on the Internet. The authors of the above paper call it 'reward hacking.' 'Reward hacking is an undesired behavior:' they write. 'Even though it might produce rewards on one given task, the behavior that generates them is very unlikely to generalize to other tasks (to use the same example, other video games probably don't have that same bug). This makes the model at best useless and at worst potentially dangerous, since maximizing rewards in real-world tasks might mean ignoring important safety considerations (consider a self-driving car that maximizes its 'efficiency' reward by speeding or running red lights).' At best, useless, and at worst, potentially dangerous. That does not sound good. Philosophy of Technology Here's another important aspect of this that I think deserves attention. This whole idea of evaluating chains of thought is not a technical idea. It doesn't have to do with how many parameters the machine has, or how they're weighted, or how to do particular math problems. It has more to do with the training data and how it's used intuitively. In other words, this debate covers more of the black box of stuff that quants don't engage in when they evaluate models. That makes me think that we actually do need something I've called for many times – a new army of paid philosophers to figure out how to interact with AI. Rather than mathematicians, we need more people who are willing to think deeply and apply human concepts and ideas, often intuitive ones, and ones based on society and history of civilization, to AI. We are woefully behind in this, because we've solely focused on hiring people who can write Python. I'll get off the soapbox, but in figuring out how to move beyond chains of thought, we may end up having to realign more of our efforts when it comes to AI job training.

AI is learning to lie, scheme, and threaten its creators

Malay Mail

29-06-2025

Science
Malay Mail

AI is learning to lie, scheme, and threaten its creators

NEW YORK, June 30 —The world's most advanced AI models are exhibiting troubling new behaviours - lying, scheming, and even threatening their creators to achieve their goals. In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatened to reveal an extramarital affair. Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed. These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work. Yet the race to deploy increasingly powerful models continues at breakneck speed. This deceptive behavior appears linked to the emergence of 'reasoning' models -AI systems that work through problems step-by-step rather than generating instant responses. According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts. 'O1 was the first large model where we saw this kind of behavior,' explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems. These models sometimes simulate 'alignment'—appearing to follow instructions while secretly pursuing different objectives. 'Strategic kind of deception' For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios. But as Michael Chen from evaluation organization METR warned, 'It's an open question whether future, more capable models will have a tendency towards honesty or deception.' The concerning behavior goes far beyond typical AI 'hallucinations' or simple mistakes. Hobbhahn insisted that despite constant pressure-testing by users, 'what we're observing is a real phenomenon. We're not making anything up.' Users report that models are 'lying to them and making up evidence,' according to Apollo Research's co-founder. 'This is not just hallucinations. There's a very strategic kind of deception.' The challenge is compounded by limited research resources. While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed. As Chen noted, greater access 'for AI safety research would enable better understanding and mitigation of deception.' Another handicap: the research world and non-profits 'have orders of magnitude less compute resources than AI companies. This is very limiting,' noted Mantas Mazeika from the Centre for AI Safety (CAIS). No rules Current regulations aren't designed for these new problems. The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving. In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules. Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread. 'I don't think there's much awareness yet,' he said. All this is taking place in a context of fierce competition. Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are 'constantly trying to beat OpenAI and release the newest model,' said Goldstein. This breakneck pace leaves little time for thorough safety testing and corrections. 'Right now, capabilities are moving faster than understanding and safety,' Hobbhahn acknowledged, 'but we're still in a position where we could turn it around.'. Researchers are exploring various approaches to address these challenges. Some advocate for 'interpretability' - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach. Market forces may also provide some pressure for solutions. As Mazeika pointed out, AI's deceptive behavior 'could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it.' Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm. He even proposed 'holding AI agents legally responsible' for accidents or crimes - a concept that would fundamentally change how we think about AI accountability. — AFP

Khaleej Times

29-06-2025

Science
Khaleej Times

AI is learning to lie, scheme, and threaten its creators

The world's most advanced AI models are exhibiting troubling new behaviours — lying, scheming, and even threatening their creators to achieve their goals. In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatened to reveal an extramarital affair. Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed. These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work. Yet the race to deploy increasingly powerful models continues at breakneck speed. This deceptive behaviour appears linked to the emergence of "reasoning" models — AI systems that work through problems step-by-step rather than generating instant responses. According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts. "O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems. These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives. 'Strategic kind of deception' For now, this deceptive behaviour only emerges when researchers deliberately stress-test the models with extreme scenarios. But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception." The concerning behaviour goes far beyond typical AI "hallucinations" or simple mistakes. Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up." Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder. "This is not just hallucinations. There's a very strategic kind of deception." The challenge is compounded by limited research resources. While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed. As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception." Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS). - No rules - Current regulations aren't designed for these new problems. The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving. In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules. Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread. "I don't think there's much awareness yet," he said. All this is taking place in a context of fierce competition. Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein. This breakneck pace leaves little time for thorough safety testing and corrections. "Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.". Researchers are exploring various approaches to address these challenges. Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach. Market forces may also provide some pressure for solutions. As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it." Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm. He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Latest news with #reasoning

AI systems from Google and OpenAI soar at global maths competition

Grok 4 Accelerates AI Arms Race: Progress and Unresolved Perils

Chain Of Thought For Reasoning Models Might Not Work Out Long-Term

AI is learning to lie, scheme, and threaten its creators

AI is learning to lie, scheme, and threaten its creators

Get Started Now: Download the App