June 25, 2025
Why AI Benchmarking Needs A Rethink
Siobhan Hanna, SVP and General Manager, Welo Data.
AI models are evolving at breakneck speed, but the methods for measuring their performance remain stagnant, and the real-world consequences are significant. Models that haven't been thoroughly tested can lead to inaccurate conclusions, missed opportunities and costly errors.
As AI adoption accelerates, it's becoming clear that current testing frameworks fall short in assessing real-world reasoning capabilities—pointing to an urgent need for improved evaluation standards.
The Limitations Of Current AI Benchmarks
Traditional AI benchmarks are structured to evaluate basic tasks, such as factual recall and fluency, which are easy to measure. However, advanced capabilities like causal reasoning—the ability to identify cause-and-effect relationships—are more difficult to assess systematically despite their importance in everyday AI applications.
While most benchmarks are useful in gauging an AI model's capacity to process and reproduce information, they fail to assess whether the model is truly 'reasoning' or merely recognizing patterns from its training data. Understanding this distinction is crucial because, as S&P Global research notes, AI's reasoning ability directly impacts its applicability in tasks like problem-solving, decision making and generating insights that go beyond simple data retrieval.
Additionally, the prompts used to evaluate most AI capabilities are written primarily in English, neglecting the diverse linguistic and cultural contexts of the global marketplace. This limitation is especially relevant as AI models are increasingly deployed around the world, where accuracy and consistency across languages are vital, as recently discussed by Stanford Assistant Professor Sanmi Koyejo.
The Multilingual Blind Spot
Most datasets used for evaluating causal reasoning are designed with English as the primary language, leaving models' abilities to reason about cause-and-effect relationships in other languages largely untested.
Languages exhibit significant diversity in their grammatical structures, morphological systems and other linguistic features. If models have not been sufficiently exposed to these differences, their ability to identify causality can suffer.
My company, Welo Data, conducted an independent benchmarking study across over 20 large language models (LLMs) from 10 different developers, revealing just how significant this issue is. The evaluation used story-based prompts requiring contextual reasoning to test advanced causal inference across languages, including English, Spanish, Japanese, Korean, Turkish and Arabic. The results: LLMs often struggled with these complex causal inference tasks, especially when tested in languages other than English.
Many models showed inconsistent results when interpreting the same logical scenario in different languages. This inconsistency suggests that these models fail to account for linguistic differences in the way humans reason and convey causality. If AI is to be useful across diverse languages, benchmarking frameworks must evolve to test language model proficiency across linguistic boundaries.
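To make the consistency issue concrete, here is a minimal sketch (in Python) of the kind of cross-language check this implies. It assumes a hypothetical ask_model function wrapping whatever LLM API you use, and the scenario text is an illustrative placeholder, not an item from the Welo Data study.

```python
# Minimal sketch: check whether a model gives the same answer to the same
# causal scenario when it is posed in different languages.
# `ask_model` is a hypothetical wrapper around your LLM provider's API.

from collections import Counter

# One story-based causal question, rendered in several languages.
# (Illustrative placeholder text, not an item from the actual benchmark.)
scenario = {
    "en": "The road was icy and the driver braked late. The car slid into a fence. "
          "What was the primary cause of the crash? (A) the icy road (B) the fence",
    "es": "La carretera estaba helada y el conductor frenó tarde. El coche patinó contra una valla. "
          "¿Cuál fue la causa principal del accidente? (A) la carretera helada (B) la valla",
    "ja": "道路が凍結しており、運転手のブレーキが遅れた。車はフェンスに衝突した。"
          "事故の主な原因はどれか。 (A) 凍結した道路 (B) フェンス",
}

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; returns the letter of the option the model chose."""
    raise NotImplementedError("Wire this up to your model provider's API.")

def consistency(scenario: dict[str, str]) -> float:
    """Fraction of languages that agree with the model's most common answer."""
    answers = {lang: ask_model(prompt) for lang, prompt in scenario.items()}
    most_common_count = Counter(answers.values()).most_common(1)[0][1]
    return most_common_count / len(answers)

# A consistency score of 1.0 means the model answered identically in every
# language; lower scores flag exactly the kind of gap the study observed.
```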
The Causal Reasoning Gap
Causal reasoning is a crucial aspect of human intelligence, allowing us to understand what happens and why it happens. The study found that many AI models still struggle with this fundamental capability, particularly in multilingual contexts. While these models excel at pattern recognition, they often fail to effectively identify causal relationships in scenarios that require multistep reasoning.
This gap is a significant limitation when deploying these models in real-world scenarios, such as healthcare, finance or customer support, where accurate and nuanced decision making is critical. Existing benchmarks often simplify cause-and-effect scenarios to tasks that rely on well-established datasets or pre-defined solutions, making it difficult to determine whether the model is truly reasoning or simply reproducing learned patterns.
One promising direction involves using more complex, human-crafted testing scenarios designed to require genuine causal inference rather than pattern recognition. By incorporating such methods into evaluation frameworks, organizations can more clearly identify where models fall short—especially in multilingual or high-stakes applications—and take targeted steps to improve performance.
A New Way Forward: Evolving AI Testing For The Future
To truly understand how AI models will perform in functional settings, testing methodologies must assess the full range of cognitive abilities required in human-like reasoning. There are several ways companies can adopt better AI testing methodologies (a brief sketch of how they might fit together follows the list):
• Implement a multilingual approach. AI models must be evaluated across multiple languages to ensure they can handle the complexities of global communication. This is especially important for companies operating in diverse markets or serving international customers.
• Incorporate complex, real-world scenarios. Focus on evaluating AI through scenarios where multiple factors and variables interact, allowing for an accurate measurement of AI's capabilities.
• Emphasize causal reasoning with novel data. Prioritize assessing causal reasoning abilities using previously unseen scenarios and examples that require genuine understanding of cause-and-effect relationships. This ensures the AI is demonstrating true causal inference rather than pattern matching or recalling information from its training data.
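As a rough illustration of how these three recommendations might fit together, the sketch below scores a model on novel, human-authored causal scenarios across several languages and reports per-language accuracy. The scenario file format, its field names and the ask_model helper are assumptions made for the example, not a description of any particular benchmark.

```python
# Sketch of a small evaluation harness applying the recommendations above:
# multilingual prompts, multi-factor scenarios and novel (held-out) items
# scored against human-provided gold answers.
# Assumed JSONL scenario format: {"id", "language", "prompt", "gold_answer"}.

import json
from collections import defaultdict

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; returns the model's chosen answer label."""
    raise NotImplementedError("Replace with a call to your model provider.")

def evaluate(scenario_path: str) -> dict[str, float]:
    """Return accuracy per language over a set of unseen, human-written scenarios."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(scenario_path, encoding="utf-8") as f:
        for line in f:                       # one JSON scenario per line
            item = json.loads(line)
            lang = item["language"]
            prediction = ask_model(item["prompt"]).strip().upper()
            total[lang] += 1
            if prediction == item["gold_answer"].strip().upper():
                correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Example usage (with a hypothetical file of held-out scenarios):
# print(evaluate("novel_causal_scenarios.jsonl"))
```

Reporting accuracy separately for each language, rather than as a single blended score, is what surfaces the multilingual gaps described earlier.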
Paving The Way: Building Better AI
Existing benchmarks often fail to assess the full range of an AI model's capabilities, which can leave businesses with incomplete or misleading information about how their models perform, depending on which benchmarks are used and what those benchmarks were designed to measure.
As AI continues to evolve, so too must the methods used to evaluate its performance.
By adopting a comprehensive, multilingual and real-world testing approach, we can ensure that AI models are not only capable but also reliable and equitable across diverse languages and contexts. It's time to rethink AI benchmarking—and, with that, the future of AI itself.