
Latest news with #PersonQA

Why is AI hallucinating more frequently, and how can we stop it?

Yahoo

21-06-2025

  • Science
  • Yahoo

Why is AI hallucinating more frequently, and how can we stop it?

The more advanced artificial intelligence (AI) gets, the more it "hallucinates" and provides incorrect and inaccurate information. Research conducted by OpenAI found that its latest and most powerful reasoning models, o3 and o4-mini, hallucinated 33% and 48% of the time, respectively, when tested by OpenAI's PersonQA benchmark. That's more than double the rate of the older o1 model. While o3 delivers more accurate information than its predecessor, it appears to come at the cost of more inaccurate hallucinations.

This raises a concern over the accuracy and reliability of large language models (LLMs) such as AI chatbots, said Eleanor Watson, an Institute of Electrical and Electronics Engineers (IEEE) member and AI ethics engineer at Singularity University. "When a system outputs fabricated information — such as invented facts, citations or events — with the same fluency and coherence it uses for accurate content, it risks misleading users in subtle and consequential ways," Watson told Live Science.

Related: Cutting-edge AI models from OpenAI and DeepSeek undergo 'complete collapse' when problems get too difficult, study reveals

The issue of hallucination highlights the need to carefully assess and supervise the information AI systems produce when using LLMs and reasoning models, experts say.

The crux of a reasoning model is that it can handle complex tasks by essentially breaking them down into individual components and coming up with solutions to tackle them. Rather than simply generating answers based on statistical probability, reasoning models come up with strategies to solve a problem, much like how humans think. In order to develop creative, and potentially novel, solutions to problems, AI needs to hallucinate — otherwise it's limited to the rigid data its LLM ingests.

"It's important to note that hallucination is a feature, not a bug, of AI," Sohrob Kazerounian, an AI researcher at Vectra AI, told Live Science. "To paraphrase a colleague of mine, 'Everything an LLM outputs is a hallucination. It's just that some of those hallucinations are true.' If an AI only generated verbatim outputs that it had seen during training, all of AI would reduce to a massive search problem."

"You would only be able to generate computer code that had been written before, find proteins and molecules whose properties had already been studied and described, and answer homework questions that had already previously been asked before. You would not, however, be able to ask the LLM to write the lyrics for a concept album focused on the AI singularity, blending the lyrical stylings of Snoop Dogg and Bob Dylan."

In effect, LLMs and the AI systems they power need to hallucinate in order to create, rather than simply serve up existing information. It is similar, conceptually, to the way that humans dream or imagine scenarios when conjuring new ideas.

However, AI hallucinations present a problem when it comes to delivering accurate and correct information, especially if users take the information at face value without any checks or oversight. "This is especially problematic in domains where decisions depend on factual precision, like medicine, law or finance," Watson said. "While more advanced models may reduce the frequency of obvious factual mistakes, the issue persists in more subtle forms.
Over time, confabulation erodes the perception of AI systems as trustworthy instruments and can produce material harms when unverified content is acted upon."

And this problem looks to be exacerbated as AI advances. "As model capabilities improve, errors often become less overt but more difficult to detect," Watson noted. "Fabricated content is increasingly embedded within plausible narratives and coherent reasoning chains. This introduces a particular risk: users may be unaware that errors are present and may treat outputs as definitive when they are not. The problem shifts from filtering out crude errors to identifying subtle distortions that may only reveal themselves under close scrutiny."

Kazerounian backed this viewpoint up. "Despite the general belief that the problem of AI hallucination can and will get better over time, it appears that the most recent generation of advanced reasoning models may have actually begun to hallucinate more than their simpler counterparts — and there are no agreed-upon explanations for why this is," he said.

The situation is further complicated because it can be very difficult to ascertain how LLMs come up with their answers; a parallel could be drawn here with how we still don't really know, comprehensively, how a human brain works. In a recent essay, Dario Amodei, the CEO of AI company Anthropic, highlighted a lack of understanding in how AIs come up with answers and information. "When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does — why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate," he wrote.

The problems caused by AI hallucinating inaccurate information are already very real, Kazerounian noted. "There is no universal, verifiable way to get an LLM to correctly answer questions being asked about some corpus of data it has access to," he said. "The examples of non-existent hallucinated references, customer-facing chatbots making up company policy, and so on, are now all too common."

Both Kazerounian and Watson told Live Science that, ultimately, AI hallucinations may be difficult to eliminate. But there could be ways to mitigate the issue.

Watson suggested that "retrieval-augmented generation," which grounds a model's outputs in curated external knowledge sources, could help ensure that AI-produced information is anchored by verifiable data.

"Another approach involves introducing structure into the model's reasoning. By prompting it to check its own outputs, compare different perspectives, or follow logical steps, scaffolded reasoning frameworks reduce the risk of unconstrained speculation and improve consistency," Watson said, noting this could be aided by training that shapes a model to prioritize accuracy, and by reinforcement training from human or AI evaluators to encourage an LLM to deliver more disciplined, grounded responses.

RELATED STORIES
—AI benchmarking platform is helping top companies rig their model performances, study claims
—AI can handle tasks twice as complex every few months. What does this exponential growth mean for how we use it?
—What is the Turing test? How the rise of generative AI may have broken the famous imitation game

"Finally, systems can be designed to recognise their own uncertainty. Rather than defaulting to confident answers, models can be taught to flag when they're unsure or to defer to human judgement when appropriate," Watson added.
"While these strategies don't eliminate the risk of confabulation entirely, they offer a practical path forward to make AI outputs more reliable." Given that AI hallucination may be nearly impossible to eliminate, especially in advanced models, Kazerounian concluded that ultimately the information that LLMs produce will need to be treated with the "same skepticism we reserve for human counterparts."

Contradictheory: AI and the next generation

The Star

03-06-2025

  • Lifestyle
  • The Star

Contradictheory: AI and the next generation

Here's a conversation I don't think we'd have heard five years ago: 'You know what they do? They send in their part of the work, and it's so obviously ChatGPT. I had to rewrite the whole thing!'

This wasn't a chat I had with the COO of some major company but with a 12-year-old child. She was talking about a piece of group work they had to do for class. And this Boy, as she called him (you could hear the capitalised italics in her voice), had waited until the last minute to submit his part.

To be honest, I shouldn't be surprised. These days, lots of people use AI in their work. It's normal. According to the 2024 Work Trend Index released by Microsoft and LinkedIn, 75% of employees then used artificial intelligence (AI) to save time and focus on their most important tasks.

But it's not without its problems. An adult using AI to help draft an email is one thing. A student handing in their weekly assignment is another. The adult uses AI to communicate more clearly, but the student is taking a shortcut. So, in an effort to deliver better work, the child might actually be learning less.

And it's not going away. A 2024 study by Impact Research for the Walton Family Foundation found that 48% of students use ChatGPT at least weekly, a jump of 27 percentage points over 2023. And more students use AI chatbots to write essays and assignments (56%) than to study for tests and quizzes (52%).

So what about the other students who don't use AI, like the girl I quoted above? I find they often take a rather antagonistic view. Some kids I talk to (usually the ones already doing well in class) seem to look down on classmates who use AI and, in the process, on using AI to do their homework as well. And I think that's wrong.

As soon as I learned about ChatGPT, I felt that the key to using AI tools well is obvious. It lies in the name: tools. Like a ruler for drawing straight lines, or a dictionary for looking up words, AI chatbots are tools, only far more versatile ones.

One of the biggest problems, of course, is that AI chatbots don't always get their facts right (in AI parlance, they 'hallucinate'). So if you ask one for an essay on the 'fastest marine mammal', there's a chance it'll include references to 'sailfish' and 'peregrine falcon'. In one test of AI chatbots, hallucination rates for newer AI systems were as high as 79%.

Even OpenAI, the company behind ChatGPT, isn't immune. Its o3 release hallucinated 33% of the time in its PersonQA benchmark test, which measures how well it answers questions about public figures. The new o4-mini performed even worse, hallucinating 48% of the time.

There are ways to work around this, but I think most people don't know them. For example, many chatbots now have a 'Deep Research' mode that actively searches the internet and presents answers along with sources. The point is that you, the reasonable, competent, and capable human being, can check the original source to see if it's something you trust. Instead of the machine telling you what it 'knows', it tells you what it found, and it's up to you to verify it.

Another method is to feed the chatbot the materials you want it to use, like a PDF of your textbook or a research paper. Google's NotebookLM is designed for this. It works only with the data you supply, drastically reducing hallucinations. You can then be more sure of the information it produces.
In one stroke, you've turned the chatbot into a hyper-intelligent search engine that not only finds what you're looking for but also understands context, identifies patterns, and helps organise the information.

That's just a small part of what AI can do. But even just helping students find and organise information better is a huge win. And ideally, teachers should lead the charge in classrooms, guiding students on how to work with AI responsibly and effectively. Instead, many feel compelled to ban it or to try to 'AI-proof' assignments, for example, by demanding handwritten submissions or choosing topics that chatbots are more likely to hallucinate on.

But we can do better. We should allow AI in and teach students how to use it in a way that makes them better. For example, teachers could say that the 'slop' AI generates is the bare minimum. Hand it in as-is, and you'll scrape a C or D. But if you use it to refine your thoughts, to polish your voice, to spark better ideas, then that's where the value lies. And students can use it to help them revise by getting it to generate quizzes to test themselves with (they, of course, have to verify that the answers the AI gives are correct).

Nevertheless, what I've written about so far is about using AI as a tool. The future is about using it as a collaborator. Right now, according to the 2025 Microsoft Work Trend Index, 50% of Malaysian workers see AI as a command-based tool, while 48% treat it as a thought partner. The former issue basic instructions; the latter have conversations, and that's where human-machine collaboration begins. The report goes on to say explicitly that this kind of partnership is what all employees should strive for when working with AI.

That means knowing how to iterate on the output given, when to delegate, when to refine the results, and when to push back. In short: the same skills we want kids to learn anyway when working with classmates and teachers.

And the truth is that while I've used AI to find data, summarise reports, and – yes – to proofread this article, I haven't yet actively collaborated with AI. However, the future seems to be heading in that direction. Just a few weeks ago, I wrote about mathematician Terence Tao, who predicts that it won't be long before computer proof assistants powered by AI are cited as co-authors on mathematics papers.

Clearly, I still have a lot to learn about using AI day-to-day. And it's hard. It involves trial and error and wasted effort while battling with looming deadlines. I may deliver inferior work in the meantime that collaborators may have to rewrite. But I remain, as ever, optimistic. Because technology – whether as a tool or a slightly eccentric collaborator – ultimately has the potential to make us and our work better.

Logic is the antithesis of emotion but mathematician-turned-scriptwriter Dzof Azmi's theory is that people need both to make sense of life's vagaries and contradictions. Write to Dzof at lifestyle@ The views expressed here are entirely the writer's own.
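For readers who want to try the "feed it your own materials" approach programmatically rather than through a product like NotebookLM, here is a minimal sketch. It assumes the openai Python package (v1 or later), an OPENAI_API_KEY environment variable, and a model name such as "gpt-4o-mini"; the prompt wording and the quiz format are illustrative choices, not a prescribed method.

```python
# Sketch: ground a chatbot in user-supplied study material and ask it for a quiz.
# Assumes: `pip install openai`, OPENAI_API_KEY set, and access to a chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In practice this would be text extracted from your textbook PDF or notes.
study_material = """
Cheetahs are the fastest land animals, reaching speeds of about 100 km/h.
Peregrine falcons are the fastest birds in a dive, exceeding 300 km/h.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute whatever you have access to
    messages=[
        {
            "role": "system",
            "content": (
                "You are a study assistant. Use ONLY the material provided by the user. "
                "If the material does not contain the answer, say you don't know."
            ),
        },
        {
            "role": "user",
            "content": f"Material:\n{study_material}\n\n"
                       "Write 3 short quiz questions with answers based only on this material.",
        },
    ],
)
print(response.choices[0].message.content)
```

As the column notes, the answers still need to be checked against the source material; constraining the model to supplied text reduces hallucinations but does not eliminate them.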

OpenAI's o3 and o4-mini hallucinate way more than previous models

Yahoo

20-05-2025

  • Yahoo

OpenAI's o3 and o4-mini hallucinate way more than previous models

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1. First reported by TechCrunch, OpenAI's system card detailed the results of the PersonQA evaluation, which is designed to test for hallucinations. From the results of this evaluation, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.

SEE ALSO: All the AI news of the week: ChatGPT debuts o3 and o4-mini, Gemini talks to dolphins

The system card noted how o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."

OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."

However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.

In a statement to Mashable, an OpenAI spokesperson said, "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability."

Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even in how they evaluate models. Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. Hugging Face's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents and found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting o3 and o4-mini weren't included in the current leaderboard.

That's all to say: even industry-standard benchmarks make it difficult to assess hallucination rates. Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But in order to use ChatGPT search, OpenAI shares data with third-party search providers, and Enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.

Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more often than its non-reasoning models, that might be a problem for its users.

UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.
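As a toy illustration of how a hallucination rate like the 33 percent and 48 percent figures above is derived, the sketch below scores a set of graded question-answer records. It is not OpenAI's PersonQA methodology or Hugging Face's summary-based benchmark; the record format and grading labels are assumptions made for the example.

```python
# Toy hallucination-rate calculation over graded evaluation records.
# Each record says whether the model attempted an answer and whether it was correct;
# the schema is an assumption for illustration, not a published benchmark format.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question: str
    attempted: bool   # did the model give an answer rather than abstain?
    correct: bool     # was the answer factually correct, judged against ground truth?

def hallucination_rate(records: list[EvalRecord]) -> float:
    """Fraction of attempted answers that were wrong (treated here as hallucinations)."""
    attempted = [r for r in records if r.attempted]
    if not attempted:
        return 0.0
    wrong = sum(1 for r in attempted if not r.correct)
    return wrong / len(attempted)

if __name__ == "__main__":
    records = [
        EvalRecord("Where was person X born?", attempted=True, correct=True),
        EvalRecord("What year did person Y retire?", attempted=True, correct=False),
        EvalRecord("Which team did person Z coach?", attempted=False, correct=False),
    ]
    print(f"Hallucination rate: {hallucination_rate(records):.0%}")  # 50% of attempts
```

Benchmarks differ on exactly these details: what counts as an attempt, how correctness is judged, and whether abstentions are rewarded, which is one reason the rates reported by OpenAI and by Hugging Face's leaderboard diverge so sharply.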

OpenAI's latest AI models report high 'hallucination' rate: What does it mean — and why is this significant?

Indian Express

15-05-2025

  • Indian Express

OpenAI's latest AI models report high 'hallucination' rate: What does it mean — and why is this significant?

A technical report released by artificial intelligence (AI) research organisation OpenAI last month found that the company's latest models — o3 and o4-mini — generate more errors than its older models. Computer scientists call the errors made by chatbots 'hallucinations'.

The report revealed that o3 — OpenAI's most powerful system — hallucinated 33% of the time when running its PersonQA benchmark test, which involves answering questions about public figures. The o4-mini hallucinated 48% of the time. To make matters worse, OpenAI said it does not even know why these models are hallucinating more than their predecessors.

Here is a look at what AI hallucinations are, why they happen, and why the new report about OpenAI's models is significant.

When the term AI hallucinations began to be used to refer to errors made by chatbots, it had a very narrow definition. It was used to refer to those instances when AI models would give fabricated information as output. For instance, in June 2023, a lawyer in the United States admitted to using ChatGPT to help write a court filing after the chatbot added fake citations to the submission, which pointed to cases that never existed. Today, hallucination has become a blanket term for various types of mistakes made by chatbots. This includes instances when the output is factually correct but not actually relevant to the question that was asked.

ChatGPT, o3, o4-mini, Gemini, Perplexity, Grok and many more are all examples of what are known as large language models (LLMs). These models essentially take in text inputs and generate synthesised outputs in the form of text. LLMs are able to do this as they are built using massive amounts of digital text taken from the Internet. Simply put, computer scientists feed these models a lot of text, helping them identify patterns and relationships within that text, predict text sequences, and produce some output in response to a user's input (known as a prompt).

Note that LLMs are always making a guess while giving an output. They do not know for sure what is true and what is not — these models cannot even fact-check their output against, let's say, Wikipedia like humans can. LLMs 'know what words are and they know which words predict which other words in the context of words. They know what kinds of words cluster together in what order. And that's pretty much it. They don't operate like you and me,' scientist Gary Marcus wrote on his Substack, Marcus on AI.

As a result, when an LLM is trained on, for example, inaccurate text, it gives inaccurate outputs, thereby hallucinating. However, even accurate text cannot stop LLMs from making mistakes. That's because to generate new text (in response to a prompt), these models combine billions of patterns in unexpected ways. So, there is always a possibility that LLMs give fabricated information as output. And as LLMs are trained on vast amounts of data, experts do not understand why they generate a particular sequence of text at a given moment.

Hallucination has been an issue with AI models from the start, and big AI companies and labs, in the initial years, repeatedly claimed that the problem would be resolved in the near future. It did seem possible, as after they were first launched, models tended to hallucinate less with each update. However, after the release of the new report about OpenAI's latest models, it has increasingly become clear that hallucination is here to stay. Also, the issue is not limited to just OpenAI.
Other reports have shown that Chinese startup DeepSeek's R1 model saw double-digit rises in hallucination rates compared with previous models from the company.

This means that the application of AI models has to be limited, at least for now. They cannot be used, for example, as a research assistant (as models create fake citations in research papers) or a paralegal-bot (because models give imaginary legal cases).

Computer scientists like Arvind Narayanan, who is a professor at Princeton University, think that, to some extent, hallucination is intrinsic to the way LLMs work, and that as these models become more capable, people will use them for tougher tasks where the failure rate will be high. In a 2024 interview, he told Time magazine, 'There is always going to be a boundary between what people want to use them [LLMs] for, and what they can work reliably at… That is as much a sociological problem as it is a technical problem. And I do not think it has a clean technical solution.'
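To illustrate the "always making a guess" point above, here is a toy next-token sampler. The bigram probability table is invented for the example; real LLMs learn distributions over tens of thousands of tokens with neural networks, but the principle is the same: the next word is sampled from a probability distribution rather than looked up as a verified fact.

```python
# Toy illustration of next-token prediction: the model samples from a learned
# probability distribution, so every continuation is a statistical guess, not a
# fact lookup. The probabilities below are invented for the example.
import random

# P(next word | current word) for a tiny imaginary "model".
BIGRAMS = {
    "the": {"fastest": 0.5, "cheetah": 0.3, "falcon": 0.2},
    "fastest": {"marine": 0.4, "land": 0.4, "bird": 0.2},
    "marine": {"mammal": 0.7, "animal": 0.3},
    "mammal": {"is": 1.0},
    "is": {"the": 0.6, "a": 0.4},
}

def generate(start: str, length: int = 8, seed: int | None = None) -> str:
    """Sample each next word from the bigram distribution, starting from `start`."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        dist = BIGRAMS.get(words[-1])
        if not dist:
            break  # no known continuation for this word
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

if __name__ == "__main__":
    # Two runs can produce different, equally "confident" continuations.
    print(generate("the", seed=1))
    print(generate("the", seed=2))
```

Because the sampler has no notion of ground truth, a fluent but wrong continuation is just as easy to produce as a correct one, which is essentially what happens, at vastly larger scale, when an LLM hallucinates.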

AI hallucination puts firms at risk? New insurance covers legal costs

Business Standard

12-05-2025

  • Business
  • Business Standard

AI hallucination puts firms at risk? New insurance covers legal costs

Insurers at Lloyd's of London have introduced a new insurance product designed to protect businesses from financial losses arising from artificial intelligence system failures, according to a report by The Financial Times. The insurance, developed by Y Combinator-backed start-up Armilla, provides coverage for legal claims against companies when AI tools generate inaccurate outputs.

The policy offers financial protection against potential legal consequences, including court-awarded damages and associated legal expenses. It responds to rising concerns over AI's tendency to produce unreliable or misleading information — commonly referred to as "hallucinations" in AI terminology. As companies increasingly integrate AI tools to enhance efficiency, they also face growing risks from errors caused by flaws in AI models that lead to hallucinations or fabricated information. Last year, a tribunal ruled that Air Canada must honour a discount its customer service chatbot had wrongly offered.

What is an AI hallucination?

An AI hallucination occurs when an algorithm generates information that appears credible but is actually false or misleading. Computer scientists use the term to describe such errors, which have been seen in various AI tools. These hallucinations can cause significant problems when AI is used in sensitive areas. While some errors are relatively harmless — such as a chatbot giving a wrong answer — others can have serious consequences. In high-stakes settings like legal cases or health insurance decisions, inaccuracies can severely impact people's lives.

Unlike systems that follow strict, human-defined rules, AI models operate based on statistical patterns and probabilities, which makes occasional errors inevitable. Though minor mistakes may not pose a big problem for most users, hallucinations become critical when dealing with legal, medical, or confidential business matters. Karthik Ramakrishnan, Armilla's chief executive, said the new product could encourage more companies to adopt AI by addressing fears that tools like chatbots might break down or make errors.

Hallucinations getting worse despite AI advances

Despite improvements by companies like OpenAI and Google in reducing hallucination rates, the problem has worsened with the introduction of newer reasoning models. OpenAI's internal assessments found that its latest models hallucinate more often than earlier versions. Specifically, OpenAI reported that its most advanced model, o3, produced hallucinations 33 per cent of the time on the PersonQA benchmark, which tests the ability to answer questions about public figures — more than double the rate of its earlier model, o1.
