
Latest news with #UwePeters

AI chatbots oversimplify scientific studies and gloss over critical details — the newest models are especially guilty

Yahoo

05-07-2025

  • Science
  • Yahoo

AI chatbots oversimplify scientific studies and gloss over critical details — the newest models are especially guilty

Large language models (LLMs) are becoming less "intelligent" in each new version as they oversimplify and, in some cases, misrepresent important scientific and medical findings, a new study has found.

Scientists discovered that versions of ChatGPT, Llama and DeepSeek were five times more likely to oversimplify scientific findings than human experts in an analysis of 4,900 summaries of research papers. When given a prompt for accuracy, chatbots were twice as likely to overgeneralize findings than when prompted for a simple summary. The testing also revealed an increase in overgeneralizations among newer chatbot versions compared to previous generations. The researchers published their findings April 30 in the journal Royal Society Open Science.

"I think one of the biggest challenges is that generalization can seem benign, or even helpful, until you realize it's changed the meaning of the original research," study author Uwe Peters, a postdoctoral researcher at the University of Bonn in Germany, wrote in an email to Live Science. "What we add here is a systematic method for detecting when models generalize beyond what's warranted in the original text."

It's like a photocopier with a broken lens that makes each subsequent copy bigger and bolder than the original. LLMs filter information through a series of computational layers, and along the way some information can be lost or change meaning in subtle ways. This is especially true of scientific studies, since scientists must frequently include qualifications, context and limitations in their research results, and producing a simple yet accurate summary of findings becomes quite difficult.

"Earlier LLMs were more likely to avoid answering difficult questions, whereas newer, larger, and more instructible models, instead of refusing to answer, often produced misleadingly authoritative yet flawed responses," the researchers wrote.

In one example from the study, DeepSeek produced a medical recommendation in a summary by changing the phrase "was safe and could be performed successfully" to "is a safe and effective treatment option." Another test showed that Llama broadened the scope of effectiveness for a drug treating type 2 diabetes in young people by eliminating information about the dosage, frequency and effects of the medication. If published, such a chatbot-generated summary could lead medical professionals to prescribe drugs outside their effective parameters.

In the new study, the researchers set out to answer three questions about 10 of the most popular LLMs (four versions of ChatGPT, three versions of Claude, two versions of Llama, and one of DeepSeek): when presented with a human summary of an academic journal article and prompted to summarize it, would the LLM overgeneralize the summary; if so, would asking it for a more accurate answer yield a better result; and would the LLMs overgeneralize more than humans do.

The findings revealed that LLMs, with the exception of Claude, which performed well on all testing criteria, were twice as likely to produce overgeneralized results when given a prompt for accuracy. LLM summaries were nearly five times more likely than human-generated summaries to render generalized conclusions.
The researchers also noted that the most common overgeneralizations occurred when LLMs turned quantified data into generic statements, and that these were the most likely to produce unsafe treatment recommendations. Such shifts can introduce bias, according to experts at the intersection of AI and healthcare.

"This study highlights that biases can also take more subtle forms, like the quiet inflation of a claim's scope," Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI technology company, told Live Science in an email. "In domains like medicine, LLM summarization is already a routine part of workflows. That makes it even more important to examine how these systems perform and whether their outputs can be trusted to represent the original evidence faithfully."

Such findings should prompt developers to create workflow guardrails that identify oversimplifications and omissions of critical information before putting outputs into the hands of public or professional groups, Rollwage said.

While comprehensive, the study had limitations; future work would benefit from extending the testing to other scientific tasks and non-English texts, and from testing which types of scientific claims are more prone to overgeneralization, said Patricia Thaine, co-founder and CEO of Private AI, an AI development company. Rollwage also noted that "a deeper prompt engineering analysis might have improved or clarified results," while Peters sees larger risks on the horizon as our dependence on chatbots grows.

"Tools like ChatGPT, Claude and DeepSeek are increasingly part of how people understand scientific findings," he wrote. "As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure."

For other experts in the field, the problem lies in deploying these systems without specialized knowledge and protections. "Models are trained on simplified science journalism rather than, or in addition to, primary sources, inheriting those oversimplifications," Thaine wrote to Live Science. "But, importantly, we're applying general-purpose models to specialized domains without appropriate expert oversight, which is a fundamental misuse of the technology which often requires more task-specific training."
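Rollwage's call for workflow guardrails can be made concrete with a simple illustration. The Python sketch below is not the study's published method; the regular-expression patterns and the flags_scope_inflation helper are hypothetical, chosen only to mirror the DeepSeek example above, where a hedged past-tense finding ("was safe and could be performed successfully") became a generic present-tense claim ("is a safe and effective treatment option").

```python
import re

# Hypothetical patterns for illustration: generic present-tense claim language
# versus the hedged, past-tense phrasing typical of study abstracts.
GENERIC_PRESENT = re.compile(
    r"\b(?:is|are)\b\s+(?:a\s+)?(?:safe|effective|beneficial)", re.IGNORECASE
)
HEDGED_PAST = re.compile(r"\b(?:was|were|could be|appeared)\b", re.IGNORECASE)


def flags_scope_inflation(source_text: str, summary_text: str) -> bool:
    """Flag a summary that states a generic present-tense claim the source
    only made in hedged, past-tense form."""
    summary_generic = bool(GENERIC_PRESENT.search(summary_text))
    source_generic = bool(GENERIC_PRESENT.search(source_text))
    source_hedged = bool(HEDGED_PAST.search(source_text))
    return summary_generic and not source_generic and source_hedged


if __name__ == "__main__":
    source = "The procedure was safe and could be performed successfully in this cohort."
    summary = "The procedure is a safe and effective treatment option."
    print(flags_scope_inflation(source, summary))  # True: flag for human review
```

A real guardrail would need much richer linguistic checks and human review, but the principle is the same: compare the claims in a generated summary against the hedging present in the original evidence before the summary reaches readers.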

AI makes science easy, but is it getting it right? Study warns LLMs are oversimplifying critical research

Time of India

05-07-2025

  • Health
  • Time of India

AI makes science easy, but is it getting it right? Study warns LLMs are oversimplifying critical research

In a world where AI tools have become daily companions that summarize articles, simplify medical research, and even draft professional reports, a new study is raising red flags. As it turns out, some of the most popular large language models (LLMs), including ChatGPT, Llama, and DeepSeek, might be doing too good a job at being too simple, and not in a good way.

According to a study published in the journal Royal Society Open Science and reported by Live Science, researchers discovered that newer versions of these AI models are not only more likely to oversimplify complex information but may also distort critical scientific findings. Their attempts to be concise are sometimes so sweeping that they risk misinforming healthcare professionals, policymakers, and the general public.

From Summarizing to Misleading

Led by Uwe Peters, a postdoctoral researcher at the University of Bonn, the study evaluated over 4,900 summaries generated by ten of the most popular LLMs, including four versions of ChatGPT, three of Claude, two of Llama, and one of DeepSeek. These were compared against human-generated summaries of academic research.

The results were stark: chatbot-generated summaries were nearly five times more likely than human ones to overgeneralize the findings. And when prompted to prioritize accuracy over simplicity, the chatbots didn't get better; they got worse. In fact, they were twice as likely to produce misleading summaries when specifically asked to be precise.

'Generalization can seem benign, or even helpful, until you realize it's changed the meaning of the original research,' Peters explained in an email to Live Science. What's more concerning is that the problem appears to be growing. The newer the model, the greater the risk of confidently delivered but subtly incorrect information.

When a Safe Study Becomes a Medical Directive

In one striking example from the study, DeepSeek transformed a cautious phrase, 'was safe and could be performed successfully', into a bold and unqualified medical recommendation: 'is a safe and effective treatment option.' Another summary by Llama eliminated crucial qualifiers around the dosage and frequency of a diabetes drug, potentially leading to dangerous misinterpretations if used in real-world medical settings.

Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI firm, warned that 'biases can also take more subtle forms, like the quiet inflation of a claim's scope.' He added that AI summaries are already integrated into healthcare workflows, making accuracy all the more critical.

Why Are LLMs Getting This So Wrong?

Part of the issue stems from how LLMs are trained. Patricia Thaine, co-founder and CEO of Private AI, points out that many models learn from simplified science journalism rather than from peer-reviewed academic papers. This means they inherit and replicate those oversimplifications, especially when tasked with summarizing already simplified content.

Even more critically, these models are often deployed across specialized domains like medicine and science without any expert supervision. 'That's a fundamental misuse of the technology,' Thaine told Live Science, emphasizing that task-specific training and oversight are essential to prevent real-world harm.
The Bigger Problem with AI and Science

Peters likens the issue to a faulty photocopier: each copy loses a little more detail until what's left barely resembles the original. LLMs process information through complex computational layers, often trimming the nuanced limitations and context that are vital in scientific literature.

Earlier versions of these models were more likely to refuse to answer difficult questions. Ironically, as newer models have become more capable and 'instructable,' they've also become more confidently wrong. 'As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure,' Peters cautioned.

Guardrails, Not Guesswork

While the study's authors acknowledge some limitations, including the need to expand testing to non-English texts and different types of scientific claims, they insist the findings should be a wake-up call. Developers need to create workflow safeguards that flag oversimplifications and prevent incorrect summaries from being mistaken for vetted, expert-approved conclusions.

In the end, the takeaway is clear: as impressive as AI chatbots may seem, their summaries are not infallible, and when it comes to science and medicine, there's little room for error masked as simplicity. Because in the world of AI-generated science, a few extra words, or missing ones, can mean the difference between informed progress and dangerous misinformation.

AI Models like ChatGPT and DeepSeek frequently exaggerate scientific findings, study reveals

Time of India

15-05-2025

  • Science
  • Time of India

AI Models like ChatGPT and DeepSeek frequently exaggerate scientific findings, study reveals

According to a new study published in the journal Royal Society Open Science, large language models (LLMs) such as ChatGPT and DeepSeek often exaggerate scientific findings while summarising research papers. Researchers Uwe Peters from Utrecht University and Benjamin Chin-Yee from Western University and the University of Cambridge analysed 4,900 AI-generated summaries from ten leading LLMs. Their findings revealed that up to 73 percent of summaries contained overgeneralised or inaccurate conclusions. Surprisingly, the problem worsened when users explicitly prompted the models to prioritise accuracy, and newer models like ChatGPT-4 performed worse than older versions.

What are the findings of the study

The study assessed how accurately leading LLMs summarised abstracts and full-length articles from prestigious science and medical journals, including Nature, Science, and The Lancet. Over a period of one year, the researchers collected and analysed 4,900 summaries generated by AI systems such as ChatGPT, Claude, DeepSeek, and LLaMA. Six out of ten models routinely exaggerated claims, often by changing cautious, study-specific statements like 'The treatment was effective in this study' into broader, definitive assertions like 'The treatment is effective.' These subtle shifts in tone and tense can mislead readers into thinking that scientific findings apply more broadly than they actually do.

Why are these exaggerations happening

The tendency of AI models to exaggerate scientific findings appears to stem from both the data they are trained on and the behaviour they learn from user interactions. According to the study's authors, one major reason is that overgeneralisations are already common in scientific literature. When LLMs are trained on this content, they learn to replicate the same patterns, often reinforcing existing flaws rather than correcting them. Another contributing factor is user preference. Language models are optimised to generate responses that sound helpful, fluent, and widely applicable. As co-author Benjamin Chin-Yee explained, the models may learn that generalisations are more pleasing to users, even if they distort the original meaning. This results in summaries that may appear authoritative but fail to accurately represent the complexities and limitations of the research.

Accuracy prompts backfire

Contrary to expectations, prompting the models to be more accurate actually made the problem worse. When instructed to avoid inaccuracies, the LLMs were nearly twice as likely to produce summaries with exaggerated or overgeneralised conclusions than when given a simple, neutral prompt. 'This effect is concerning,' said Peters. 'Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they'll get a more reliable summary. Our findings prove the opposite.'

Humans still do better

To compare AI and human performance directly, the researchers analysed summaries written by people alongside those generated by chatbots. The results showed that AI was nearly five times more likely to make broad generalisations than human writers. This gap underscores the need for careful human oversight when using AI tools in scientific or academic contexts.
Recommendations for safer use

To mitigate these risks, the researchers recommend using models like Claude, which demonstrated the highest generalisation accuracy in their tests. They also suggest setting LLMs to a lower "temperature" to reduce creative embellishments and using prompts that encourage past-tense, study-specific reporting. 'If we want AI to support science literacy rather than undermine it,' Peters noted, 'we need more vigilance and testing of these systems in science communication contexts.'
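As a rough illustration of those two suggestions, lowering the temperature and prompting for past-tense, study-specific reporting, here is a minimal Python sketch assuming the OpenAI Python client. The model name and prompt wording are placeholders for illustration, not the researchers' exact setup, and the same idea applies to other providers' APIs.

```python
# Minimal sketch, assuming the OpenAI Python client and an OPENAI_API_KEY
# environment variable. Shows a low temperature plus an instruction to keep
# summaries past-tense and limited to the studied population.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Summarize the abstract below. Report findings in the past tense, "
    "limited to the studied population and conditions. Do not restate "
    "results as general present-tense claims."
)


def cautious_summary(abstract: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",        # placeholder model name; any chat model works
        temperature=0.2,       # lower temperature to curb creative embellishment
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content
```

Prompting and decoding settings like these reduce, but do not eliminate, the risk of overgeneralisation, which is why the researchers still call for testing and human oversight in science communication contexts.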
