AI-Generated Medical Podcasts Deceive Even the Experts

Medscape
For the first time, researchers have evaluated the use of artificial intelligence (AI) to generate podcasts from peer-reviewed scientific articles. Using Google's NotebookLM application, the team created podcast scripts based on studies published in the European Journal of Cardiovascular Nursing (EJCN). The results were eye-opening: Half of the authors did not realize the podcast hosts were not human.
The study assessed whether AI could simulate a realistic scientific dialogue between two speakers discussing published research. Findings were presented at this year's Annual Congress of the Association of Cardiovascular Nursing and Allied Professions and simultaneously published in EJCN.
Too Polished to Be Human?
The AI-generated podcasts averaged 10 minutes. Without knowing the content was machine-produced, most authors said their research was summarized clearly, in simple language, and with structured delivery. Some even remarked that the 'hosts' sounded like they had clinical or nursing backgrounds.
But not all feedback was glowing. Several participants felt the delivery was unnaturally smooth — lacking hesitation, repetition, or organic back-and-forth — prompting suspicion of AI involvement. Others flagged mispronounced medical terms and factual errors. One podcast, for example, focused on heart failure diagnosis instead of management. Another spoke exclusively about women, even though the study included men.
Some authors were also distracted by the overly enthusiastic, American-style tone of the narration, with superlatives used to describe modest results. A more academic tone, they suggested, would be more appropriate — particularly if the tool is used for scientific audiences.
Promise for Science Communication
Led by Philip Moons, PhD, from the KU Leuven Department of Public Health and Primary Care, Leuven, Belgium, the researchers created 10 podcasts based on EJCN articles. Despite imperfections, they concluded that 'AI-generated podcasts are able to summarize key findings in an easily understandable and engaging manner.'
'Podcasts were found to be most appropriate for patients and the public but could be useful for researchers and healthcare professionals as well if they were tailored accordingly,' the authors wrote.
'It was striking how accurate the podcasts were in general. Knowing that we are just at the beginning of this kind of AI-generated podcasts, the quality will become better over time, probably within the next few months,' Moons said in a press release. He believes the tool could help researchers more effectively disseminate their work.
Moons got the idea after testing NotebookLM with one of his own papers, shortly after Google launched the feature in September 2024. 'When I did a first test case with one of my own articles, I was flabbergasted by the high quality and how natural it sounded.'
After the podcasts were generated, each running 5 to 17 minutes, the authors of the original articles were asked to evaluate the content through a questionnaire and a 30-minute video interview.
Missing Context but Strong Engagement
All participating authors agreed that the podcasts effectively conveyed the key findings of their research in simple, accessible language. Many also found the conversational format between two 'hosts' made the content more engaging.
Several praised the hosts' professionalism. 'I was curious about their background — it really seemed like they had medical or nursing training,' one author said. However, some were unsettled by the lack of introductory context. The podcasts provided no information about the identity of the speakers or how the audio was produced, leaving listeners uncertain about the source.
Overall, most found the content reliable, though a few pointed out factual errors. One author noted that obesity was described as a 'habit,' potentially misleading listeners by implying it is merely a lifestyle choice.
Despite these issues, half of the authors — one of whom was an AI expert — did not realize the podcasts were machine-generated. Many said they were 'shocked' or 'amazed' by the quality. Most of the participants were regular podcast listeners. Even those who suspected AI involvement were surprised by how natural and fluent the results sounded.
Expanding Research Reach
All authors agreed that future versions should clearly disclose AI involvement. Most also recommended adopting a more academic tone if the target audience includes researchers, along with a greater focus on study methods and limitations.
Although patients and the general public were identified as the primary audience, the researchers noted that AI-generated podcasts could serve as a cost-effective, scalable way for healthcare professionals to stay current with new research. They also suggested the format could help broaden the visibility and reach of scientific publications.
'This could be a sustainable model to get the message out to people who do not typically read scientific journals,' Moons said. Still, he emphasized the need for human oversight 'to add nuance.' He envisions a hybrid model in which AI-generated content is supplemented with human input.
That vision may already be taking shape. The beta version of Google's NotebookLM (currently available only in English) now allows real-time interaction with the AI. After launching a podcast, users can ask questions directly to one of the 'hosts.' The AI generates a spoken response, and the podcast then continues — seamlessly integrating human-machine dialogue.

Related Articles

Why Dell Technologies Shares Are Climbing Today
Yahoo

July 18 - Shares of Dell Technologies (NYSE:DELL) surged about 3% Friday after Bank of America projected the company's earnings per share could nearly double by 2030, driven by rising demand for enterprise and sovereign AI infrastructure. BofA analysts said Dell's EPS may climb to $19.01 by the end of the decade, fueled by expanding AI server deployments and a rebound in cloud-related capital expenditures. Revenue growth is expected to accelerate to 12% annually over the next five years, up from 2% in the previous five. The firm added that AI-related investments could prompt a shift from elevated operating costs to increased capital spending across the enterprise segment. That trend, it noted, may drive stronger operating profit and cash flow through the end of the decade. While margins could be pressured by the mix of AI hardware, Bank of America expects overall profitability to rise as adoption widens. It reiterated a Buy rating on Dell and raised its price target to $165 from $155. This article first appeared on GuruFocus.
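For a rough sense of what those growth rates imply, the sketch below compounds the 2% and 12% annual revenue figures cited above over five years. It is back-of-the-envelope arithmetic for illustration only, not Bank of America's model, and it uses no figures beyond those in the paragraph above.

```python
# Illustrative compounding of the annual revenue growth rates cited in the article
# (2% over the prior five years, a projected 12% over the next five).
def cumulative_multiple(annual_rate: float, years: int = 5) -> float:
    """Total growth multiple after compounding annual_rate for the given number of years."""
    return (1.0 + annual_rate) ** years

print(f" 2%/yr over 5 years -> {cumulative_multiple(0.02):.2f}x revenue")  # ~1.10x
print(f"12%/yr over 5 years -> {cumulative_multiple(0.12):.2f}x revenue")  # ~1.76x
```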

Trane Technologies is a Key Contributor to the IEA's Four-Point Action Plan on Energy Efficiency
Yahoo

SWORDS, IRELAND / July 18, 2025 / Trane Technologies (NYSE:TT), a global climate innovator, was a key contributor to the International Energy Agency's (IEA) CEO Four-Point Action Plan for energy efficiency, published following the IEA's 10th Annual Global Conference. The conference, co-hosted by IEA Executive Director Dr. Fatih Birol and European Commissioner for Energy and Housing Dan Jørgensen, was held in Brussels last month and focused on implementing the COP28 goal of doubling energy efficiency progress this decade.

The plan also followed the company's pledge, signed last month at the conference, to promote business as a key force in implementing energy efficiency technologies. As a signatory of the IEA's CEO Letter of Commitment, Trane Technologies agrees to advance energy efficiency for industrial productivity, decarbonization, and competitiveness.

Speaking at the conference, Jose La Loggia, CEO EMEA at Trane Technologies, stated: "The absurdity of rejecting heat into the atmosphere is a misguided, traditional energy practice that needs to change. The world needs to adopt the readily available sustainable solutions that can integrate heating and cooling systems by reusing waste heat, thereby enabling high levels of energy efficiency."

La Loggia participated in the roundtable where the Brussels CEO Four-Point Action Plan was developed. This strategic plan, launched in partnership with the Energy Efficiency Movement and co-hosted by the European Commission, aims to accelerate public-private collaboration to double energy efficiency progress by 2030 through a systems-perspective approach, shared expertise, risk mitigation, development and innovation, and strengthened industrial competitiveness.

Through its actions at the conference, Trane Technologies affirms its 2030 Sustainability Commitments, which include achieving carbon-neutral operations by 2030 and reducing absolute energy consumption by 10% by 2030 from a 2019 baseline. Explore Trane Technologies' 2024 Sustainability Report to learn how the company is integrating energy-efficient processes into its operations.

# # #

About Trane Technologies
Trane Technologies is a global climate innovator. Through our strategic brands Trane® and Thermo King®, and our portfolio of environmentally responsible products and services, we bring efficient and sustainable climate solutions to buildings, homes and transportation.

The International Energy Agency
The IEA is at the heart of global dialogue on energy, providing authoritative analysis, data, policy recommendations, and real-world solutions to help countries provide secure and sustainable energy for all.

SOURCE: Trane Technologies. View the original press release on ACCESS Newswire.

AI's Achilles Heel—Puzzles Humans Solve in Seconds Often Defy Machines
Scientific American

There are many ways to test the intelligence of an artificial intelligence: conversational fluidity, reading comprehension or mind-bendingly difficult physics. But some of the tests that are most likely to stump AIs are ones that humans find relatively easy, even entertaining. Though AIs increasingly excel at tasks that require high levels of human expertise, this does not mean they are close to attaining artificial general intelligence, or AGI. AGI requires that an AI can take a very small amount of information and use it to generalize and adapt to highly novel situations. This ability, which is the basis for human learning, remains challenging for AIs.

One test designed to evaluate an AI's ability to generalize is the Abstraction and Reasoning Corpus, or ARC: a collection of tiny, colored-grid puzzles that ask a solver to deduce a hidden rule and then apply it to a new grid. Developed by AI researcher François Chollet in 2019, it became the basis of the ARC Prize Foundation, a nonprofit that administers the test, which is now an industry benchmark used by all major AI models. The organization also develops new tests and routinely uses two of them (ARC-AGI-1 and its more challenging successor, ARC-AGI-2). This week the foundation is launching ARC-AGI-3, which is specifically designed for testing AI agents and is based on making them play video games.

Scientific American spoke to ARC Prize Foundation president, AI researcher and entrepreneur Greg Kamradt to understand how these tests evaluate AIs, what they tell us about the potential for AGI and why they are often challenging for deep-learning models even though many humans tend to find them relatively easy. Links to try the tests are at the end of the article.

[An edited transcript of the interview follows.]

What definition of intelligence is measured by ARC-AGI-1?

Our definition of intelligence is your ability to learn new things. We already know that AI can win at chess. We know they can beat Go. But those models cannot generalize to new domains; they can't go and learn English. So what François Chollet made was a benchmark called ARC-AGI: it teaches you a mini skill in the question, and then it asks you to demonstrate that mini skill. We're basically teaching something and asking you to repeat the skill that you just learned. So the test measures a model's ability to learn within a narrow domain. But our claim is that it does not measure AGI because it's still in a scoped domain [in which learning applies to only a limited area]. It measures that an AI can generalize, but we do not claim this is AGI.

How are you defining AGI here?

There are two ways I look at it. The first is more tech-forward, which is "Can an artificial system match the learning efficiency of a human?" Now what I mean by that is after humans are born, they learn a lot outside their training data. In fact, they don't really have training data, other than a few evolutionary priors. So we learn how to speak English, we learn how to drive a car, and we learn how to ride a bike: all these things outside our training data. That's called generalization. When you can do things outside of what you've been trained on, we define that as intelligence.

Now, an alternative definition of AGI that we use is when we can no longer come up with problems that humans can do and AI cannot; that's when we have AGI. That's an observational definition. The flip side is also true: as long as the ARC Prize or humanity in general can still find problems that humans can do but AI cannot, then we do not have AGI. One of the key factors about François Chollet's benchmark... is that we test humans on them, and the average human can do these tasks and these problems, but AI still has a really hard time with it. The reason that's so interesting is that some advanced AIs, such as Grok, can pass any graduate-level exam or do all these crazy things, but that's spiky intelligence. It still doesn't have the generalization power of a human. And that's what this benchmark shows.

How do your benchmarks differ from those used by other organizations?

One of the things that differentiates us is that we require our benchmark to be solvable by humans. That's in opposition to other benchmarks, where they do "Ph.D.-plus-plus" problems. I don't need to be told that AI is smarter than me; I already know that OpenAI's o3 can do a lot of things better than me, but it doesn't have a human's power to generalize. That's what we measure on, so we need to test humans. We actually tested 400 people on ARC-AGI-2. We got them in a room, we gave them computers, we did demographic screening, and then gave them the test. The average person scored 66 percent on ARC-AGI-2. Collectively, though, the aggregated responses of five to 10 people will contain the correct answers to all the questions on ARC-AGI-2.

What makes this test hard for AI and relatively easy for humans?

There are two things. Humans are incredibly sample-efficient with their learning, meaning they can look at a problem and with maybe one or two examples, they can pick up the mini skill or transformation and they can go and do it. The algorithm that's running in a human's head is orders of magnitude better and more efficient than what we're seeing with AI right now.

What is the difference between ARC-AGI-1 and ARC-AGI-2?

So ARC-AGI-1, François Chollet made that himself. It was about 1,000 tasks. That was in 2019. He basically did the minimum viable version in order to measure generalization, and it held for five years because deep learning couldn't touch it at all. It wasn't even getting close. Then reasoning models that came out in 2024, by OpenAI, started making progress on it, which showed a step-level change in what AI could do. Then, when we went to ARC-AGI-2, we went a little bit further down the rabbit hole in regard to what humans can do and AI cannot. It requires a little bit more planning for each task. So instead of getting solved within five seconds, humans may be able to do it in a minute or two. There are more complicated rules, and the grids are larger, so you have to be more precise with your answer, but it's the same concept, more or less.... We are now launching a developer preview for ARC-AGI-3, and that's completely departing from this format. The new format will actually be interactive. So think of it more as an agent benchmark.

How will ARC-AGI-3 test agents differently compared with previous tests?

If you think about everyday life, it's rare that we have a stateless decision. When I say stateless, I mean just a question and an answer. Right now all benchmarks are more or less stateless benchmarks. If you ask a language model a question, it gives you a single answer. There's a lot that you cannot test with a stateless benchmark. You cannot test planning. You cannot test exploration. You cannot test intuiting about your environment or the goals that come with that. So we're making 100 novel video games that we will use to test humans to make sure that humans can do them, because that's the basis for our benchmark. And then we're going to drop AIs into these video games and see if they can understand this environment that they've never seen beforehand. To date, with our internal testing, we haven't had a single AI be able to beat even one level of one of the games.

Can you describe the video games here?

Each "environment," or video game, is a two-dimensional, pixel-based puzzle. These games are structured as distinct levels, each designed to teach a specific mini skill to the player (human or AI). To successfully complete a level, the player must demonstrate mastery of that skill by executing planned sequences of actions.

How is using video games to test for AGI different from the ways that video games have previously been used to test AI systems?

Video games have long been used as benchmarks in AI research, with Atari games being a popular example. But traditional video game benchmarks face several limitations. Popular games have extensive training data publicly available, lack standardized performance evaluation metrics and permit brute-force methods involving billions of simulations. Additionally, the developers building AI agents typically have prior knowledge of these games, unintentionally embedding their own insights into the solutions.
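To make the grid-puzzle format concrete, the short Python sketch below shows the shape of an ARC-style task, assuming the publicly released ARC task format (a JSON object with "train" and "test" lists of input/output grids, where each cell is an integer color code from 0 to 9). The toy task, the candidate rules and the brute-force check are illustrative placeholders, not the ARC Prize Foundation's data or evaluation code; they simply mirror Kamradt's description of learning a mini skill from a few demonstrations and reapplying it.

```python
import json

# A toy ARC-style task: a few "train" demonstrations of a hidden transformation,
# plus a "test" input that the solver must transform the same way.
TASK = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
    {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]}
  ],
  "test": [
    {"input": [[0, 0], [3, 0]]}
  ]
}
""")

# Two hypothetical candidate rules; real solvers search a far richer space of programs.
def identity(grid):
    return [row[:] for row in grid]

def rotate_180(grid):
    return [row[::-1] for row in grid[::-1]]

CANDIDATES = [identity, rotate_180]

def fits_all_train_pairs(rule, task):
    """Keep a rule only if it reproduces every demonstrated output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

# Pick the first rule consistent with the demonstrations and apply it to the test input.
for rule in CANDIDATES:
    if fits_all_train_pairs(rule, TASK):
        prediction = rule(TASK["test"][0]["input"])
        print(f"{rule.__name__} fits the demonstrations; prediction: {prediction}")
        break
else:
    print("No candidate rule explains the training pairs.")
```

In this toy case the 180-degree rotation is the only candidate consistent with both demonstrations, so it is applied to the test grid; the benchmark's difficulty for AI lies in inferring such rules from one or two examples across tasks that were never seen in training.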
