What Happens When LLMs Run Out Of Useful Data?

Forbes, 25-06-2025
There is one obvious solution to a looming shortage of written content: have LLMs generate more of it.
By SAP Insights Team
Most of us feel like we're drowning in data. And yet, in the world of generative AI, a looming data shortage is keeping some researchers up at night.
A 2024 report from the nonprofit watchdog Epoch AI projected that large language models (LLMs) could run out of fresh, human-generated training data as soon as 2026. Earlier this year, the ubiquitous Elon Musk declared that 'the cumulative sum of human knowledge has been exhausted in AI training,' and that the doomsday scenario envisioned by some AI researchers 'happened basically last year.'
GenAI's breakthroughs in power and sophistication have generally relied on ever-larger training datasets. Beneath the flurry of investment, adoption, and general GenAI activity, a quiet concern has surfaced: What if the fuel driving all this progress is running low?
There is one obvious solution to a looming shortage of written content: have LLMs generate more of it.
The role of synthetic data in large language models
Synthetic data is computer-generated information that has the same statistical properties and patterns as real data but doesn't include real-world records. Amazon recently had success using this method with LLM-generated pairs of questions and answers to fine-tune a customer service model. Because the task was narrow and the outputs were easily reviewed by human beings, the additional training on synthetic data helped the model get better at responding accurately to customer inquiries, even in scenarios it hadn't seen before.
Another use case for synthetic data is for businesses using proprietary data to train bespoke LLMs—whether building them from scratch or, more commonly, layering retrieval-augmented generation (RAG) atop a commercial foundation model. In many such cases, the proprietary data involved is tightly structured, such as with historical transaction records formatted like spreadsheets with dates, locations, and dollar amounts. In contexts like these, LLM-generated synthetic data is often indistinguishable from the real thing and just as effective for training.
But in less narrowly defined training scenarios, specifically the development of those big commercial models RAG relies on, the risks of training on synthetic data are real.
The most widely cited danger has the dramatic name 'model collapse.' In a 2024 study published in Nature, researchers showed that when models are repeatedly trained on synthetic data generated by other models, they gradually lose diversity and accuracy, drifting further from the true distribution of real-world data until they can no longer produce reliably useful output.
Mohan Shekar, SAP's AI and quantum adoption lead for cloud-based ERP, likens the process to 'model incest.' With every successive iteration, a model trained on its own output will tend to reinforce biases and flaws that may at first have been barely noticeable, until those minor defects become debilitating deformities.
Long before reaching these extreme states, models trained with synthetic data have also been shown to exhibit a dullness and predictability reflecting their lack of fresh input. Such models may still have their uses, especially for mundane work and applications, but as Shekar puts it, 'If you're trying to innovate—really innovate—[a synthetic-data–trained model] won't get you there. It's just remixing what you already had.'
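The feedback loop behind model collapse can be illustrated with a toy simulation. This is only a sketch under loud assumptions: a one-dimensional Gaussian stands in for the "model," fitting its mean and standard deviation stands in for "training," and the function name, sample size, and generation count are all hypothetical choices made for illustration. It is not the method used in the Nature study.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the real data distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate" synthetic data from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then "train" the next model on nothing but that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.6f}")
```

Because each generation sees only a small sample of the previous one, estimation errors compound and the fitted spread tends to shrink toward zero. That is a rough analogue of the lost diversity described above, though real language models are vastly more complex than this toy.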
Some researchers, including OpenAI CEO Sam Altman, have long argued that innovation in how models are trained may soon start to matter more than what they're trained on. The next wave of breakthroughs, the thinking goes, may come from rethinking the architecture and logic of training itself and then applying those new ideas.
Yaad Oren, head of research and innovation at SAP, is confident that such a shift is underway. Recent advances in training methods already mean 'you can shrink the amount of data needed to build a robust product,' he says.
One of those recent advances is multimodal training: building models that learn not just from text but also from video, audio, and other inputs. These models can effectively multiply one dataset by another, combining different types of information to create new datasets.
Oren gives the example of voice recognition in cars during a rainstorm. For car manufacturers trying to train an LLM to understand and follow spoken natural-language instructions from a driver, rain in the background presents a hurdle. One unwieldy solution, says Oren, would be to 'record millions of hours of people talking in the rain' to familiarize the model with the sound waves produced by a person asking for directions in a torrential downpour.
More elegant and practical, though, is to combine an existing dataset of human speech with existing datasets of 'different rain and weather sounds,' he says. The result is a model that can decipher speech across a full range of meteorological backdrops—without ever having encountered the combination firsthand.
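The idea of combining a speech dataset with a weather-sound dataset can be sketched as simple audio mixing. This is a minimal illustration, not a description of how SAP or any car manufacturer builds training sets. It assumes clips are plain lists of samples, and the helper names (power, mix_at_snr, augment) and the signal-to-noise-ratio (SNR) levels are hypothetical.

```python
import itertools
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech so the result has the requested SNR in dB."""
    # Loop the noise clip if it is shorter than the speech clip.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (noise * repeats)[: len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def augment(speech_clips, noise_clips, snr_levels_db):
    """Pair every speech clip with every noise clip at every SNR level."""
    return [
        mix_at_snr(speech, noise, snr_db)
        for speech, noise, snr_db in itertools.product(
            speech_clips, noise_clips, snr_levels_db
        )
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for real recordings: sine tones as "speech", random noise as "rain".
    speech_clips = [[math.sin(0.05 * f * t) for t in range(800)] for f in (1, 2, 3)]
    noise_clips = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(4)]
    mixed = augment(speech_clips, noise_clips, snr_levels_db=[0, 10, 20])
    print(len(mixed))  # 3 speech x 4 noise x 3 SNR levels = 36 examples
```

The multiplication is literal here: three speech clips and four noise clips at three loudness levels yield 36 training examples, none of which had to be recorded in an actual storm.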
Even more promising is the potential impact of quantum computing on model training. 'What quantum brings in,' says Shekar, 'is a way to look at all the possible options that exist within your datasets and derive patterns, connections, and possibilities that were not visible before.'
Quantum computing could even increase the total supply of usable data by accessing the vast, underutilized oceans of so-called unstructured data, says Shekar. 'Instead of needing 50 labeled images to train a model,' he says, 'you might be able to throw in 5,000 unlabeled ones and still get a more accurate result.'
That could be a very big deal indeed. AI engineers have long had the same feelings about unstructured data that physicists have about dark matter: an exquisite blend of awe, annoyance, and yearning. If quantum computing finally unlocks it, especially in tandem with multimodal learning and other innovations, today's fears of a data drought might recede.
A version of this story appears on SAP.com