AI chatbots need more books to learn from. These libraries are opening their stacks

CAMBRIDGE, Mass. (AP) — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, 'Hey, we're a public library,'' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.