AI chatbots need more books to learn from. These libraries are opening their stacks

Independent, 12-06-2025
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.

