AI firms say they can't respect copyright. These researchers tried.

Washington Post, 05-06-2025
Happy Thursday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to: nitasha.tiku@washpost.com.
As the policy debate over AI and fair use heats up, a new paper suggests there's a more transparent — if time-consuming — alternative to slurping up web content without permission.
Top artificial intelligence companies argue that it's impossible to build today's powerful large language models, the GPT in ChatGPT, unless they can freely scrape copyrighted materials from the internet to train their AI systems.
But few AI developers have tried the more ethical route — until now.
A group of more than two dozen AI researchers has found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset's quality by using it to train a 7-billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023.
A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate.
The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools.
That could have implications for the policy debate swirling around AI and copyright.
The paper itself does not take a position on whether scraping text to train AI is fair use.
That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K.
On Wednesday, Reddit said it was suing Anthropic, alleging that it accessed data from the social media discussion board without a licensing agreement, according to The Wall Street Journal. The same day, the U.K.'s House of Commons offered concessions on a controversial bill that would allow AI companies to train on copyrighted material.
These moves follow President Donald Trump's firing last month of the head of the U.S. Copyright Office, Shira Perlmutter. Her ouster brought more attention to the office's recent report on AI, which cast doubt on fair use applying to copyrighted works in generative AI.
AI companies and their investors, meanwhile, have long argued that a better way is not feasible.
In April 2023, Sy Damle, a lawyer representing the venture capital firm Andreessen Horowitz, told the U.S. Copyright Office: 'The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.' Later that year, in comments to the U.K. government, OpenAI said, '[I]t would be impossible to train today's leading AI models without using copyrighted materials.'
And in January 2024, Anthropic's expert witness in a copyright trial asserted that 'the hypothetical competitive market for licenses covering data to train cutting-edge LLMs would be impracticable,' court documents show.
While AI policy papers often discuss the need for more open data and experts argue about whether large language models should be trained on licensed data from publishers, there's little effort to put theory into action, the paper's co-author, Aviya Skowron, head of policy at the nonprofit research institute Eleuther AI, told The Post.
'I would also like those people to get curious about what this task actually entails,' Skowron said.
As it turns out, the task involves a lot of humans.
That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website, a daunting prospect when the industry is rife with improperly licensed data.
'This isn't a thing where you can just scale up the resources that you have available' like access to more computer chips and a fancy web scraper, said Stella Biderman, Eleuther AI's executive director. 'We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard.'
Still, the group managed to unearth new datasets that can be used ethically.
Those include a set of 130,000 English-language books in the Library of Congress, which is nearly double the size of the popular-books dataset Project Gutenberg.
The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning.
Eleuther AI pioneered an analogous open-source effort in 2020, creating an often-cited dataset called the Pile. A site that hosted the dataset had to take it down in 2023 after a Digital Millennium Copyright Act request from the Danish anti-piracy group Rights Alliance, which targeted the fact that the Pile contained Books3, a dataset of books that Meta is being sued over.
The new dataset is called Common Pile v0.1, and the model is called Comma v0.1 — a deliberate reference to the group's belief that they will be able to find more text that is openly licensed or in the public domain that can then be used to train bigger models.
Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models.
The group of authors represented 14 different institutions, including MIT, CMU and the University of Toronto, as well as nonprofits such as the Vector Institute and the Allen Institute for Artificial Intelligence.
Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on.
'Even partial transparency has a huge amount of social value and a moderate amount of scientific value,' she said.
Musk rails against Trump tax bill, calling it a 'disgusting abomination' (Jacob Bogage and Theodoric Meyer)
Federal judge blocks Florida from enforcing social media ban for kids while lawsuit continues (Associated Press)
Apple and Alibaba's AI rollout in China delayed by Trump trade war (Financial Times)
Trump renegotiating Biden-era Chips Act grants, Lutnick says (Reuters)
US removes 'safety' from AI Safety Institute (The Verge)
5 AI bots took our tough reading test. One was smartest — and it wasn't ChatGPT (Geoffrey A. Fowler)
You are hardwired to blindly trust AI. Here's how to fight it. (Shira Ovide)
Reddit sues Anthropic, alleges unauthorized use of site's data (Wall Street Journal)
Amazon to invest $10 billion in North Carolina to expand cloud, AI infrastructure (Reuters)
Germans are buying more electric cars, but not Teslas (New York Times)
Google warns hackers stealing Salesforce data from companies (Bloomberg)
Chinese hacked US Telecom a year before known wireless breaches (Bloomberg)
ChatGPT can now read your Google Drive and Dropbox (The Verge)
Google DeepMind's CEO thinks AI will make humans less selfish (Wired)
The creatives and academics rejecting AI — at work and at home (The Guardian)
That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!