It turns out you can train AI models without copyrighted material
The paper (via The Washington Post) was a collaboration between 14 different institutions. The authors represent universities like MIT, Carnegie Mellon and the University of Toronto. Nonprofits like Vector Institute and the Allen Institute for AI also contributed.
The group built an 8 TB ethically sourced dataset that included a set of 130,000 books in the Library of Congress. They then trained a seven-billion-parameter large language model (LLM) on that data. The result? It performed about as well as Meta's similarly sized Llama 2-7B from 2023. The team didn't publish benchmarks comparing its results to today's top models.
Performance comparable to a two-year-old model wasn't the only downside. The process of putting it all together was also a grind. Much of the data couldn't be read by machines, so humans had to sift through it. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people," co-author Stella Biderman told WaPo. "And that's just really hard." Sorting out the legal details added to the difficulty: the team had to determine which license applied to each website they scanned.
So, what do you do with a less powerful LLM that's much harder to train? If nothing else, it can serve as a counterpoint.
In 2024, OpenAI told a British parliamentary committee that such a model essentially couldn't exist. The company claimed it would be "impossible to train today's leading AI models without using copyrighted materials." Last year, an Anthropic expert witness added, "LLMs would likely not exist if AI firms were required to license the works in their training datasets."
Of course, this study won't change the trajectory of AI companies. After all, more work to create less powerful tools doesn't jibe with their interests. But at least it punctures one of the industry's common arguments. Don't be surprised if you hear about this study again in legal cases and regulation arguments.