Popular AIs head-to-head: OpenAI beats DeepSeek on sentence-level reasoning
An AI model 'reasons' by breaking down a query into steps and working through them in order. Think of how you learned to solve math word problems in school.
Ideally, to generate citations an AI model would understand the key concepts in a document, generate a ranked list of relevant papers to cite, and provide convincing reasoning for how each suggested paper supports the corresponding text. It would highlight specific connections between the text and the cited research, clarifying why each source matters.
The question is, can today's models be trusted to make these connections and provide clear reasoning that justifies their source choices? The answer goes beyond citation accuracy: it bears on how useful and accurate large language models are for any information retrieval purpose.
I'm a computer scientist. My colleagues (researchers from the AI Institute at the University of South Carolina, Ohio State University and the University of Maryland, Baltimore County) and I have developed the Reasons benchmark to test how well large language models can automatically generate research citations and provide understandable reasoning.
We used the benchmark to compare the performance of two popular AI reasoning models, DeepSeek's R1 and OpenAI's o1. Though DeepSeek made headlines with its stunning efficiency and cost-effectiveness, the Chinese upstart has a way to go to match OpenAI's reasoning performance.
The accuracy of citations has a lot to do with whether the AI model is reasoning about information at the sentence level rather than paragraph or document level. Paragraph-level and document-level citations can be thought of as throwing a large chunk of information into a large language model and asking it to provide many citations.
In this process, the large language model overgeneralizes and misinterprets individual sentences. The user ends up with citations that explain the whole paragraph or document, not the relatively fine-grained information in the sentence.
Further, reasoning suffers when you ask a large language model to read through an entire document. These models mostly rely on memorized patterns, which they are typically better at finding at the beginning and end of longer texts than in the middle. This makes it difficult for them to fully understand all the important information throughout a long document.
Large language models get confused because paragraphs and documents hold a lot of information, which affects citation generation and the reasoning process. Consequently, reasoning from large language models over paragraphs and documents becomes more like summarizing or paraphrasing.
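To make the difference concrete, here is a minimal sketch in Python of the two approaches. The prompt wording is purely illustrative, not the wording used by any particular model or benchmark.

```python
# Illustrative only: contrasting document-level and sentence-level
# citation requests. Prompt wording is hypothetical.
import re

paragraph = (
    "Transformer models underpin modern chatbots. "
    "Attention mechanisms let models weigh context words differently. "
    "Retrieval-augmented generation grounds answers in external documents."
)

# Document-level: one prompt covering everything, which invites
# overgeneralized citations that describe the paragraph as a whole.
doc_prompt = f"Provide citations for the following paragraph:\n{paragraph}"

# Sentence-level: one focused prompt per sentence, so each claim gets
# its own citation and its own justification.
sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
sentence_prompts = [
    f"Cite one paper that supports this sentence and explain why:\n{s}"
    for s in sentences
]
```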
The Reasons benchmark addresses this weakness by examining large language models' citation generation and reasoning.
Following the release of DeepSeek R1 in January 2025, we wanted to examine its accuracy in generating citations and its quality of reasoning and compare it with OpenAI's o1 model. We created a paragraph that had sentences from different sources, gave the models individual sentences from this paragraph, and asked for citations and reasoning.
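For readers curious what that per-sentence step might look like in code, here is a minimal sketch using OpenAI's Python SDK. The model name, prompt wording and output handling are my own assumptions for illustration; they are not the benchmark's actual code.

```python
# A hypothetical per-sentence citation query. Requires the `openai`
# package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def cite_sentence(sentence: str) -> str:
    """Ask the model for a citation plus the reasoning behind it."""
    response = client.chat.completions.create(
        model="o1",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": (
                "Suggest the most relevant research paper to cite for the "
                f"following sentence, and explain why it supports it:\n{sentence}"
            ),
        }],
    )
    return response.choices[0].message.content

print(cite_sentence(
    "Attention mechanisms let models weigh context words differently."
))
```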
To start our test, we developed a small test bed of about 4,100 research articles around four key topics that are related to human brains and computer science: neurons and cognition, human-computer interaction, databases and artificial intelligence. We evaluated the models using two measures: F-1 score, which measures how accurate the provided citations are, and hallucination rate, which measures how often the model's reasoning produces an inaccurate or misleading response.
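As a toy illustration of those two measures, the sketch below compares a model's citations against a known-correct list and counts how often its reasoning was judged inaccurate. The data is invented for demonstration.

```python
# Computing F-1 and hallucination rate over made-up results.

def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of citation precision and recall."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One entry per test sentence: (papers the model cited, papers it
# should have cited, whether its explanation was judged accurate).
results = [
    ({"paper_a", "paper_b"}, {"paper_a"}, True),
    ({"paper_c"}, {"paper_d"}, False),  # wrong citation, hallucinated reasoning
]

mean_f1 = sum(f1_score(p, g) for p, g, _ in results) / len(results)
hallucination_rate = sum(not ok for _, _, ok in results) / len(results)
print(f"F-1: {mean_f1:.2f}, hallucination rate: {hallucination_rate:.0%}")
```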
Our testing revealed significant performance differences between OpenAI o1 and DeepSeek R1 across different scientific domains. OpenAI's o1 did well connecting information between different subjects, such as understanding how research on neurons and cognition connects to human-computer interaction and then to concepts in artificial intelligence, while remaining accurate. Its performance metrics consistently outpaced DeepSeek R1's across all evaluation categories, especially in reducing hallucinations and successfully completing assigned tasks.
OpenAI o1 was better at combining ideas semantically, whereas R1 focused on making sure it generated a response for every attribution task, which in turn increased hallucination during reasoning. OpenAI o1 had a hallucination rate of approximately 35% compared with DeepSeek R1's rate of nearly 85% in the attribution-based reasoning task.
In terms of accuracy and linguistic competence, OpenAI o1 scored about 0.65 on the F-1 measure, which roughly means its citations were right about 65% of the time. It also scored about 0.70 on BLEU, a test of how closely a model's writing matches natural, human-sounding language. These are pretty good scores.
DeepSeek R1 scored lower on both: about 0.35 on the F-1 measure, meaning its citations were right about 35% of the time, and only about 0.2 on BLEU, meaning its writing wasn't as natural-sounding as o1's. Together, the scores show that o1 was better at finding the right sources and at presenting them in clear, natural language.
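For the curious, here is a small example of what BLEU captures, using the NLTK library: the overlap of word sequences between a model's output and a reference text. The sentences are invented; the study's scores came from the benchmark's own references.

```python
# BLEU rewards n-gram overlap with a human-written reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["this paper introduces the attention mechanism used here".split()]
candidate = "this paper introduces the attention mechanism applied here".split()

smoothing = SmoothingFunction().method1  # avoids zero scores on short texts
score = sentence_bleu(reference, candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.2f}")  # closer to 1.0 means closer to the reference
```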
On other benchmarks, DeepSeek R1 performs on par with OpenAI o1 on math, coding and scientific reasoning tasks. But the substantial difference on our benchmark suggests that o1 provides more reliable information, while R1 struggles with factual consistency.
Though we included other models in our comprehensive testing, the performance gap between o1 and R1 specifically highlights the current competitive landscape in AI development, with OpenAI's offering maintaining a significant advantage in reasoning and knowledge integration capabilities.
These results suggest that OpenAI still has a leg up when it comes to source attribution and reasoning, possibly due to the nature and volume of the data it was trained on. The company recently announced its deep research tool, which can create reports with citations, ask follow-up questions and provide reasoning for the generated response.
The jury is still out on the tool's value for researchers, but the caveat remains for everyone: Double-check all citations an AI gives you.
This article is republished from The Conversation, a nonprofit, independent news organization bringing you facts and trustworthy analysis to help you make sense of our complex world. It was written by: Manas Gaur, University of Maryland, Baltimore County
Read more:
Why building big AIs costs billions – and how Chinese startup DeepSeek dramatically changed the calculus
What is an AI agent? A computer scientist explains the next wave of artificial intelligence tools
AI pioneers want bots to replace human teachers – here's why that's unlikely
Manas Gaur receives funding from USISTEF Endowment Fund.
