
UPDATE – New MLPerf Storage v2.0 Benchmark Results Demonstrate the Critical Role of Storage Performance in AI Training Systems
The benchmark results show that storage system performance continues to improve rapidly, with tested systems serving roughly twice as many accelerators as in the v1.0 benchmark round.
Additionally, the v2.0 benchmark adds new tests that replicate real-world checkpointing for AI training systems. The benchmark results provide essential information for stakeholders who need to configure the frequency of checkpoints to optimize for high performance – particularly at scale.
Version 2.0 adds checkpointing tasks, delivers essential insights
As AI training systems have continued to scale up to billions and even trillions of parameters, and the largest clusters of processors have reached one hundred thousand accelerators or more, system failures have become a prominent technical challenge. Because data centers tend to run accelerators at near-maximum utilization for their entire lifecycle, both the accelerators themselves and the supporting hardware (power supplies, memory, cooling systems, etc.) are heavily stressed, shortening their expected lifetimes. This is a chronic issue, especially in large clusters: if the mean time to failure for an accelerator is 50,000 hours, then a 100,000-accelerator cluster running for extended periods at full utilization will, on average, experience a failure every half-hour. A cluster with one million accelerators would expect a failure every three minutes. Worse, because AI training usually involves massively parallel computation in which all accelerators move in lockstep through the same iteration of training, the failure of a single processor can grind the entire cluster to a halt.
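The arithmetic behind those failure-rate estimates is a simple scaling, assuming independent failures across identically reliable accelerators:

```latex
\[
T_{\text{cluster}} \approx \frac{\mathrm{MTTF}_{\text{accelerator}}}{N}:
\quad
\frac{50{,}000\ \text{h}}{100{,}000} = 0.5\ \text{h},
\qquad
\frac{50{,}000\ \text{h}}{1{,}000{,}000} = 0.05\ \text{h} = 3\ \text{min}.
\]
```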
It is now broadly accepted that saving checkpoints of intermediate training results at regular intervals is essential to keep AI training systems running at high performance. The AI training community has developed mathematical models that can optimize cluster performance and utilization by trading off the overhead of regular checkpoints against the expected frequency and cost of failure recovery (rolling back the computation, restoring the most recent checkpoint, restarting the training from that point, and duplicating the lost work). Those models, however, require accurate data on the scale and performance of the storage systems that are used to implement the checkpointing system.
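The release does not name a specific model, but a widely cited example of this tradeoff is the Young/Daly approximation, which balances the cost of writing a checkpoint against the cluster-level mean time between failures. The sketch below is illustrative only; the checkpoint-cost figures are hypothetical, and the overhead shown counts only checkpoint-writing time, not recovery or rework:

```python
import math

def young_daly_interval(ckpt_write_secs: float, mtbf_secs: float) -> float:
    """Young/Daly approximation of the optimal checkpoint interval:
    tau_opt = sqrt(2 * C * M), where C is the time to write one checkpoint
    and M is the cluster-level mean time between failures."""
    return math.sqrt(2.0 * ckpt_write_secs * mtbf_secs)

# Hypothetical figures for illustration, using the failure math above:
# a per-accelerator MTTF of 50,000 h across 100,000 accelerators gives a
# cluster-level MTBF of 0.5 h (1,800 s).
cluster_mtbf_secs = 50_000 * 3600 / 100_000

for ckpt_secs in (30, 120, 600):  # seconds to write one checkpoint
    tau = young_daly_interval(ckpt_secs, cluster_mtbf_secs)
    # Fraction of wall-clock time spent writing checkpoints
    # (failure recovery and lost-work costs are not included here).
    write_overhead = ckpt_secs / (tau + ckpt_secs)
    print(f"checkpoint write = {ckpt_secs:>3} s -> "
          f"optimal interval ~ {tau / 60:5.1f} min, "
          f"write overhead ~ {write_overhead:.1%}")
```

The trend is the point: the faster the storage system can absorb a checkpoint, the more frequently the cluster can afford to take one, and the less work is lost on each failure.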
The MLPerf Storage v2.0 checkpoint benchmark tests provide precisely that data, and the results from this round suggest that stakeholders procuring AI training systems need to carefully consider the performance of the storage systems they buy, to ensure that they can store and retrieve a cluster's checkpoints without slowing the system down to an unacceptable level. For a deeper understanding of the issues around storage systems and checkpointing, as well as of the design of the checkpointing benchmarks, we encourage you to read this post from Wes Vaske, a member of the MLPerf Storage working group.
'At the scale of computation being implemented for training large AI models, regular component failures are simply a fact of life,' said Curtis Anderson, MLPerf Storage working group co-chair. 'Checkpointing is now a standard practice in these systems to mitigate failures, and we are proud to be providing critical benchmark data on storage systems to allow stakeholders to optimize their training performance. This initial round of checkpoint benchmark results shows us that current storage systems offer a wide range of performance specifications, and not all systems are well-matched to every checkpointing scenario. It also highlights the critical role of software frameworks such as PyTorch and TensorFlow in coordinating training, checkpointing, and failure recovery, as well as some opportunities for enhancing those frameworks to further improve overall system performance.'
Workload benchmarks show rapid innovation in support of larger-scale training systems
Continuing from the v1.0 benchmark suite, the v2.0 suite measures storage performance in a diverse set of ML training scenarios. It emulates the storage demands of several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators' 'think time,' the benchmark can generate accurate storage-access patterns without running the actual training, making the benchmark accessible to all. The test focuses on a given storage system's ability to keep pace, since the simulated accelerators must maintain a required level of utilization.
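To make the 'think time' idea concrete, here is a minimal sketch of the emulation loop. It is a simplification, not the actual MLPerf Storage implementation (the real suite builds on the DLIO tool); read_batch, think_time_secs, and the 90% utilization floor are illustrative assumptions:

```python
import time

def emulate_accelerator(num_batches, read_batch, think_time_secs,
                        min_utilization=0.90):
    """Emulate one accelerator: fetch a batch from the storage system
    under test, then sleep for the time a real accelerator would spend
    computing on that batch (its 'think time'). Utilization is the share
    of wall-clock time spent 'computing' rather than stalled on I/O."""
    io_secs = 0.0
    compute_secs = 0.0
    for batch in range(num_batches):
        start = time.perf_counter()
        read_batch(batch)                 # hits the storage system under test
        io_secs += time.perf_counter() - start
        time.sleep(think_time_secs)       # stand-in for the training step
        compute_secs += think_time_secs
    utilization = compute_secs / (compute_secs + io_secs)
    return utilization, utilization >= min_utilization
```

Because the compute side is just a timed sleep, any utilization shortfall is attributable to the storage system failing to keep the simulated accelerator fed.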
The v2.0 results show that the submitted storage systems can simultaneously support substantially more accelerators – roughly twice as many as the systems in the v1.0 benchmark round.
'Everything is scaling up: models, parameters, training datasets, clusters, and accelerators. It's no surprise to see that storage system providers are innovating to support ever larger scale systems,' said Oana Balmau, MLPerf Storage working group co-chair.
The v2.0 submissions also included a much more diverse set of technical approaches to delivering high-performance storage for AI training, including:
6 local storage solutions;
2 solutions using in-storage accelerators;
13 software-defined solutions;
12 block systems;
16 on-prem shared storage solutions;
2 object stores.
'Necessity continues to be the mother of invention: faced with the need to deliver storage solutions that are both high-performance and at unprecedented scale, the technical community has stepped up once again and is innovating at a furious pace,' said Balmau.
MLPerf Storage v2.0: skyrocketing participation and diversity of submitters
The MLPerf Storage benchmark was created over three years through a collaborative engineering process involving 35 leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.
The v2.0 benchmark results, from a broad set of technology providers, reflect the industry's recognition of the importance of high-performance storage solutions. MLPerf Storage v2.0 includes more than 200 performance results from 26 submitting organizations: Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, KIOXIA, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, WDC, and YanRong. The submitters represent seven different countries, demonstrating the value of the MLPerf Storage benchmark to the global community of stakeholders.
'The MLPerf Storage benchmark has set new records for an MLPerf benchmark, both for the number of organizations participating and the total number of submissions,' said David Kanter, Head of MLPerf at MLCommons. 'The AI community clearly sees the importance of our work in publishing accurate, reliable, unbiased performance data on storage systems, and it has stepped up globally to be a part of it. I would especially like to welcome first-time submitters Alluxio, ExponTech, FarmGPU, H3C, Kingston, KIOXIA, Oracle, Quanta Cloud Technology, Samsung, Sandisk, TTA, UBIX, IBM, and WDC.'
'This level of participation is a game-changer for benchmarking: it enables us to openly publish more accurate and more representative data on real-world systems,' Kanter continued. 'That, in turn, gives the stakeholders on the front lines the information and tools they need to succeed at their jobs. The checkpoint benchmark results are an excellent case in point: now that we can measure checkpoint performance, we can think about optimizing it.'
We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark suite.
View the Results
To view the results for MLPerf Storage v2.0, please visit the Storage benchmark results page.
About MLCommons
MLCommons is the world's leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies' accuracy, safety, speed, and efficiency.
For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email [email protected].