Data Center Uptime: What It Really Means
Baratunde Cola is the CEO and Founder of Carbice, a company providing high-performance cooling expertise for next-generation technology.
Uptime is a critical metric for any data center, often quoted as "99.9999% uptime." However, operators typically use this percentage to describe the availability of equipment rather than its actual computational performance.
What does data center uptime really mean? Does it mean the hardware's availability, reliability or something else? Claiming "six nines" of uptime is far from the whole picture, and we're seeing a fundamental gap in how uptime is defined and sold in the market.
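To put that figure in perspective, an availability percentage translates directly into a yearly downtime budget. Here is a minimal Python sketch of that arithmetic (the list of percentages is illustrative, not tied to any particular vendor's claim):

```python
# Convert an availability percentage into the downtime it permits per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years for simplicity

def downtime_budget_seconds(availability_pct: float) -> float:
    """Maximum seconds of downtime per year allowed by an availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% availability -> {downtime_budget_seconds(pct):,.1f} seconds of downtime per year")

# 99.9999% ("six nines") allows roughly 31.5 seconds of downtime per year --
# and that budget says nothing about hardware that is up but underperforming.
```

Six nines leaves only about half a minute of downtime per year, which is exactly why conflating equipment availability with computational performance matters.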
One of the main contributors to this gap is inadequate testing, where components are assumed to perform well in real-world conditions simply because they pass controlled tests. Yet those tests often mask design limitations that only surface in unpredictable real-world conditions, leading to performance degradation and unexpected failures.
Misunderstandings like this create a disconnect between what data center vendors promote and what actually matters to businesses that rely on computational uptime to meet their objectives.
The Truth Behind Data Center Uptime And 'Best Performance'
True data center uptime means maintaining reliable performance across the entire ecosystem. Since many factors can hinder this performance, ensuring that systems operate at their highest potential is crucial in supporting business objectives and revenue generation. A well-designed system should maintain its quality and continue to improve over time, much like an aging wine.
Complex liquid-cooled AI servers, for example, have a significantly lower yield during manufacturing compared to typical servers. This is called first-pass yield, or FPY. According to Instrumental, the FPY for these systems "can be as low as 20%, with 50-70% being more typical. For reference high-volume consumer electronics typically drive towards 90%+ FPYs."
A low FPY means units must go through multiple test cycles before passing, which is costly and consumes test capacity, leading to longer development cycles. These issues increase the likelihood of performance degradation over time, even with the most advanced cooling solutions. Data centers rely on hardware functioning properly, as it directly impacts revenue. If the hardware degrades, the system underperforms even when power and cooling systems are functioning as expected, meaning the data center never truly delivers 99.9999% uptime.
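To see why a low FPY is so costly, consider a deliberately simplified model in which every unit has the same chance of passing on each test cycle; the expected number of cycles per unit is then 1/FPY. A rough Python sketch under that assumption:

```python
# Simplified geometric-retest assumption: each test cycle passes a unit with
# probability equal to the first-pass yield, so the expected number of cycles
# per unit is 1 / FPY. Real retest behavior is more complicated, but the trend
# shows how low yields consume test capacity.

def expected_test_cycles(fpy: float) -> float:
    """Expected test cycles per unit when every cycle passes with probability FPY."""
    if not 0 < fpy <= 1:
        raise ValueError("FPY must be between 0 and 1")
    return 1 / fpy

for fpy in (0.90, 0.70, 0.50, 0.20):
    print(f"FPY {fpy:.0%}: ~{expected_test_cycles(fpy):.1f} test cycles per unit on average")

# At a 20% FPY, each unit ties up roughly five test slots instead of one,
# which is where the longer development cycles and higher costs come from.
```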
Performance loss occurs frequently and is challenging to address. According to Statista (registration required), the average annual failure rate of servers inside data centers is 5% in the first year and around 11% by the fourth year. Older systems can also experience multiple issues at once, driving the failure rate significantly higher and underscoring the growing risk of aging hardware.
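Those annual failure rates compound as a fleet ages. Here is a rough Python sketch using the Statista figures quoted above; the rates assumed for years two and three are interpolated purely for illustration and are not from the source:

```python
# Rough illustration of how annual failure rates erode a server fleet over time.
# The 5% (year one) and 11% (year four) figures come from the Statista data
# cited above; the year-two and year-three rates are assumed for illustration.
annual_failure_rates = [0.05, 0.07, 0.09, 0.11]  # years 1 through 4

surviving = 1.0
for year, rate in enumerate(annual_failure_rates, start=1):
    surviving *= 1 - rate
    print(f"After year {year}: about {surviving:.1%} of the original fleet has never failed")

# Even under these simplified assumptions, nearly 30% of servers have failed at
# least once by year four -- before counting units that are still "up" but
# silently degraded.
```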
While most would prefer not to admit to failures, it's important to distinguish between performance degradation and an outright outage. Failure usually refers to system-level issues that redundancy, such as backups or fail-safes, can absorb without significant adjustment.
In contrast, the concept of "best performance" is often misunderstood. Performance should be measured not just by initial peaks but by long-term stability and how systems hold up over time. No winemaker would call a fresh batch of wine "the best." This claim is made after the wine has aged for an extended period.
Just as wine only improves with time, a power or cooling system in a data center performs best when the interface design is reliable and evolves over time. The "best performance" doesn't lose efficiency or effectiveness over time. True system performance is realized only over time, under stress and with proper management.
A Key Issue In The Industry: Design Flaws
Today, many misleading statistics circulate in the industry. While not intentionally deceptive, they often paint an incomplete picture of what can go wrong in data centers.
The difference between performance in controlled testing and actual performance in the field creates a disconnect between what data center vendors promote and what businesses actually need. A statistic from Siemens reveals that 70% of all automotive product recalls in Japan (from 2003 to 2007) stemmed from issues introduced in the design phase. Strenuous environments like those found in data center operations demand the same level of care in the design phase.
A data center ecosystem can no longer afford to overlook the critical importance of robust design and testing practices. A flawed design process, one that validates product performance only under ideal conditions, has contributed to a landscape in which two-thirds of data center outages cost $100,000 or more.
An effective solution is to adopt a rigorous design evaluation approach: testing products against actual use cases rather than simplified laboratory conditions to ensure they perform as expected without unnecessary performance loss. Better performance across the product lifecycle means systems can be deployed faster, stay up consistently and remain stable.
The Role Of Thermal Management In Achieving Real Uptime
Thermal management plays a critical role in ensuring consistent hardware performance. Traditional solutions, such as thermal paste, may show improvement in isolated tests, but they don't scale well for real-world use in data centers. By contrast, aligned carbon nanotube (CNT) technology offers long-term stability and reliable performance, even under extreme workloads, and continues to improve over time.
To achieve true uptime, data centers must invest in better thermal management, power distribution and reliable, serviceable components. Investing in reliable technology, such as high-performance thermal interfaces, ultimately enables businesses to increase performance and profitability.