Red Hat launches enterprise AI inference server for hybrid cloud


Techday NZ · 21-05-2025

Red Hat has introduced Red Hat AI Inference Server, an enterprise-grade offering aimed at enabling generative artificial intelligence (AI) inference across hybrid cloud environments.
Red Hat AI Inference Server builds on the vLLM community project, originally started at the University of California, Berkeley. Through Red Hat's integration of Neural Magic technologies, the solution aims to deliver higher speed, improved efficiency across a range of AI accelerators, and reduced operational costs. The platform is designed to let organisations run generative AI models on any AI accelerator within any cloud infrastructure.
The solution can be deployed as a standalone containerised offering or as part of Red Hat Enterprise Linux AI (RHEL AI) and Red Hat OpenShift AI. Red Hat says this approach is intended to empower enterprises to deploy and scale generative AI in production with increased confidence.
Joe Fernandes, Vice President and General Manager for Red Hat's AI Business Unit, commented on the launch: "Inference is where the real promise of gen AI is delivered, where user interactions are met with fast, accurate responses delivered by a given model, but it must be delivered in an effective and cost-efficient way. Red Hat AI Inference Server is intended to meet the demand for high-performing, responsive inference at scale while keeping resource demands low, providing a common inference layer that supports any model, running on any accelerator in any environment."
The inference phase in AI refers to the process where pre-trained models are used to generate outputs, a stage which can be a significant inhibitor to performance and cost efficiency if not managed appropriately. The increasing complexity and scale of generative AI models have highlighted the need for robust inference solutions capable of handling production deployments across diverse infrastructures.
The Red Hat AI Inference Server builds on the technology foundation established by the vLLM project. vLLM is known for high-throughput AI inference, support for large input contexts, acceleration across multiple GPUs, and continuous batching to enhance deployment versatility. Additionally, vLLM supports a broad range of publicly available models, including DeepSeek, Google's Gemma, Llama, Llama Nemotron, Mistral, and Phi, among others. Its integration with leading models and enterprise-grade reasoning capabilities positions it as a candidate standard for AI inference.
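Continuous batching, one of the vLLM techniques mentioned above, admits waiting requests into the running batch as soon as earlier sequences finish, rather than holding the whole batch until its longest sequence completes. A minimal pure-Python sketch of the scheduling idea (this is an illustration only, not vLLM's actual implementation; the request lengths and batch size are made up):

```python
# Toy comparison of continuous vs. static batching for LLM decoding.
# Each "step" generates one token for every request currently in the batch.
from collections import deque

def continuous_batching_steps(lengths, max_batch=2):
    """Decode steps when freed batch slots are refilled immediately."""
    waiting = deque(lengths)
    running = {}  # request id -> tokens still to generate
    next_id, steps = 0, 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running[next_id] = waiting.popleft()
            next_id += 1
        steps += 1
        # Every running request emits a token; finished ones leave the batch.
        running = {rid: left - 1 for rid, left in running.items() if left > 1}
    return steps

def static_batching_steps(lengths, max_batch=2):
    """Classic static batching: each batch blocks on its longest request."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))
```

With requests needing 5, 1, 4, and 2 tokens and a batch size of 2, the continuous scheduler finishes in fewer total decode steps than the static one, which is the throughput win the article alludes to.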
The packaged enterprise offering delivers a supported and hardened distribution of vLLM, along with several additional tools. These include intelligent large language model (LLM) compression utilities that reduce model sizes while preserving or even enhancing accuracy, and an optimised model repository hosted under Red Hat AI on Hugging Face. The repository provides instant access to validated and optimised AI models tailored for inference, designed to improve efficiency by two to four times without compromising accuracy.
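Compression utilities of this kind typically rely on quantization: storing weights at lower numeric precision. A toy sketch of symmetric int8 post-training quantization, purely illustrative and not Red Hat's or Neural Magic's actual tooling:

```python
# Toy symmetric int8 quantization: fp32 weights -> int8 plus one scale.
# Storing 8-bit integers instead of 32-bit floats cuts weight memory ~4x.

def quantize_int8(weights):
    """Map floats to [-127, 127] integers with a shared per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 0.31]          # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error is bounded by half the quantization scale, which is why well-chosen low-precision formats can shrink models substantially with little accuracy loss.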
Red Hat also provides enterprise support, drawing upon expertise in bringing community-developed technologies into production. For expanded deployment options, the Red Hat AI Inference Server can be run on non-Red Hat Linux and Kubernetes platforms in line with the company's third-party support policy.
The company's stated vision is to enable a universal inference platform that can accommodate any model, run on any accelerator, and be deployed in any cloud environment. Red Hat sees the success of generative AI relying on the adoption of such standardised inference solutions to ensure consistent user experiences without increasing costs.
Ramine Roane, Corporate Vice President of AI Product Management at AMD, said: "In collaboration with Red Hat, AMD delivers out-of-the-box solutions to drive efficient generative AI in the enterprise. Red Hat AI Inference Server enabled on AMD Instinct™ GPUs equips organizations with enterprise-grade, community-driven AI inference capabilities backed by fully validated hardware accelerators."
Jeremy Foster, Senior Vice President and General Manager at Cisco, commented on the joint opportunities provided by the offering: "AI workloads need speed, consistency, and flexibility, which is exactly what the Red Hat AI Inference Server is designed to deliver. This innovation offers Cisco and Red Hat opportunities to continue to collaborate on new ways to make AI deployments more accessible, efficient and scalable—helping organizations prepare for what's next."
Intel's Bill Pearson, Vice President of Data Center & AI Software Solutions and Ecosystem, said: "Intel is excited to collaborate with Red Hat to enable Red Hat AI Inference Server on Intel Gaudi accelerators. This integration will provide our customers with an optimized solution to streamline and scale AI inference, delivering advanced performance and efficiency for a wide range of enterprise AI applications."
John Fanelli, Vice President of Enterprise Software at NVIDIA, added: "High-performance inference enables models and AI agents not just to answer, but to reason and adapt in real time. With open, full-stack NVIDIA accelerated computing and Red Hat AI Inference Server, developers can run efficient reasoning at scale across hybrid clouds, and deploy with confidence using Red Hat Inference Server with the new NVIDIA Enterprise AI validated design."
Red Hat has stated its intent to further build upon the vLLM community as well as drive development of distributed inference technologies such as llm-d, aiming to establish vLLM as an open standard for inference in hybrid cloud environments.


Related Articles

'Pretty damn average': Google's AI Overviews underwhelm

RNZ News

a day ago


Most searches online are done using Google. Traditionally, they've returned long lists of links to websites carrying relevant information. Depending on the topic, there can be thousands of entries to pick from or scroll through.

Last year Google started incorporating its Gemini AI tech into its searches. Google's Overviews now inserts Google's own summary of what it's scraped from the internet ahead of the usual list of links to sources in many searches. Some sources say Google's now working towards replacing the lists of links with its own AI-driven search summaries.

RNZ's Kathryn Ryan's not a fan. "Pretty damn average I have to say, for the most part," she said on Nine to Noon last Monday during a chat about AI upending the business of digital marketing.

But Kathryn Ryan is not the only one underwhelmed by Google's Overviews. Recently, online tech writers discovered you can trick it into thinking that made-up sayings are real idioms in common usage. The Sydney Morning Herald's puzzle compiler David Astle - under the headline 'Idiom or Idiot?' - reckoned Google's AI wasn't about to take his job making cryptic crosswords anytime soon.

"There is a strange bit of human psychology which says that we expect a very high bar from machines in a way that we don't from humans," the BBC's head of technology forecasting Laura Ellis told Mediawatch last month. "But if you've got a machine making a mistake, where does that accountability fall? We've just not tested this out yet."

UK Sky News deputy political editor Sam Coates tried to make ChatGPT accountable after it made up an entire episode of his own politics podcast when he used it to help archive transcripts recently. "AI had told a lie that it had got the transcript. And rather than back down it invented an entire fake episode without flagging that it's fake." When challenged on this, the technology insisted Coates had created the episode himself.
When ChatGPT can't find an answer or the right data to draw on, it can 'hallucinate' - simply make up a misleading response.

"ChatGPT is gaslighting me. No such thing exists. It's all a complete fake," Coates spluttered. After turning ChatGPT off and on again in 'conversation mode', it did eventually own up. "It said: 'Look, you're absolutely right to challenge that. I can't remember the exact time that you uploaded.' And then: 'What I can confirm is that I did it and you're holding me to account,'" Coates told viewers.

He went on to challenge ChatGPT about its hallucinations getting worse. "The technology is always improving, and newer versions tend to do a better job at staying accurate," ChatGPT replied. But Coates - armed with data that suggested the opposite - asked ChatGPT for specific stats. The response: "According to recent internal tests from OpenAI, the newer models have shown higher hallucination rates. For instance, the model known as o3 had about a 33 percent hallucination rate, while the o4-mini model had around 48 percent." And then: "I get where you're coming from, and I'm sorry for the mixed messages. The performance of these models can vary."

When Coates aired his experience as a warning for journalists, some reacted with alarm. "The hallucination rate of advanced models... is increasing. As journos, we really should avoid it," said Sunday Times writer and former BBC diplomatic editor Mark Urban.

But some tech experts accused Coates of misunderstanding and misusing the technology. "The issues Sam runs into here will be familiar to experienced users, but it illustrates how weird and alien Large Language Model (LLM) behaviour can seem for the wider public," said Cambridge University AI ethicist Henry Shevlin. "We need to communicate that these are generative simulators rather than conventional programmes," he added. Others were less accommodating on social media.
"All I am seeing here is somebody working in the media who believes they understand how technology works - but [he] doesn't - and highlighting the dangers of someone insufficiently trained in technology trying to use it." "It's like Joey from Friends using the thesaurus function on Word."

Mark Honeychurch is a programmer and long-serving stalwart of the NZ Skeptics, a non-profit body promoting critical thinking and calling out pseudoscience. The Skeptics' website says they confront practices that exploit a lack of specialist knowledge among people. That's what many people use Google for - answers to things they don't know or don't understand. Honeychurch described putting Overviews to the test in a recent edition of the Skeptics' podcast Yeah, Nah.

"The AI looked like it was bending over backwards to please people. It's trying to give an answer that it knows that the customer wants," Honeychurch told Mediawatch.

Honeychurch asked Google for the meaning of: 'Better a skeptic than two geese.'

"It's trying to do pattern-matching and come out with something plausible. It does this so much that when it sees something that looks like an idiom that it's never heard before, it sees a bunch of idioms that have been explained and it just follows that pattern. It told me a skeptic is handy to have around because they're always questioning - but two geese could be a handful and it's quite hard to deal with two geese.

"With some of them, it did give me a caveat that this doesn't appear to be a popular saying. Then it would launch straight into explaining it. Even if it doesn't make sense, it still gives it its best go because that's what it's meant to do."

In time, would AI and Google detect the recent articles pointing out this flaw - and learn from them?

"There's a whole bunch of base training where (AI) just gets fed data from the internet as base material. But on top of that, there's human feedback.
"They run it through a battery of tests and humans can basically mark the quality of answers. So you end up refining the model and making it better.

"By the time I tested this, it was warning me that a few of my fake idioms don't appear to be popular phrases. But then it would still launch into trying to explain it to me anyway, even though it wasn't real."

Things got more interesting - and alarming - when Honeychurch tested Google Overviews with real questions about religion, alternative medicine and skepticism.

"I asked why you shouldn't be a skeptic. I got a whole bunch of reasons that sounded plausible about losing all your friends and being the boring person at the party that's always ruining stories. When I asked it why you should be a skeptic, all I got was a message saying it cannot answer my question."

He also asked why one should be religious - and why not. And what reasons we should trust alternative medicines - and why we shouldn't.

"The skeptical, the rational, the scientific answer was the answer that Google's AI just refused to give. For the flip side of why I should be religious, I got a whole bunch of answers about community and a feeling of warmth and connecting to my spiritual dimension. I also got a whole bunch about how sometimes alternative medicine may have turned out to be true and so you can't just dismiss it.

"But we know why we shouldn't trust alternative medicine. It's alternative, so it's not been proven to work. There's a very easy answer."

But not one Overviews was willing or able to give, it seems. Google does answer the neutral question 'Should I trust alternative medicine?' by saying there is "no simple answer" and "it's crucial to approach alternative medicine with caution and prioritise evidence-based conventional treatments."

So is Google trying not to upset people with answers that might concern them?

"I don't want to guess too much about that.
It's not just Google but also OpenAI and other companies doing human feedback to try and make sure that it doesn't give horrific answers or say things that are objectionable.

"But it's always conflicting with the fact that this AI is just trained to give you that plausible answer. It's trying to match the pattern that you've given in the question."

Journalists use Google, just like anyone who's in a hurry and needs information quickly. Do journalists need to ensure they don't rely on the Overviews summary right at the top of the search page?

"Absolutely. This is AI use 101. If you're asking something of a technical question, you really need to be well enough versed in what you're asking that you can judge whether the answer is good or not."

Sign up for Ngā Pitopito Kōrero, a daily newsletter curated by our editors and delivered straight to your inbox every weekday.

Ceasefire Holds, But Experts Warn Cyber Tensions Between Iran And The West May Be Far From Over

Scoop

4 days ago


As a U.S.-brokered ceasefire between Israel and Iran holds for now, cybersecurity experts are urging vigilance—noting that while military activity may have paused, cyber tensions are likely to continue simmering beneath the surface.

"In light of recent developments, the likelihood of disruptive cyberattacks against U.S. targets by Iranian actors has increased," said John Hultquist, chief analyst at Google's Threat Intelligence Group. "Iran already targets the U.S. with cyberespionage… and individuals associated with Iran policy should be on the lookout for social engineering schemes."

A new report from cybersecurity firm Radware adds weight to those concerns, warning that the Israel-Iran conflict has evolved into a hybrid war that extends into cyberspace. According to their latest advisory:

  • Nearly 40% of global DDoS activity recently targeted Israel, with signs of spillover affecting the U.S., U.K., and Jordan.
  • Hacker groups such as DieNet, Arabian Ghosts, and Sylhet Gang have issued warnings or taken credit for attacks, some aimed at Western nations.
  • AI-generated disinformation and deepfakes have appeared across digital platforms, contributing to confusion and information warfare.

"Critical infrastructure, supply chains, and global businesses could become collateral targets if cyber tensions escalate further," said Pascal Geenens, Director of Threat Intelligence at Radware. "The Israel-Iran conflict of 2025 is a stark illustration of how modern hybrid warfare plays out online as much as in the real world."

While the ceasefire has reduced the immediate risk of open military confrontation, experts believe that cyberspace may remain a domain for ongoing friction—especially as cyber operations allow for plausible deniability and targeted disruption. Hultquist cautioned that while Iranian cyber operations may sometimes exaggerate their impact, the risk for individual organisations remains serious.
'We should be careful not to overestimate these incidents and inadvertently assist the actors,' he said. 'The impacts may still be very serious for individual enterprises, which can prepare by taking many of the same steps they would to prevent ransomware.' For now, the digital front may be quiet—but beneath the surface, it's likely that espionage and influence operations are still underway.

Google launches Gemini 2.5 models with pricing & speed updates

Techday NZ

5 days ago


Google has released updates to its Gemini 2.5 suite of artificial intelligence models, detailing stable releases, new offerings, and pricing changes.

Model releases

The company announced that Gemini 2.5 Pro and Gemini 2.5 Flash are now generally available and deemed stable, maintaining the same versions that had previously been available for preview. In addition, Google introduced Gemini 2.5 Flash-Lite in preview, providing an option focused on cost-effectiveness and latency within the Gemini 2.5 product line.

Gemini 2.5 models are described as "thinking models" capable of reasoning through their processes before generating responses, a feature that is expected to enhance the performance and accuracy of the tools. The models allow developers to manage a so-called "thinking budget", granting greater control over the depth and speed of reasoning based on the needs of individual applications.

Gemini 2.5 Flash-Lite

Gemini 2.5 Flash-Lite is intended as an upgrade for customers currently using previous iterations such as the Gemini 1.5 and 2.0 Flash models. According to the company, the new model improves performance across several evaluation measures, reduces the time to first token, and increases decoding speed in tokens per second. Flash-Lite is targeted at high-volume use cases like classification and summarisation at scale, where throughput and cost are key considerations.

This model provides API-level control for dynamic management of the "thinking budget". It is set apart from other Gemini 2.5 models in that its "thinking" function is deactivated by default, reflecting its focus on cost and speed. Gemini 2.5 Flash-Lite includes existing features such as grounding with Google Search, code execution, URL context, and support for function calling.

Updates and pricing

Google also clarified changes to the Gemini 2.5 Flash model and its associated pricing structure.
The pricing for 2.5 Flash has been updated to USD $0.30 per 1 million input tokens (increased from USD $0.15) and USD $2.50 per 1 million output tokens (reduced from USD $3.50). The company removed the distinction between "thinking" and "non-thinking" pricing and established a single price tier, irrespective of input token size.

In a joint statement, Shrestha Basu Mallick and Logan Kilpatrick, both Group Product Managers, said: "While we strive to maintain consistent pricing between preview and stable releases to minimize disruption, this is a specific adjustment reflecting Flash's exceptional value, still offering the best cost-per-intelligence available. And with Gemini 2.5 Flash-Lite, we now have an even lower cost option (with or without thinking) for cost and latency sensitive use cases that require less model intelligence."

Customers using Gemini 2.5 Flash Preview from April will retain their existing pricing until the model's planned deprecation on July 15, 2025, after which they will need to transition to the updated stable version or move to Flash-Lite Preview.

Continued growth for Gemini 2.5 Pro

Google reported that demand for Gemini 2.5 Pro is "the steepest of any of our models we have ever seen." The stable release of the 06-05 version is intended to increase capacity for customers using Gemini 2.5 Pro in production environments, maintaining the existing price point. The company indicated that the model is particularly well-suited for tasks requiring significant intelligence and advanced capabilities, such as coding and agentic tasks, and noted its adoption in a range of developer tools.

"We expect that cases where you need the highest intelligence and most capabilities are where you will see Pro shine, like coding and agentic tasks. Gemini 2.5 Pro is at the heart of many of the most loved developer tools."
Google highlighted a range of tools built on Gemini 2.5 Pro, including offerings from Cursor, Bolt, Cline, Cognition, Windsurf, GitHub, Lovable, Replit, and Zed Industries. The company advised that users of the 2.5 Pro Preview 05-06 model will be able to access it until June 19, 2025, when it will be discontinued. Those using the 06-05 preview version are directed to update to the now-stable "gemini-2.5-pro" model. The statement concluded: "We can't wait to see even more domains benefit from the intelligence of 2.5 Pro and look forward to sharing more about scaling beyond Pro in the near future."
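As a quick sanity check, the updated Flash rates quoted in the article work out as follows; the request sizes here are illustrative, not from the article:

```python
# Cost check using the updated Gemini 2.5 Flash rates quoted above:
# USD $0.30 per 1 million input tokens, USD $2.50 per 1 million output tokens.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 2.50 / 1_000_000  # USD per output token

def flash_cost(input_tokens, output_tokens):
    """Estimated USD cost of one request at the single-tier Flash pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a hypothetical request with 10,000 input and 2,000 output tokens:
cost = flash_cost(10_000, 2_000)  # 0.003 + 0.005 = USD $0.008
```

Because the distinction between "thinking" and "non-thinking" pricing was removed, a single pair of rates like this now covers all Flash requests regardless of input size.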
