The Race for AI Inference: Challenging Nvidia’s Dominance
In the rapidly evolving landscape of artificial intelligence (AI), Nvidia has established itself as a formidable leader, particularly in AI training. However, as the focus shifts toward inference, the stage at which AI models generate outputs from what they learned in training, Nvidia’s competitors are gearing up for a fierce battle. With inference workloads projected to grow sharply, startups like SambaNova, Groq, and Cerebras are positioning themselves to chip away at Nvidia’s substantial head start in a market opportunity estimated in the trillions of dollars.
Understanding AI Inference
Inference is a critical component of AI computing, representing the production stage where trained models are put to work. Whether it’s generating images, crafting written responses, or executing complex tasks, inference chips are responsible for delivering the outputs that users expect from AI systems. As the demand for AI applications grows, so too does the need for efficient and powerful inference solutions.
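To make the distinction concrete, the sketch below runs a single inference pass with a small open model through the Hugging Face transformers library. The model name and prompt are placeholders chosen for illustration; any trained causal language model would serve the same purpose.

```python
# Minimal inference sketch: a trained model producing output from a prompt.
# Assumes the Hugging Face `transformers` library; "gpt2" is a small
# stand-in checkpoint used purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "AI inference is"
inputs = tokenizer(prompt, return_tensors="pt")

# Inference: no weight updates, just a forward pass generating new tokens.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Training adjusts the model's weights; here the weights stay frozen, and the hardware's only job is to turn a prompt into tokens as quickly and cheaply as possible.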
Rodrigo Liang, co-founder of SambaNova Systems, has been eyeing Nvidia’s lead since the company’s founding in 2017. At that time, the AI ecosystem was still in its infancy and inference workloads were relatively small. As foundation models have grown in size and accuracy, however, the transition from training to inference has become increasingly apparent. According to Nvidia CFO Colette Kress, inference now accounts for 40% of the company’s data center workloads, and Liang predicts that figure could rise to 90% in the near future.
Startups Targeting Inference Speed
To compete with Nvidia, newer players in the market are emphasizing speed as a key differentiator. Companies like Groq, Cerebras, and SambaNova are not only claiming to offer the fastest inference computing solutions but are also taking a different approach by eschewing traditional graphics processing units (GPUs) in favor of specialized architectures.
SambaNova, for instance, utilizes a reconfigurable dataflow unit (RDU) designed specifically for machine learning tasks. Liang argues that this architecture is inherently better suited for inference than the GPUs used by Nvidia and AMD, which were originally designed for rendering graphics. This sentiment is echoed by Andrew Feldman, CEO of Cerebras, who also believes that their technology can outperform Nvidia’s offerings in the inference space.
The Importance of Speed in AI
Speed is a crucial factor in AI inference, especially when multiple AI models need to communicate with one another. Delays in processing can hinder the seamless experience that users expect from generative AI applications. SambaNova’s RDUs are touted as particularly effective for agentic AI, which can perform tasks with minimal instruction.
However, measuring inference speed is complex. Characteristics of the model being served, whether Meta’s Llama or OpenAI’s o1, influence how quickly results are generated, and a chip’s performance can vary significantly with its networking configuration and the specific setup of the data center around it. Metrics like tokens per second, which capture how much output a system produces, are commonly used, but they don’t account for latency, which can arise from numerous factors.
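A toy calculation, using entirely made-up numbers, shows why tokens per second alone can mislead: two systems can post identical throughput while giving users very different waits before the first token appears.

```python
# Toy comparison (all numbers hypothetical): same throughput, different latency.

def describe(name, ttft_s, total_tokens, total_time_s):
    """ttft_s = time to first token; throughput counts only generation time."""
    gen_time = total_time_s - ttft_s
    tokens_per_s = total_tokens / gen_time
    print(f"{name}: {tokens_per_s:.0f} tok/s, "
          f"first token after {ttft_s * 1000:.0f} ms")

# System A: fast to start, steady generation.
describe("System A", ttft_s=0.10, total_tokens=500, total_time_s=2.60)
# System B: identical generation speed, but users wait 1.5 s for any output.
describe("System B", ttft_s=1.50, total_tokens=500, total_time_s=4.00)
```

Both systems report 200 tokens per second, yet System B feels sluggish in an interactive chat, and in an agentic pipeline where one model waits on another, those startup delays compound.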
Navigating the Competitive Landscape
To carve out a niche in the inference market, startups are exploring business models that let them avoid selling hardware head-to-head against Nvidia. SambaNova, for example, offers access to Meta’s open-source Llama foundation models through its own cloud service, and Cerebras and Groq have launched similar offerings. This strategy positions these companies as competitors not only to Nvidia but also to AI foundation model providers like OpenAI.
Public platforms like Artificialanalysis.ai are beginning to compare inference-as-a-service offerings, revealing that Cerebras, SambaNova, and Groq are among the fastest APIs for Meta’s Llama models. Notably, Nvidia does not provide inference-as-a-service, which limits its visibility in these comparisons. While Nvidia remains a dominant player in MLPerf benchmarks for hardware computing speed, the emergence of these startups indicates a shifting landscape.
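Comparisons like these are typically built by streaming responses from each provider’s hosted API and timing them. The sketch below shows the general shape of such a measurement against an OpenAI-compatible endpoint; the base URL, model name, and environment variable are hypothetical placeholders, and each vendor’s actual API details will differ.

```python
# Rough sketch of how a public comparison might time a hosted Llama endpoint.
# The base_url, model name, and API key are placeholders; many inference
# providers expose OpenAI-compatible APIs, but check each vendor's docs.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # hypothetical env var
)

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize AI inference in a line."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start  # time to first token
        n_chunks += 1  # rough proxy: one streamed chunk ~ one token

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at:.2f}s, ~{n_chunks / elapsed:.0f} tok/s overall")
```

Because results depend on the provider’s load, region, and model configuration at the moment of measurement, such leaderboards are snapshots rather than fixed rankings.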
The Cost of Inference Solutions
While performance is a significant consideration, potential buyers must also weigh the total cost of ownership (TCO) of different chip solutions. Dylan Patel, chief analyst at SemiAnalysis, suggests that GPUs may offer a superior TCO per token compared to the newer chips from startups. However, both SambaNova and Cerebras contest this viewpoint.
Liang acknowledges that pushing inference speeds higher can require a larger hardware footprint, which drives up costs, but claims SambaNova compensates by delivering high speed and capacity with fewer chips, ultimately lowering costs. Feldman of Cerebras likewise challenges the notion that GPUs have a lower TCO, attributing Nvidia’s claims to its dominant market position rather than technological superiority.
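To see what arguing over “TCO per token” actually involves, here is a back-of-the-envelope calculation with entirely hypothetical inputs. Real analyses would also fold in networking, cooling, software, and staffing, and changing any of the made-up figures below can flip the conclusion either way.

```python
# Back-of-the-envelope TCO per token (every figure below is hypothetical).

def cost_per_million_tokens(hw_cost_usd, lifetime_years, power_kw,
                            usd_per_kwh, tokens_per_second, utilization):
    hours = lifetime_years * 365 * 24
    energy_cost = power_kw * hours * usd_per_kwh
    total_cost = hw_cost_usd + energy_cost
    total_tokens = tokens_per_second * utilization * hours * 3600
    return total_cost / (total_tokens / 1e6)

# Hypothetical GPU server vs. hypothetical specialized-accelerator system.
gpu = cost_per_million_tokens(250_000, 4, 10.0, 0.08, 20_000, 0.6)
rdu = cost_per_million_tokens(400_000, 4, 12.0, 0.08, 45_000, 0.6)
print(f"GPU system:  ${gpu:.3f} per million tokens")
print(f"Accelerator: ${rdu:.3f} per million tokens")
```

With these invented numbers the pricier specialized system wins on cost per token because of its higher throughput, which is the essence of Liang’s fewer-chips argument; with Patel’s assumptions about GPU pricing and utilization, the comparison tips the other way.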
The Future of AI Inference
As the AI inference market matures, the competition is expected to intensify. Startups are betting on their unique architectures and speed advantages to disrupt Nvidia’s stronghold. With the landscape evolving rapidly, it remains to be seen how these dynamics will play out and whether Nvidia can maintain its lead in a market that is increasingly focused on inference capabilities. The next few months will be critical as these companies strive to establish themselves in a space that is poised for significant growth.