SambaNova is stepping into the competitive AI infrastructure scene with the launch of its new inference cloud, boasting record-breaking speeds for Meta’s Llama 3.1 model. The company claims its system can churn out 132 tokens per second on the 405-billion-parameter model, more than twice the speed of rival GPU-based systems, citing benchmarks from Artificial Analysis.
CEO Rodrigo Liang highlighted that SambaNova’s infrastructure, powered by its SN40L accelerators, allows the model to run at its full 16-bit precision, delivering a significant performance advantage in handling AI tasks. SambaNova’s cloud is set to provide faster API access to AI models for enterprises, further intensifying competition among AI infrastructure vendors like Cerebras and Groq.
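For developers, that API access typically boils down to an HTTP call against a hosted chat-completions endpoint. The sketch below is a minimal illustration assuming an OpenAI-style JSON interface; the URL, model identifier, and INFERENCE_API_KEY environment variable are placeholders for this example, not SambaNova’s documented API.

```python
import os
import requests

# Hypothetical endpoint and model name, used purely for illustration;
# the article does not specify the actual API shape or naming.
API_URL = "https://api.example-inference-cloud.com/v1/chat/completions"
MODEL = "llama-3.1-405b"


def generate(prompt: str, max_tokens: int = 256) -> str:
    """Send one chat-completion request to a hosted inference endpoint."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(generate("Summarize Llama 3.1 405B in two sentences."))
```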
The SN40L accelerator plays a critical role in this performance, pairing 64 GB of HBM3 memory with on-chip caching to sustain its speed even with up to four simultaneous requests. SambaNova’s approach sidesteps the interconnect bottlenecks common in multi-GPU setups, keeping performance steady as demand scales. The system isn’t without limitations, however: the context window is currently capped at 8k tokens, which could hamper more complex, longer-context tasks.
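That 8k-token cap matters mostly on the client side, since each request has to budget prompt plus completion tokens to fit the window. The snippet below is a rough sketch of that bookkeeping, using a crude characters-per-token heuristic rather than Llama’s actual tokenizer; a production client would count tokens exactly.

```python
CONTEXT_WINDOW = 8_192    # current cap reported for the service
CHARS_PER_TOKEN = 4       # rough heuristic, not the real tokenizer


def fit_prompt(prompt: str, max_completion_tokens: int = 512) -> str:
    """Trim a prompt so prompt + completion stays within the context window."""
    prompt_budget = CONTEXT_WINDOW - max_completion_tokens
    max_chars = prompt_budget * CHARS_PER_TOKEN
    if len(prompt) <= max_chars:
        return prompt
    # Keep the tail of the prompt, which usually carries the live question.
    return prompt[-max_chars:]
```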
As AI infrastructure providers race to outdo one another, Cerebras and Groq are making equally bold claims about their own Llama 3.1 deployments, further heating up the competition. Still, SambaNova’s launch of its inference cloud, with free and enterprise tiers available immediately, sets a new bar for speed in AI model deployment. The stakes are higher than ever in this rapidly evolving landscape.