Perplexity serves Qwen3 235B models on Nvidia GB200 racks, showing major inference gains

Perplexity AI is now running massive language models on Nvidia’s newest hardware, and the performance jump is hard to ignore. The company has published technical research detailing its deployment of post-trained Qwen3 235B mixture-of-experts (MoE) models on Nvidia’s Blackwell-generation GB200 NVL72 racks, showing substantial improvements in both speed and cost over the previous Hopper-generation systems.

What Perplexity actually built

The setup involves GB200 NVL72 racks, each packing 72 GPUs with 180 GB of high-bandwidth memory apiece. Those GPUs are wired into a single 72-GPU NVLink domain, with fifth-generation NVLink delivering 1,800 GB/s of bandwidth per GPU.
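
A quick back-of-envelope calculation shows why a single rack comfortably holds a model this size. The rack figures below come from the specs above; the bytes-per-parameter values are standard format sizes, and the math deliberately ignores KV cache, activations, and runtime overhead:

```python
# Back-of-envelope memory math for a 235B-parameter model on one NVL72 rack.
# Rack figures are from the article; byte sizes are standard format widths.
GPUS_PER_RACK = 72
HBM_PER_GPU_GB = 180
PARAMS = 235e9  # Qwen3 235B total parameters

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

rack_hbm_gb = GPUS_PER_RACK * HBM_PER_GPU_GB
print(f"Aggregate HBM per rack: {rack_hbm_gb / 1000:.2f} TB")

for fmt, bytes_per in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per / 1e9
    print(f"{fmt}: ~{weights_gb:.0f} GB of weights "
          f"({weights_gb / rack_hbm_gb:.1%} of rack HBM)")
```

Even at bf16, the raw weights occupy under 4% of the rack's roughly 13 TB of aggregate HBM, leaving the bulk of memory for KV cache and large batch sizes.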

Here’s where the numbers get interesting. Latency for NVLink all-reduce operations dropped from 586.1 microseconds on the H200 (Hopper) to 313.3 microseconds on the GB200. That’s a 46% reduction. MoE prefill combine time fell from 730.1 microseconds to 438.5 microseconds, roughly a 40% improvement.
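
Both percentages follow directly from the reported microsecond figures:

```python
# Sanity-check the reported H200 -> GB200 latency reductions.
def reduction(before_us: float, after_us: float) -> float:
    """Fractional latency reduction from before to after."""
    return (before_us - after_us) / before_us

print(f"NVLink all-reduce: {reduction(586.1, 313.3):.1%}")  # ~46.5%
print(f"MoE prefill combine: {reduction(730.1, 438.5):.1%}")  # ~39.9%
```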

For certain configurations, Perplexity also reports up to 30x the real-time inference capability of H100 baselines.

The engineering under the hood

Perplexity’s research highlights several software-level optimizations that squeeze more performance out of the Blackwell architecture. These include Blackwell-native quantization, which reduces the precision of model weights to speed up computation without meaningfully degrading output quality. There’s also prefill/decode disaggregation, a technique that separates the initial processing of a prompt from the token-by-token generation phase. Custom kernels round out the optimization stack, with Perplexity writing specialized code tuned for the specific demands of serving a 235-billion-parameter MoE model on this particular hardware topology.
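
Perplexity hasn't published kernel code alongside these claims, but block-scaled quantization, the general approach behind Blackwell's low-precision formats such as FP8 and NVFP4, can be sketched in a few lines. The block size and function names below are illustrative, not Perplexity's implementation:

```python
import numpy as np

# Minimal sketch of block-scaled weight quantization. Each block of weights
# shares one scale so its values fit the narrow dynamic range of a
# low-precision format. Illustrative only -- not Perplexity's actual kernels.

def quantize_blockwise(weights: np.ndarray, block: int = 32, max_q: float = 448.0):
    """Quantize a 1-D weight vector in fixed-size blocks.

    max_q mimics the largest representable magnitude of the target format
    (448 is the FP8 E4M3 maximum). Returns low-precision values and
    per-block scales.
    """
    w = weights.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / max_q
    q = np.round(w / scales)  # rounded low-precision representation
    return q, scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate full-precision weights."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

The reconstruction error stays small because each block's scale is fitted to that block's largest weight, which is why this style of quantization can cut memory and compute without the quality collapse a single global scale would cause.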

The combination of hardware and software improvements means the GB200 NVL72 setup significantly lowers inference costs without sacrificing output quality compared to Hopper-based systems.

Why this matters for the broader AI hardware race

This deployment strengthens Nvidia’s position against alternatives like AMD’s MI300X and AWS’s custom Trainium chips. The 72-GPU NVLink topology delivering 1,800 GB/s bandwidth is particularly significant, as competing solutions often rely on slower interconnects between chips, which creates bottlenecks when serving models that need to coordinate across many GPUs simultaneously.
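
The bandwidth argument is easy to quantify. The sketch below estimates per-hop transfer time for a hypothetical expert-routing payload; the 64 MB payload and the 400 GB/s comparison link are illustrative assumptions, and only the 1,800 GB/s NVLink figure comes from the deployment described above:

```python
# Back-of-envelope: time to move MoE routing traffic between GPUs at
# different interconnect speeds. Payload size and the slower link are
# illustrative assumptions, not measurements.

def transfer_us(megabytes: float, gb_per_s: float) -> float:
    """Microseconds to move `megabytes` at `gb_per_s`."""
    return megabytes / 1000 / gb_per_s * 1e6

payload_mb = 64  # hypothetical per-GPU expert-routing payload

for name, bw in [("NVLink 5 (GB200)", 1800), ("hypothetical 400 GB/s link", 400)]:
    print(f"{name}: {transfer_us(payload_mb, bw):.1f} us per hop")
```

At MoE scale, where every layer shuffles activations between experts spread across the rack, that per-hop gap compounds into exactly the kind of bottleneck the article describes.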
