AA-AgentPerf releases initial results for DeepSeek V4 Pro benchmark, showing NVIDIA Blackwell dominance


Artificial Analysis has released something the AI hardware world has been quietly waiting for: a benchmark that measures how well chips handle real-world agentic AI workloads. The benchmark is called AA-AgentPerf, and its initial results running DeepSeek V4 Pro tell a story that AMD would probably rather not hear right now.

NVIDIA’s Blackwell systems, specifically the B200 and GB300, consistently outperformed AMD’s Instinct MI355X GPUs on power efficiency for agentic inference.

What AA-AgentPerf actually measures

AA-AgentPerf is the first multi-vendor open benchmark from Artificial Analysis designed specifically to measure hardware performance on agentic coding tasks.

The benchmark evaluates how many concurrent agents a system can support while meeting specific service-level objectives. Those SLOs cover output token speeds ranging from 20 to 300 tokens per second and time-to-first-token (TTFT) targets between 3 and 10 seconds.
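To make the "concurrent agents within an SLO" idea concrete, here is a minimal Python sketch. It is not Artificial Analysis's methodology, which the initial results do not spell out here. It assumes the per-agent output speed and TTFT have already been measured at several load levels, and every number and name below (`SLO`, `Measurement`, `max_agents_within_slo`, the sample data) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    min_tokens_per_s: float  # per-agent output speed floor
    max_ttft_s: float        # time-to-first-token ceiling, in seconds


@dataclass(frozen=True)
class Measurement:
    concurrent_agents: int
    tokens_per_s: float      # per-agent output speed observed at this load
    ttft_s: float            # time-to-first-token observed at this load


def max_agents_within_slo(measurements, slo):
    """Return the highest tested concurrency that meets the SLO, or 0 if none do."""
    passing = [
        m.concurrent_agents
        for m in measurements
        if m.tokens_per_s >= slo.min_tokens_per_s and m.ttft_s <= slo.max_ttft_s
    ]
    return max(passing, default=0)


# Hypothetical load-test results for one system (illustrative values only).
MEASUREMENTS = [
    Measurement(concurrent_agents=8, tokens_per_s=120.0, ttft_s=1.5),
    Measurement(concurrent_agents=16, tokens_per_s=80.0, ttft_s=2.8),
    Measurement(concurrent_agents=32, tokens_per_s=45.0, ttft_s=4.9),
    Measurement(concurrent_agents=64, tokens_per_s=18.0, ttft_s=11.0),
]

relaxed = max_agents_within_slo(MEASUREMENTS, SLO(min_tokens_per_s=20.0, max_ttft_s=10.0))
strict = max_agents_within_slo(MEASUREMENTS, SLO(min_tokens_per_s=100.0, max_ttft_s=3.0))
print(relaxed, strict)
```

With these made-up measurements, a relaxed SLO admits more concurrent agents than a strict one, which is why a result of this kind only means something alongside the SLO it was measured against.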

Rather than relying on synthetic evaluation methods, the benchmark leverages actual coding trajectories. Results are then normalized per accelerator and per megawatt, which creates a comparison framework that accounts for both raw performance and energy consumption.
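The normalization step is simple arithmetic, sketched below in Python. The function names and the example system (64 agents, 8 accelerators, 10 kW) are hypothetical, and how the benchmark actually measures system power is not described in the initial results, so treat this as an illustration of the units rather than the benchmark's own calculation.

```python
def agents_per_accelerator(total_agents, num_accelerators):
    """Concurrent agents supported, divided by the number of accelerators."""
    if num_accelerators <= 0:
        raise ValueError("num_accelerators must be positive")
    return total_agents / num_accelerators


def agents_per_megawatt(total_agents, system_power_watts):
    """Concurrent agents supported per megawatt (1 MW = 1,000,000 W) of system power."""
    if system_power_watts <= 0:
        raise ValueError("system_power_watts must be positive")
    return total_agents * 1_000_000 / system_power_watts


# Hypothetical system: 64 concurrent agents on 8 accelerators drawing 10 kW in total.
per_accel = agents_per_accelerator(total_agents=64, num_accelerators=8)
per_mw = agents_per_megawatt(total_agents=64, system_power_watts=10_000)
print(per_accel, per_mw)
```

The two figures can rank systems differently: a chip that supports more agents per accelerator but draws disproportionately more power would lead on the first metric and trail on the second.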

DeepSeek V4 Pro enters the chat

The model at the center of this benchmark is DeepSeek V4 Pro, which has been turning heads since its release around April 2026. It scored 1554 on the GDPval-AA benchmark, placing it firmly among the top-performing open-weights models available today.

DeepSeek V4 Pro (Max) also earned a score of 52 on the Artificial Analysis Intelligence Index, ranking it second among open-weights reasoning models.

NVIDIA vs. AMD and what it means for the data center market

The initial AA-AgentPerf results paint a clear picture of competitive positioning. NVIDIA’s Blackwell architecture, represented by the B200 and GB300 systems, delivered superior performance per watt compared to AMD’s MI355X across the tested agentic workloads.

The per-megawatt normalization is especially telling. Data centers are increasingly constrained not by rack space or capital budgets but by power availability. A chip that can support more concurrent agents per megawatt of power consumed has a tangible, quantifiable advantage that translates directly to the bottom line.

For NVIDIA, these results reinforce a narrative the company has been building around Blackwell’s efficiency characteristics. The timing is also notable: the performance-leadership figures were reported as of a June 12, 2026 crawl date, which suggests NVIDIA moved quickly to publicize the favorable results through its developer blog.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
