rawmark
Real hardware. Real numbers. No vendor spin.
63.1 tok/s · qwen3:30b · ROCm
131.8 tok/s · Qwen3-1.7B · XDNA2 NPU
+464% · ROCm vs Windows · qwen3:30b
55.4 tok/s · qwen3:14b · ROCm gfx1201
+552% · ROCm vs Vulkan · qwen3:14b
Forge
CPU: AMD Ryzen AI Max+ 395 · GPU: Radeon 8060S (96 GB) · RAM: 128 GB LPDDR5X · OS: Ubuntu 24.04, Kernel 6.17 HWE · ROCm 7.2 (gfx1151)
Model            Backend  Think  Prompt  Output  Eval (s)  Wall (s)  tok/s
qwen3:30b        ROCm     off        47   2,012     31.87     37.53   63.1
qwen3:30b        ROCm     off       154   4,501     74.70     76.09   60.3
qwen3:30b        ROCm     on         47   2,094     33.32     34.18   62.8
deepseek-r1:70b  ROCm     —          36     961    192.30    212.99    5.0
deepseek-r1:70b  ROCm     —      timed out (medium context)
deepseek-r1:70b  ROCm     —      timed out (long context)
Qwen3-1.7B-GGUF  NPU      —          19     512      3.89         —  131.8
Qwen3-1.7B-GGUF  NPU      —         126     512      3.99         —  128.4
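The tok/s column is decode throughput: output tokens divided by eval time (wall time additionally includes prompt processing and model load). Ollama reports these as `eval_count` and `eval_duration` (nanoseconds) in its generate-API response; a minimal sketch of the arithmetic behind the table:

```python
def decode_tok_s(eval_count: int, eval_duration_ns: int) -> float:
    """Decode throughput: output tokens divided by eval time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)

# First Forge row: 2,012 output tokens in 31.87 s of eval time.
print(round(decode_tok_s(2012, 31_870_000_000), 1))  # → 63.1
```

The same formula reproduces every ROCm row above, e.g. 961 tokens over 192.30 s gives the 5.0 tok/s for deepseek-r1:70b.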
Bifrost
CPU: AMD Ryzen 9 3950X · GPU: Radeon RX 9070 XT 16 GB (RDNA4) · RAM: 64 GB DDR4 · OS: Windows 11 Pro · ROCm 7.1 gfx1201 (HIP SDK 7.1.1) · Ollama 0.20.4
Model       Backend  Processor     Prompt  Output  tok/s  Notes
qwen3:14b   ROCm     100% GPU          47     256   55.4  After unlock
qwen3:14b   Vulkan   CPU fallback      47     256    8.5  Before unlock
gemma4:e4b  ROCm     32% GPU           47     256   31.3  MoE partial offload
gemma4:e4b  Vulkan   Vulkan            47     256   16.6  Before unlock
ROCm unlock — qwen3:14b on RX 9070 XT (Windows 11)
Vulkan: 8.5 tok/s  •  ROCm gfx1201: 55.4 tok/s
+552% throughput gain. Same hardware, same model — driver stack is everything.
Fix: HIP SDK 7.1.1 + ROCBLAS_TENSILE_LIBPATH env var.
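The unlock amounts to two steps: install HIP SDK 7.1.1, then point rocBLAS at the Tensile kernel library so it finds the gfx1201 kernels. A sketch for a Windows command prompt; the path shown is an assumption based on a default SDK install location, so adjust it to wherever the SDK actually landed:

```shell
:: Hypothetical install path -- point this at your HIP SDK's rocblas\library folder.
set ROCBLAS_TENSILE_LIBPATH=C:\Program Files\AMD\ROCm\7.1\bin\rocblas\library

:: Restart the Ollama server process so it inherits the variable.
ollama serve
```

Set the variable system-wide (or in the environment of whatever service launches Ollama) if you want it to survive reboots.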
ROCm vs Windows — qwen3:30b on Ryzen AI Max+ 395
Linux (ROCm 7.2): 62 tok/s  •  Windows 11: 11 tok/s
+464% throughput gain with ROCm on the same hardware. Same chip, same RAM, same model weights — the only variable is the driver stack.
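Both headline gains are plain throughput ratios over the same hardware and weights; a quick check of the arithmetic:

```python
def gain_pct(fast: float, slow: float) -> int:
    """Percent throughput gain of `fast` over `slow`, rounded to a whole percent."""
    return round((fast / slow - 1) * 100)

print(gain_pct(55.4, 8.5))  # qwen3:14b, ROCm vs Vulkan → 552
print(gain_pct(62, 11))     # qwen3:30b, Linux ROCm vs Windows → 464
```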
XDNA2 NPU — Qwen3-1.7B via Lemonade
131.8 tok/s  •  0 W GPU draw  •  Runs entirely on the dedicated NPU — GPU stays free for heavier workloads. Inference via Lemonade runtime on XDNA2 silicon.
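Lemonade exposes an OpenAI-compatible HTTP server, so the NPU rows can be driven from any OpenAI-style client. A sketch, assuming a locally running Lemonade server; the port, endpoint path, and model id are assumptions and may differ in your install:

```shell
# Assumed endpoint and model id -- check your Lemonade server's configuration.
curl -s http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'
```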