| Model | Backend | Think | Prompt (tokens) | Output (tokens) | Eval (s) | Wall (s) | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3:30b | ROCm | off | 47 | 2,012 | 31.87 | 37.53 | 63.1 |
| qwen3:30b | ROCm | off | 154 | 4,501 | 74.70 | 76.09 | 60.3 |
| qwen3:30b | ROCm | on | 47 | 2,094 | 33.32 | 34.18 | 62.8 |
| deepseek-r1:70b | ROCm | — | 36 | 961 | 192.30 | 212.99 | 5.0 |
| deepseek-r1:70b | ROCm | — | timed out (medium context) | | | | |
| deepseek-r1:70b | ROCm | — | timed out (long context) | | | | |
| Qwen3-1.7B-GGUF | NPU | — | 19 | 512 | — | 3.89 | 131.8 |
| Qwen3-1.7B-GGUF | NPU | — | 126 | 512 | — | 3.99 | 128.4 |

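
For the ROCm rows that report an eval time, the tok/s figures match output tokens divided by Eval (s). The sketch below reproduces that arithmetic as a sanity check. Treating the column as derived this way is an assumption, and the helper name `tokens_per_second` is hypothetical. The NPU rows are left out because they report no eval time.

```python
def tokens_per_second(output_tokens: int, seconds: float) -> float:
    """Decode throughput: generated tokens divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return output_tokens / seconds


# (label, output tokens, eval seconds) copied from the ROCm rows of the table.
rocm_rows = [
    ("qwen3:30b, think off, 47-token prompt", 2012, 31.87),
    ("qwen3:30b, think off, 154-token prompt", 4501, 74.70),
    ("qwen3:30b, think on, 47-token prompt", 2094, 33.32),
    ("deepseek-r1:70b, 36-token prompt", 961, 192.30),
]

for label, output_tokens, eval_seconds in rocm_rows:
    rate = tokens_per_second(output_tokens, eval_seconds)
    print(f"{label}: {rate:.1f} tok/s")
```

Dividing by Wall (s) instead would also count prompt processing and model load, so it gives a lower figure for the same run.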

| Model | Backend | Processor | Prompt (tokens) | Output (tokens) | tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3:14b | ROCm | 100% GPU | 47 | 256 | 55.4 | After unlock |
| qwen3:14b | Vulkan | CPU fallback | 47 | 256 | 8.5 | Before unlock |
| gemma4:e4b | ROCm | 32% GPU | 47 | 256 | 31.3 | MoE partial offload |
| gemma4:e4b | Vulkan | Vulkan | 47 | 256 | 16.6 | Before unlock |

ROCm unlock: qwen3:14b on RX 9070 XT (Windows 11)
Vulkan (CPU fallback): 8.5 tok/s • ROCm gfx1201: 55.4 tok/s
+552% throughput gain.
Same hardware, same model; the driver stack accounts for the entire difference.
Fix: HIP SDK 7.1.1 plus the ROCBLAS_TENSILE_LIBPATH environment variable.
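
Since the fix depends on an environment variable, a quick check that it is visible to the process can rule out one failure mode before benchmarking. This is a minimal sketch: the variable name comes from the callout above, but the helper `check_tensile_libpath` and its return values are hypothetical, and no particular directory is assumed because the source does not state one.

```python
import os
from pathlib import Path


def check_tensile_libpath(env=None) -> str:
    """Report whether ROCBLAS_TENSILE_LIBPATH is set and points at a directory.

    Returns "unset", "missing-dir", or "ok". The expected contents of the
    directory are not checked, because the source does not describe them.
    """
    env = os.environ if env is None else env
    value = env.get("ROCBLAS_TENSILE_LIBPATH")
    if not value:
        return "unset"
    return "ok" if Path(value).is_dir() else "missing-dir"


if __name__ == "__main__":
    print("ROCBLAS_TENSILE_LIBPATH:", check_tensile_libpath())
```

Run it from the same shell or service account that launches the inference server, since a variable set in one session is not necessarily inherited by another.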

Linux (ROCm) vs Windows: qwen3:30b on Ryzen AI Max+ 395
Linux (ROCm 7.2): 62 tok/s • Windows 11: 11 tok/s
+464% throughput gain with ROCm on the same hardware.
Same chip, same RAM, same model weights; the only variable is the software stack (operating system and driver).
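
The percentage gains quoted in the two comparisons above follow from the standard relative-change formula, (new - old) / old. A short sketch, with the hypothetical helper name `percent_gain`, reproduces both figures from the stated tok/s values:

```python
def percent_gain(baseline: float, improved: float) -> float:
    """Relative throughput gain over the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# tok/s pairs taken from the callouts: (baseline, improved)
comparisons = {
    "qwen3:14b on RX 9070 XT, Vulkan -> ROCm": (8.5, 55.4),
    "qwen3:30b on Ryzen AI Max+ 395, Windows 11 -> Linux ROCm": (11.0, 62.0),
}

for label, (baseline, improved) in comparisons.items():
    gain = percent_gain(baseline, improved)
    speedup = improved / baseline
    print(f"{label}: +{gain:.0f}% ({speedup:.1f}x)")
```

A +552% gain is the same thing as roughly a 6.5x speedup, and +464% is roughly 5.6x; the two ways of stating it differ by exactly one multiple of the baseline.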

XDNA2 NPU: Qwen3-1.7B via Lemonade
131.8 tok/s • 0 W GPU draw
Runs entirely on the dedicated NPU, so the GPU stays free for heavier workloads.
Inference via Lemonade runtime on XDNA2 silicon.