Performance Benchmarks
Verified tokens/sec measurements with reproducible CUDA Event timing
LATEST RESULT
Llama-8B Decode
August 22, 2025 • RTX 5070 Ti
Baseline: 69.6 tokens/sec
Optimized: 185.2 tokens/sec
TPS gain (median): 2.66×
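The headline gain follows directly from the two TPS figures above:

```python
baseline_tps = 69.6    # tokens/sec before optimization
optimized_tps = 185.2  # tokens/sec after optimization

gain = optimized_tps / baseline_tps
print(f"TPS gain: {gain:.2f}x")  # 2.66x
```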
Methodology
Definitions
- Tokens/sec (TPS): 1000 / median_latency_ms, measured via CUDA events
- Small-batch pipelining (B > 1): per-token latency = elapsed time / B
- Autosel backends: attention may promote to TRITON; GEMV may demote for parity
- We report medians; the speedup fields in logs are advisory
- Verification with safe fallback on error; records include verification_ok
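The first two definitions can be sketched as follows (function names and the sample numbers are mine, for illustration; on a CUDA build each latency sample would come from a CUDA event pair rather than a wall clock):

```python
import statistics

def per_token_latency_ms(elapsed_ms: float, batch: int) -> float:
    # Small-batch pipelining: one timed region covers B tokens,
    # so per-token latency is the elapsed time divided by B.
    return elapsed_ms / batch

def tps_from_latencies(latencies_ms: list[float]) -> float:
    # TPS = 1000 / median per-token latency in milliseconds.
    return 1000.0 / statistics.median(latencies_ms)

# On a GPU, each sample would be measured roughly like:
#   start = torch.cuda.Event(enable_timing=True)
#   end   = torch.cuda.Event(enable_timing=True)
#   start.record(); decode_step(); end.record()
#   torch.cuda.synchronize()
#   sample_ms = start.elapsed_time(end)

samples = [per_token_latency_ms(e, 4) for e in (24.0, 26.0, 25.2)]
print(f"{tps_from_latencies(samples):.1f} tok/s")  # 158.7 tok/s
```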
Disclaimers
TPS varies between runs, and derived speedups are less stable. TPS is computed directly from median latency in code; the speedup fields in some outputs are advisory and are not used as the headline number.
Microbench vs end‑to‑end
An attention microbench gain (e.g., 0.664 ms vs 1.535 ms, ≈2.31×) does not imply the same full-decode speedup.
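An Amdahl's-law check makes the point concrete; the 40% attention share below is an invented number for illustration, not a measured profile:

```python
def end_to_end_speedup(accel_fraction: float, kernel_speedup: float) -> float:
    # Amdahl's law: only `accel_fraction` of decode time is accelerated
    # by `kernel_speedup`; the remaining fraction runs at the old speed.
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup)

attn_speedup = 1.535 / 0.664  # microbench: ~2.31x on attention alone
print(f"{end_to_end_speedup(0.40, attn_speedup):.2f}x")  # 1.29x, well below 2.31x
```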
Autosel and parity
Backends may promote/demote for speed/correctness; records include reasons and verification_ok.
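One way to picture this promote/demote logic (a sketch under my own naming, not the project's actual code): try candidate backends in preference order, verify each against a trusted reference, and fall back safely when parity fails:

```python
def autoselect(candidates, reference, x, atol=1e-6):
    # candidates: list of (name, fn) in preference order (fastest first).
    # Each candidate is verified against the trusted reference backend;
    # on error or parity mismatch we demote to the next one. The returned
    # record carries the reason and verification_ok, as the logs do.
    ref = reference(x)
    for name, fn in candidates:
        try:
            out = fn(x)
        except Exception:
            continue  # backend failed outright; demote to the next
        if abs(out - ref) <= atol:
            return out, {"backend": name, "reason": "parity ok",
                         "verification_ok": True}
    # Safe fallback: nothing passed parity, run the reference itself.
    return ref, {"backend": "reference", "reason": "fallback after parity failure",
                 "verification_ok": True}
```

For example, a candidate that returns wrong values is skipped and the record shows the reference backend with the fallback reason.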
Reproducibility
Certificates include git_sha, device, shape, knobs, med/p95 latencies, and repro_cmd; TPS is printed explicitly.
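A certificate record might look like the following; the field names come from the list above, while every value shown is invented for illustration:

```python
import json

certificate = {
    "git_sha": "0000000",            # placeholder commit hash
    "device": "RTX 5070 Ti",
    "shape": {"batch": 1, "model": "Llama-8B"},   # illustrative shape
    "knobs": {"backend": "TRITON"},  # illustrative knob
    "med_ms": 5.4,                   # median per-token latency
    "p95_ms": 6.1,
    "repro_cmd": "python bench.py",  # placeholder command
    "verification_ok": True,
}
# TPS is derived from the median latency and printed explicitly.
certificate["tps"] = 1000.0 / certificate["med_ms"]
print(json.dumps(certificate, indent=2))
```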
Don't see your workload? Request a Shape.