Performance Benchmarks

Verified tokens/sec measurements with reproducible CUDA Event timing

LATEST RESULT

Llama-8B Decode

August 22, 2025 • RTX 5070 Ti

  • Baseline: 69.6 tokens/sec
  • Optimized: 185.2 tokens/sec
  • TPS gain (median): 2.66×

1 result

Methodology

Definitions

  • Tokens/sec (TPS): 1000 / median_latency_ms, measured via CUDA events
  • Small‑batch pipelining (B>1): per‑token latency = elapsed / B
  • Autosel backends: attention may promote to TRITON; GEMV may demote for parity
  • We report medians; speedup fields in logs are advisory
  • Verification with safe fallback on error; records include verification_ok
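The definitions above can be sketched as a small helper. This is a minimal illustration (`tokens_per_sec` is a hypothetical name); in the real harness the latencies come from CUDA event timing:

```python
from statistics import median

def tokens_per_sec(latencies_ms, batch=1):
    """TPS = 1000 / median per-token latency (ms).

    `latencies_ms` holds per-iteration elapsed times; for small-batch
    pipelining (B > 1), elapsed time is divided by B to get the
    per-token latency before taking the median.
    """
    per_token_ms = [t / batch for t in latencies_ms]
    return 1000.0 / median(per_token_ms)

# A median per-token latency of 14.37 ms corresponds to ~69.6 TPS.
print(round(tokens_per_sec([14.5, 14.37, 14.2]), 1))  # → 69.6
```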

Disclaimers

TPS varies; speedups are less stable

TPS is computed directly from the median latency measured in code. Speedup fields emitted in logs are advisory and are not used as headline numbers.

Microbench vs end‑to‑end

Attention microbench results (e.g., 0.664 ms vs 1.535 ms → ~2.31×) do not imply the same speedup for full decode.
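Amdahl's law illustrates why a kernel-level win shrinks end to end. In the sketch below, the 40% attention share of a decode step is an assumed number for illustration, not a measurement:

```python
def end_to_end_speedup(fraction, kernel_speedup):
    # Amdahl's law: only `fraction` of the decode step is accelerated;
    # the remaining (1 - fraction) runs at the original speed.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

# If attention were 40% of a decode step, a 2.31x microbench win
# would yield only ~1.29x overall.
print(round(end_to_end_speedup(0.40, 2.31), 2))  # → 1.29
```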
Autosel and parity

Backends may promote or demote for speed or correctness; records include the reasons and verification_ok.

Reproducibility

Certificates include git_sha, device, shape, knobs, med/p95 latency, and repro_cmd; TPS is printed explicitly.
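A certificate record might look like the sketch below. All values are illustrative, and any structure beyond the fields named above (git_sha, device, shape, knobs, med/p95, repro_cmd, verification_ok) is an assumption:

```python
import json

# Hypothetical certificate record; every value here is illustrative only.
certificate = {
    "git_sha": "abc1234",                # commit the run was built from
    "device": "RTX 5070 Ti",
    "shape": {"model": "Llama-8B", "batch": 1},
    "knobs": {"attention_backend": "TRITON"},
    "med_ms": 5.40,                      # median latency
    "p95_ms": 5.62,                      # 95th-percentile latency
    "tps": round(1000.0 / 5.40, 1),      # TPS printed explicitly
    "verification_ok": True,
    "repro_cmd": "python bench.py --model llama-8b --batch 1",
}
print(json.dumps(certificate, indent=2))
```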

Don't see your workload?

Request a Shape