Performance Benchmarks

Verified tokens/sec measurements with reproducible CUDA Event timing

LATEST RESULT

Llama-8B Decode

August 22, 2025 • RTX 5070 Ti

  • Baseline: 69.6 tokens/sec
  • Optimized: 185.2 tokens/sec
  • TPS gain (median): 2.66×

1 result

Methodology

Definitions

  • Tokens/sec (TPS): 1000 / median_latency_ms, measured via CUDA events
  • Small‑batch pipelining (B>1): per‑token latency = elapsed / B
  • Autosel backends: attention may promote to TRITON; GEMV may demote for parity
  • We report medians; speedup fields in logs are advisory
  • Verification with safe fallback on error; records include verification_ok
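The definitions above can be sketched as a small helper. This is a minimal illustration (`tokens_per_sec` is a hypothetical name); in the real harness the latencies come from CUDA event timing:

```python
from statistics import median

def tokens_per_sec(latencies_ms, batch=1):
    """TPS = 1000 / median per-token latency (ms).

    `latencies_ms` holds per-iteration elapsed times; for small-batch
    pipelining (B > 1), elapsed time is divided by B to get the
    per-token latency before taking the median.
    """
    per_token_ms = [t / batch for t in latencies_ms]
    return 1000.0 / median(per_token_ms)

# A median per-token latency of 14.37 ms corresponds to ~69.6 TPS.
print(round(tokens_per_sec([14.5, 14.37, 14.2]), 1))  # → 69.6
```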

Disclaimers

TPS varies; speedups are less stable

TPS is computed directly from the median latency measured in code. Speedup fields emitted in logs are advisory and are not used as headline numbers.

Microbench vs end‑to‑end

Attention microbench results (e.g., 0.664 ms vs 1.535 ms → ~2.31×) do not imply the same speedup for full decode.
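Amdahl's law illustrates why a kernel-level win shrinks end to end. In the sketch below, the 40% attention share of a decode step is an assumed number for illustration, not a measurement:

```python
def end_to_end_speedup(fraction, kernel_speedup):
    # Amdahl's law: only `fraction` of the decode step is accelerated;
    # the remaining (1 - fraction) runs at the original speed.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

# If attention were 40% of a decode step, a 2.31x microbench win
# would yield only ~1.29x overall.
print(round(end_to_end_speedup(0.40, 2.31), 2))  # → 1.29
```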
Autosel and parity

Backends may promote or demote for speed or correctness; records include the reasons and verification_ok.

Reproducibility

Certificates include git_sha, device, shape, knobs, med/p95 latency, and repro_cmd; TPS is printed explicitly.
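A certificate record might look like the sketch below. All values are illustrative, and any structure beyond the fields named above (git_sha, device, shape, knobs, med/p95, repro_cmd, verification_ok) is an assumption:

```python
import json

# Hypothetical certificate record; every value here is illustrative only.
certificate = {
    "git_sha": "abc1234",                # commit the run was built from
    "device": "RTX 5070 Ti",
    "shape": {"model": "Llama-8B", "batch": 1},
    "knobs": {"attention_backend": "TRITON"},
    "med_ms": 5.40,                      # median latency
    "p95_ms": 5.62,                      # 95th-percentile latency
    "tps": round(1000.0 / 5.40, 1),      # TPS printed explicitly
    "verification_ok": True,
    "repro_cmd": "python bench.py --model llama-8b --batch 1",
}
print(json.dumps(certificate, indent=2))
```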

Don't see your workload?

Request a Shape