TPS, not hype: Certifying streaming decode performance
We publish tokens/sec (TPS) from median CUDA timing, along with JSONL certificates (each carrying a sha256 checksum) that you can reproduce locally. No marketing fluff, no cherry-picked numbers: just verifiable performance metrics.
Why TPS
Speedups vary with baselines and flags. Everyone's "2× faster" means something different. TPS = 1000 / median_latency_ms is direct and comparable.
When you see "185.2 tokens/sec", you know exactly what that means: the system generates 185.2 tokens per second. Not "up to" or "in ideal conditions"—that's the measured median performance.
We report per-token TPS even when small-batch pipelining is used (batch time divided by B). This gives you the actual per-token throughput, which is what matters for real deployments.
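As a minimal sketch of that arithmetic (the variable names are illustrative, not part of the tool's output), in Python:

# Illustrative numbers; the certificate reports these values directly.
B = 2                        # tokens emitted per pipelined step
batch_latency_ms = 10.798    # median time for one step that emits B tokens

per_token_ms = batch_latency_ms / B       # time attributed to each token (time/B)
tokens_per_sec = 1000.0 / per_token_ms    # TPS = 1000 / median_latency_ms
print(f"{tokens_per_sec:.1f} tok/s")      # ~185.2 with these example numbers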
How we measure
No magic, just rigorous methodology:
- CUDA Events with synchronize: Hardware-level timing, not CPU approximations
- N iterations: user-configurable; defaults are --iters 100 and --warmup 10
- Statistical robustness: Sort all measurements, take the median
- Formula:
TPS = 1000.0 / median_latency_ms
- Tail latency: p95 as companion metric for consistency
This isn't new—it's how GPU performance should always be measured. We just actually do it and publish the results.
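For reference, here is a minimal PyTorch sketch of the same recipe (CUDA events, warmup, sorted latencies, median and p95). The callable step(), which generates one token, is an illustrative stand-in rather than the tool's API:

import statistics
import torch

def measure_tps(step, iters=100, warmup=10):
    """Time step() with CUDA events and report median-based TPS plus p95."""
    for _ in range(warmup):          # warmup iterations are discarded
        step()
    torch.cuda.synchronize()

    latencies_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step()
        end.record()
        torch.cuda.synchronize()     # wait for the GPU before reading elapsed time
        latencies_ms.append(start.elapsed_time(end))

    latencies_ms.sort()
    med = statistics.median(latencies_ms)
    p95 = latencies_ms[int(0.95 * (len(latencies_ms) - 1))]
    return {"med_ms": med, "p95_ms": p95, "tokens_per_sec": 1000.0 / med}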
What's certified
Every run produces a JSONL certificate with these fields:
{
  "git_sha": "abc123...",
  "device": "NVIDIA GeForce RTX 5070 Ti",
  "sm": 120,
  "model": "llama-8b",
  "dtype": "bf16",
  "shape": {"L": 8192, "D": 128, "H": 32, "H_kv": 8},
  "knobs": {...},
  "iters": 50,
  "latency_ms": {"med": 5.399, "p95": 5.997},
  "baseline_ms": 14.366,
  "speedup": 2.66,
  "checksum": "0x8a3f...",
  "timestamp": "2025-08-23T10:30:45Z",
  "repro_cmd": "lopolith run --model llama-8b ...",
  "tokens_per_sec": 185.2,
  "autosel": {"gemv": "CUDA_EXT", "attn": "TRITON"},
  "per_op": {...}
}
Every field is measured, not estimated. The checksum ensures the certificate hasn't been modified. The repro command lets you verify locally.
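If you want to check a certificate yourself, here is a sketch of the idea in Python. It assumes the sha256 digest covers the JSON record with the checksum field removed and keys sorted, which may differ from the tool's actual canonicalization:

import hashlib
import json

def verify_certificate_line(line: str) -> bool:
    """Recompute sha256 over the record (checksum field removed, keys sorted)
    and compare against the embedded value. The canonicalization here is an
    assumption; follow the tool's documentation for the exact scheme."""
    cert = json.loads(line)
    claimed = cert.pop("checksum")
    canonical = json.dumps(cert, sort_keys=True, separators=(",", ":")).encode()
    return "0x" + hashlib.sha256(canonical).hexdigest() == claimed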
Reproduce it
Want to verify our numbers? Run this:
lopolith run --model llama-8b --dtype bf16 --seq 8192 --iters 50 --warmup 10
Optional flags to explore:
- --persistent: Enable persistent kernels (default on SM≥120)
- --small-batch 2: Pipeline with batch size 2 (reports per-token TPS)
- --no-prealloc: Disable preallocated IO buffers (may reduce memory at the cost of latency)
The tool outputs both human-readable results and a certificate you can verify.
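To pull the headline numbers out of the certificate programmatically, here is a short sketch (the certificates.jsonl path is an assumption; use whatever location the tool prints):

import json

# Read the most recent record from the JSONL certificate file.
with open("certificates.jsonl") as f:
    cert = json.loads(f.readlines()[-1])

print(cert["tokens_per_sec"], cert["latency_ms"]["med"], cert["latency_ms"]["p95"])
print(cert["repro_cmd"])   # rerun this command to reproduce the record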
Case study
Here's a real run from our testing:
Configuration: Llama-8B, bf16, 8k context (per-token metrics from B=2)
- Baseline: 69.6 tokens/sec
- Optimized: 185.2 tokens/sec
- Gain: ~2.66× TPS
Results vary by GPU, model, and flags. These measurements are from an RTX 5070 Ti running Llama-8B in bf16 at 8k context.
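The numbers are internally consistent, and you can check them from the certificate fields alone:

med_ms = 5.399                         # optimized per-token median (certificate)
baseline_ms = 14.366                   # baseline per-token median (certificate)

tps = 1000.0 / med_ms                  # ~185.2 tokens/sec
baseline_tps = 1000.0 / baseline_ms    # ~69.6 tokens/sec
gain = tps / baseline_tps              # ~2.66x, matching the reported speedup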
How we get there
The performance comes from principled optimization, not tricks:
Hardware fingerprinting → closed-form scheduler knobs. We profile your specific GPU's characteristics (L2 bandwidth, SFU throughput, etc.) and derive optimal scheduling parameters. No trial-and-error.
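As a conceptual sketch only (the fingerprint fields and the formulas below are illustrative assumptions, not Lopolith's actual scheduler):

from dataclasses import dataclass

@dataclass
class Fingerprint:
    # Measured once per GPU; the fields here are illustrative.
    sm_count: int
    l2_bw_gbs: float     # sustained L2 bandwidth
    dram_bw_gbs: float   # sustained DRAM bandwidth

def derive_knobs(fp: Fingerprint) -> dict:
    """Toy closed-form derivation: keep every SM occupied and size the software
    pipeline from the bandwidth ratio. The real formulas come from the tool."""
    blocks = 2 * fp.sm_count                                      # enough resident blocks to cover all SMs
    pipeline_depth = max(2, round(fp.l2_bw_gbs / fp.dram_bw_gbs)) # deeper pipeline when L2 outpaces DRAM
    return {"blocks": blocks, "pipeline_depth": pipeline_depth}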
Persistent decode kernels. Keep the kernel resident, eliminate launch overhead. Combined with preallocated IO buffers and fused-Q operations (all parity-guarded), we minimize memory traffic.
Small-batch pipelining. Process multiple tokens while reporting accurate per-token metrics. The hardware stays busy, you get true throughput numbers.
Automatic backend selection with safety. The system promotes backends when they're provably faster, demotes on any correctness issue. You always have the baseline available as fallback.
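A minimal sketch of that promote/demote policy (the function shape, names, and margin are illustrative assumptions, not the tool's internals):

def select_backend(candidates, baseline, run, parity_ok, margin=1.05):
    """Promote the fastest candidate only if it beats the current choice by
    `margin` and matches the baseline output; otherwise keep the baseline.
    run(backend) -> (median_ms, output); parity_ok(a, b) -> bool."""
    base_ms, base_out = run(baseline)
    chosen, chosen_ms = baseline, base_ms
    for backend in candidates:
        ms, out = run(backend)
        if ms * margin < chosen_ms and parity_ok(out, base_out):
            chosen, chosen_ms = backend, ms      # provably faster and numerically correct
        # a parity failure never promotes; the baseline stays available as fallback
    return chosen

In practice, parity_ok would be something like a torch.allclose comparison against the baseline output.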
Where time goes
Performance optimization starts with understanding where cycles are spent:
Perf-audit medians show attention dominates at long context. On our reference configuration:
- Attention: ~0.700ms (70% of time)
- GEMVs: ~0.047ms each
- RMSNorm: ~0.086ms
This is why attention optimization matters so much. Our TRITON backend achieves ~2.3× speedup over baseline on the attention kernel alone (0.664ms vs 1.535ms in microbenchmarks).
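The microbenchmark claim is easy to sanity-check from the quoted medians:

attn_baseline_ms, attn_triton_ms = 1.535, 0.664
attn_speedup = attn_baseline_ms / attn_triton_ms   # ~2.31x on the attention kernel alone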
Proof artifacts
Every optimization run generates three artifacts:
1. Certificate JSONL: Complete run metadata and measurements
2. Perf-audit JSON: Per-operation breakdown with roofline metrics
3. Chrome trace: Detailed timeline for profiling (optional)
Here's a certificate excerpt highlighting the key metrics:
{
  "tokens_per_sec": 185.2,
  "latency_ms": {
    "med": 5.399,
    "p95": 5.997
  },
  "verification_ok": true,
  "repro_cmd": "lopolith run --model llama-8b --dtype bf16 --seq 8192 --iters 50 --warmup 10"
}
What to expect next
We're building in the open. Coming posts will cover:
- Scheduler internals: How we derive optimal knobs from hardware properties
- Attention kernel deep-dive: Why TRITON wins and where CUTLASS struggles
- Integration guides: Step-by-step for PyTorch, vLLM, and TensorRT-LLM
The code is real, the measurements are honest, and everything is reproducible.
Generate your certificate
Ready to measure your own system? Two ways to start:
Sam, Founder of Lopolith
Have questions about our methodology? Email me directly at founder@lopolith.com