TPS, not hype: Certifying streaming decode performance

August 23, 2025 · 8 min read · Sam, Founder of Lopolith
Performance · Verification · Benchmarks

We publish tokens/sec (TPS) from median CUDA timing, together with JSONL certificates carrying a sha256 checksum that you can reproduce locally. No marketing fluff, no cherry-picked numbers, just verifiable performance metrics.

Why TPS

Speedups vary with baselines and flags. Everyone's "2× faster" means something different. TPS = 1000 / median_latency_ms is direct and comparable.

When you see "185.2 tokens/sec", you know exactly what that means: the system generates 185.2 tokens per second. Not "up to" or "in ideal conditions"—that's the measured median performance.

We report per-token TPS even when small-batch pipelining is used: the measured batch time is divided by the batch size B before computing TPS. This gives you the actual per-token throughput, which is what matters for real deployments.
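
As a concrete check, the certificate later in this post reports a median per-token latency of 5.399 ms and 185.2 tokens/sec. The short sketch below (plain Python, not Lopolith code; the B=2 batch latency shown is hypothetical) relates the two numbers, including the per-token division when pipelining is used.

def tokens_per_sec(batch_latency_ms: float, batch_size: int = 1) -> float:
    """Per-token TPS: divide the measured batch latency by B, then invert."""
    per_token_ms = batch_latency_ms / batch_size
    return 1000.0 / per_token_ms

# Single-token decode: a 5.399 ms median gives the 185.2 TPS reported later.
print(round(tokens_per_sec(5.399), 1))        # 185.2
# Small-batch pipelining (hypothetical 10.798 ms for B=2): still per-token TPS.
print(round(tokens_per_sec(10.798, 2), 1))    # 185.2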

How we measure

No magic, just rigorous methodology:

  • CUDA Events with synchronize: Hardware-level timing, not CPU approximations
  • N iterations: User-configurable; defaults are --iters 100 and --warmup 10
  • Statistical robustness: Sort all measurements, take the median
  • Formula: TPS = 1000.0 / median_latency_ms
  • Tail latency: p95 as companion metric for consistency

This isn't new—it's how GPU performance should always be measured. We just actually do it and publish the results.
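
For reference, here is a minimal sketch of that measurement loop using PyTorch's CUDA events. It is not Lopolith's harness; decode_step is a placeholder for one decode iteration.

import torch

def benchmark(decode_step, iters=100, warmup=10):
    # Warmup runs are excluded from the statistics.
    for _ in range(warmup):
        decode_step()
    torch.cuda.synchronize()

    latencies_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        decode_step()
        end.record()
        end.synchronize()                      # hardware-level timing, not CPU clocks
        latencies_ms.append(start.elapsed_time(end))

    latencies_ms.sort()
    med = latencies_ms[len(latencies_ms) // 2]
    p95 = latencies_ms[min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))]
    return {"med_ms": med, "p95_ms": p95, "tokens_per_sec": 1000.0 / med}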

What's certified

Every run produces a JSONL certificate with these fields:

{
  "git_sha": "abc123...",
  "device": "NVIDIA GeForce RTX 5070 Ti",
  "sm": 120,
  "model": "llama-8b",
  "dtype": "bf16",
  "shape": {"L": 8192, "D": 128, "H": 32, "H_kv": 8},
  "knobs": {...},
  "iters": 50,
  "latency_ms": {"med": 5.399, "p95": 5.997},
  "baseline_ms": 14.366,
  "speedup": 2.66,
  "checksum": "0x8a3f...",
  "timestamp": "2025-08-23T10:30:45Z",
  "repro_cmd": "lopolith run --model llama-8b ...",
  "tokens_per_sec": 185.2,
  "autosel": {"gemv": "CUDA_EXT", "attn": "TRITON"},
  "per_op": {...}
}

Every field is measured, not estimated. The checksum ensures the certificate hasn't been modified. The repro command lets you verify locally.
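
As an illustration of what a local check can look like, the sketch below reads a certificate line and cross-checks the reported TPS against the median latency. The field names follow the example above; the filename is hypothetical, and the checksum step is omitted because its exact input is not specified here.

import json

def check_certificate(path: str) -> None:
    # Each line of the JSONL file is one certificate.
    with open(path) as f:
        for line in f:
            cert = json.loads(line)
            med_ms = cert["latency_ms"]["med"]
            reported = cert["tokens_per_sec"]
            derived = 1000.0 / med_ms
            ok = abs(derived - reported) / reported < 0.01   # within 1%
            print(f'{cert["model"]} on {cert["device"]}: '
                  f'{reported} TPS reported, {derived:.1f} derived, ok={ok}')

check_certificate("certs.jsonl")   # hypothetical filename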

Reproduce it

Want to verify our numbers? Run this:

lopolith run --model llama-8b --dtype bf16 --seq 8192 --iters 50 --warmup 10

Optional flags to explore:

  • --persistent: Enable persistent kernels (default on SM≥120)
  • --small-batch 2: Pipeline with batch size 2 (reports per-token TPS)
  • --no-prealloc: Disable preallocated IO buffers (may reduce memory at the cost of latency)

The tool outputs both human-readable results and a certificate you can verify.

Case study

Here's a real run from our testing:

Configuration: Llama-8B, bf16, 8k context (per-token metrics from B=2)

  • Baseline: 69.6 tokens/sec
  • Optimized: 185.2 tokens/sec
  • Gain: ~2.66× TPS

Per-module median latency (ms):

  • Attention: 0.700
  • Q GEMV: 0.047
  • O GEMV: 0.047
  • RMSNorm: 0.086

Results vary by GPU, model, and flags. These measurements are from an RTX 5070 Ti running Llama-8B in bf16 at 8k context.

How we get there

The performance comes from principled optimization, not tricks:

Hardware fingerprinting → closed-form scheduler knobs. We profile your specific GPU's characteristics (L2 bandwidth, SFU throughput, etc.) and derive optimal scheduling parameters. No trial-and-error.
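
To make the idea concrete, here is an illustrative sketch of deriving knobs in closed form from static device properties. The formulas and knob names are placeholders for illustration, not Lopolith's actual derivations.

import torch

def derive_knobs(device: int = 0) -> dict:
    # Query static hardware properties once; no trial-and-error search.
    props = torch.cuda.get_device_properties(device)
    sm = props.major * 10 + props.minor          # e.g. 120 on an RTX 5070 Ti
    return {
        # Hypothetical knob derivations for illustration only:
        "persistent": sm >= 120,                 # persistent kernels default on SM>=120
        "resident_ctas": props.multi_processor_count,   # e.g. one resident CTA per SM
        "prealloc_io": True,                     # preallocated IO buffers on by default
    }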

Persistent decode kernels. Keep the kernel resident, eliminate launch overhead. Combined with preallocated IO buffers and fused-Q operations (all parity-guarded), we minimize memory traffic.

Small-batch pipelining. Process multiple tokens while reporting accurate per-token metrics. The hardware stays busy, you get true throughput numbers.

Automatic backend selection with safety. The system promotes backends when they're provably faster, demotes on any correctness issue. You always have the baseline available as fallback.
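
Schematically, that promote/demote policy looks like the sketch below. The function names (check_parity, bench) are placeholders for whatever correctness and timing checks the system runs, not Lopolith's API.

def select_backend(candidates: dict, baseline, check_parity, bench) -> str:
    # Start from the always-available baseline; promote only on proof.
    best_name, best_ms = "BASELINE", bench(baseline)
    for name, impl in candidates.items():
        if not check_parity(impl, baseline):
            continue                     # demote on any correctness issue
        ms = bench(impl)
        if ms < best_ms:                 # promote only when measurably faster
            best_name, best_ms = name, ms
    return best_name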

Where time goes

Performance optimization starts with understanding where cycles are spent:

Perf-audit medians show attention dominates at long context. On our reference configuration:

  • Attention: ~0.700 ms (70% of time)
  • GEMVs: ~0.047 ms each
  • RMSNorm: ~0.086 ms

This is why attention optimization matters so much. Our TRITON backend achieves ~2.3× speedup over baseline on the attention kernel alone (0.664ms vs 1.535ms in microbenchmarks).

Proof artifacts

Every optimization run generates three artifacts:

  1. Certificate JSONL: Complete run metadata and measurements
  2. Perf-audit JSON: Per-operation breakdown with roofline metrics
  3. Chrome trace: Detailed timeline for profiling (optional; see the sketch below)
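
For context, Chrome-compatible traces can be produced with PyTorch's profiler as sketched here. This is a generic example of the artifact format, not Lopolith's internal tracer; decode_step is again a placeholder for one decode iteration.

import torch
from torch.profiler import profile, ProfilerActivity

def capture_trace(decode_step, path: str = "decode_trace.json") -> None:
    # Record CPU and CUDA activity for a few decode steps, then export a
    # Chrome trace viewable in chrome://tracing or Perfetto.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(5):
            decode_step()
    prof.export_chrome_trace(path)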

Here's a certificate excerpt highlighting the key metrics:

{
  "tokens_per_sec": 185.2,
  "latency_ms": {
    "med": 5.399,
    "p95": 5.997
  },
  "verification_ok": true,
  "repro_cmd": "lopolith run --model llama-8b --dtype bf16 --seq 8192 --iters 50 --warmup 10"
}

What to expect next

We're building in the open. Coming posts will cover:

  • Scheduler internals: How we derive optimal knobs from hardware properties
  • Attention kernel deep-dive: Why TRITON wins and where CUTLASS struggles
  • Integration guides: Step-by-step for PyTorch, vLLM, and TensorRT-LLM

The code is real, the measurements are honest, and everything is reproducible.

Generate your certificate

Ready to measure your own system? Two ways to start:

  • Generate your certificate →
  • View proof →

Sam, Founder of Lopolith

Have questions about our methodology? Email me directly at founder@lopolith.com