1. Mental model
This calculator estimates autoregressive decode speed for local LLM inference. Decode means the model generates one new token at a time; each token's activation passes through every transformer layer. If layers are distributed across GPUs, the activation moves through a GPU chain.
2. How to fill the calculator
Use the calculator as a hardware planning tool. Start with real model values, then real GPU specifications, then calibrate uncertain values such as effective TFLOPS and activation transfer size.
| Input | What to enter | Where to find it |
|---|---|---|
| Total Parameters | Total stored parameters. For Dense, this is the same number you normally see in the model name. For MoE, this includes all experts, even inactive ones. | Model card, paper, config file, architecture report, Hugging Face page. |
| Active Parameters / Token | For Dense, use total parameters. For MoE, use the active parameter count per token. If unknown, estimate from active experts plus shared dense parts. | Look for "active parameters", "experts per token", "top-k", "routed experts", "shared experts". |
| Bytes per Parameter | Approximate quantized storage size. FP16/BF16 = 2, FP8/Q8 = 1, Q4 = 0.5, Q3 = 0.375 before metadata overhead. | Quantization format, model filename, runtime loader output. |
| KV Cache Read + Storage | A compact estimate that combines KV cache storage pressure and per-token KV read pressure. Increase it with context length, batch size and parallel sessions. | Runtime memory report, engine logs, or estimate from layers, context, KV heads and dtype (see the sketch after this table). |
| FLOPs per Active Parameter | Use 2 as a practical first approximation for multiply-add dominated inference. Increase if you want to include additional overhead. | Keep default first, then calibrate against measured tokens/s. |
| GPU VRAM | Physical VRAM minus any reservation you want to exclude. GPU 1 typically also loses some VRAM to the OS display buffer. | GPU specs, nvidia-smi, GPU-Z, HWiNFO. |
| GPU Memory Bandwidth | VRAM bandwidth, not PCIe bandwidth. | Search "GPU name memory bandwidth". |
| Effective TFLOPS | Realistic compute throughput for your inference workload. Usually lower than peak spec-sheet TFLOPS. | Start from specs, reduce conservatively, then calibrate with benchmark. |
| PCIe Gen and Lanes | Electrical PCIe generation and lane count for each GPU. Physical slot size can differ from electrical lane count. | Motherboard manual, BIOS, GPU-Z, HWiNFO, lspci, nvidia-smi topo -m. |
| Activation Transfer | Approximate activation payload crossing device boundaries per generated token. Increase for batching or multiple sessions. | Use default first. Calibrate if PCIe transfer seems too low or too high. |
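If the runtime does not report KV usage, here is a minimal sketch of the standard estimate, with hypothetical model values (read the real ones from the config file):

```python
# Rough KV cache size for a transformer with grouped-query attention.
layers      = 32      # transformer layers
kv_heads    = 8       # KV heads (GQA), not query heads
head_dim    = 128     # dimension per head
context     = 8192    # tokens held in cache
dtype_bytes = 2       # FP16/BF16 cache; 1 for FP8
batch       = 1       # parallel sequences

# K and V each store layers * context * kv_heads * head_dim elements.
kv_bytes = 2 * layers * context * kv_heads * head_dim * dtype_bytes * batch
print(f"KV cache ≈ {kv_bytes / 1e9:.2f} GB")  # ≈ 1.07 GB for these values
```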
3. Dense vs MoE
Dense model
Nearly every parameter participates for every token. A 70B dense model behaves roughly like 70B active parameters per generated token.
MoE model
The model stores many experts, but the router activates only selected experts for each token. Storage follows total parameters. Compute and active reads follow active parameters.
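A minimal sketch of the split, using approximate Mixtral-8x7B-style figures (2 of 8 experts active) as example values:

```python
# Storage follows total parameters; per-token reads and compute follow active.
total_params  = 46.7e9   # all experts stored (approximate example)
active_params = 12.9e9   # routed experts + shared parts per token
bytes_per_w   = 0.5      # Q4 quantization

storage_gb  = total_params * bytes_per_w / 1e9
read_gb_tok = active_params * bytes_per_w / 1e9   # GB read per token
flops_tok   = 2 * active_params                   # 2 FLOPs per active weight
print(f"storage ≈ {storage_gb:.1f} GB, reads ≈ {read_gb_tok:.1f} GB/token, "
      f"compute ≈ {flops_tok / 1e9:.1f} GFLOPs/token")
```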
4. Roofline timing
Each device block has a memory time and a compute time. The larger one wins. This is the simplified Roofline rule used by the calculator.
If the box says BW bound, memory bandwidth dominates. If it says FLOPS bound, compute dominates. Effective TFLOPS should normally be lower than theoretical peak because real inference kernels are not ideal peak benchmarks.
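A minimal sketch of the rule, with hypothetical per-block values (the bandwidth is RTX 3090-class; the effective TFLOPS is a calibrated guess, not a peak number):

```python
# Simplified Roofline: block time is the max of memory time and compute time.
bytes_read = 6.45e9   # weight + KV bytes this block reads per token
flops      = 25.8e9   # FLOPs this block performs per token
mem_bw     = 936e9    # B/s, e.g. an RTX 3090-class GPU
eff_flops  = 40e12    # FLOP/s, calibrated effective throughput (not peak)

mem_time     = bytes_read / mem_bw   # seconds
compute_time = flops / eff_flops     # seconds
block_time   = max(mem_time, compute_time)
bound        = "BW bound" if mem_time >= compute_time else "FLOPS bound"
print(f"block ≈ {block_time * 1e3:.2f} ms ({bound})")  # ≈ 6.89 ms, BW bound
```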
5. Layer split chain
The calculator models contiguous layer blocks. GPU 1 may receive the first part of the network, GPU 2 the next part, GPU 3 the next part, and so on. For each generated token, the activation travels through the chain.
This is why the visual boxes show processing time per block and the lines show transfer time. If everything fits in VRAM, PCIe usually carries activations, not full model weights.
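A minimal sketch of the per-token latency for such a chain, with hypothetical block and transfer times:

```python
# Per-token latency in a layer-split chain: block times plus transfer hops.
block_times_ms    = [6.9, 6.9, 14.0]  # hypothetical GPU 1, GPU 2, CPU block
transfer_times_ms = [0.1, 0.3]        # activation hops between adjacent blocks

latency_ms = sum(block_times_ms) + sum(transfer_times_ms)
print(f"≈ {latency_ms:.1f} ms/token, ≈ {1000 / latency_ms:.1f} tok/s")
```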
6. Tensor split and tensor parallelism
Tensor split means a single layer's tensors are divided across multiple GPUs. Instead of GPU 1 owning layers 0 to 20 and GPU 2 owning layers 21 to 40, both GPUs participate inside the same layer. This can combine compute and memory capacity, but it also introduces frequent communication inside every layer.
Layer split
Communication happens mainly between layer blocks. It is usually one activation transfer per boundary per token. This calculator models this mode directly.
Tensor split
Communication can happen inside every layer, often through all-reduce, all-gather or reduce-scatter style operations. PCIe can become much more important.
How to use this calculator for tensor split: do not treat the result as exact. Treat it as a rough upper bound that is only approached when the interconnect is very fast and scaling is efficient. For consumer GPUs connected through PCIe, tensor split is often communication-limited, so real speed can be lower than a layer-split estimate.
| Situation | How to approximate |
|---|---|
| Runtime uses layer/block split | Use the calculator normally. Enter each GPU separately. |
| Runtime uses tensor split across PCIe GPUs | Use normal GPU values, but expect extra communication not fully modeled. Increase Activation Transfer and reduce Effective TFLOPS to calibrate. |
| Runtime uses tensor split over NVLink/high-speed interconnect | The model can be closer after calibration. Effective TFLOPS may be higher because all GPUs cooperate inside layers. |
| You want a quick virtual-GPU approximation | Sum VRAM, estimate combined TFLOPS and bandwidth with an efficiency factor, then calibrate (see the sketch below). The calculator keeps physical GPUs because it is primarily a layer-chain estimator. |
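A minimal sketch of that last approximation, with an assumed efficiency factor (the 0.7 is a guess to be tuned against a measured run):

```python
# Collapse two physical GPUs into one "virtual GPU" for a rough tensor-split
# estimate. All values are hypothetical; the efficiency factor is an assumption.
vram_gb    = [24, 24]      # per-GPU VRAM
tflops     = [40, 40]      # per-GPU effective TFLOPS
bw_gbs     = [936, 936]    # per-GPU memory bandwidth (GB/s)
efficiency = 0.7           # communication/imbalance penalty, assumed

virtual = {
    "vram_gb": sum(vram_gb),              # capacity genuinely adds up
    "tflops":  sum(tflops) * efficiency,  # compute scales imperfectly
    "bw_gbs":  sum(bw_gbs) * efficiency,  # so does usable bandwidth
}
print(virtual)  # enter these as a single GPU, then calibrate
```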
7. Parallel multi-session inference
A single chat session sends one token through the full chain, then the next token. Multiple sessions can be interleaved or batched, so several tokens from different sessions can occupy different stages of the GPU chain at the same time.
Latency estimate
For one interactive session, use the calculator normally. The token must pass through all blocks, so latency is the sum of block and transfer times.
Throughput estimate
For many sessions, steady-state throughput can become limited by the slowest stage rather than the sum of all stages, but only if batching/pipelining is actually used.
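A minimal sketch of the two regimes, with hypothetical stage times; the pipelined ceiling only applies if the engine actually overlaps stages:

```python
# Single-session latency sums all stages; pipelined throughput is limited by
# the slowest stage. Stage times below are hypothetical per-block values (ms).
stage_times_ms = [6.9, 6.9, 14.0]

single_session_tps = 1000 / sum(stage_times_ms)  # token walks every stage
pipelined_ceiling  = 1000 / max(stage_times_ms)  # stages kept busy in parallel
print(f"1 session ≈ {single_session_tps:.1f} tok/s, "
      f"pipelined ceiling ≈ {pipelined_ceiling:.1f} tok/s")
```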
How to fill the calculator for multiple sessions (a sketch follows the list):
- Multiply KV Cache by the number of active sessions or by the effective batch size.
- Increase Activation Transfer roughly with batch size because more activations cross the GPU chain.
- Raise Effective TFLOPS only if your engine really becomes more compute-efficient with batching.
- Interpret the top tokens/s metric as single-session decode speed unless you manually calibrate for batching.
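A minimal sketch of those adjustments, with hypothetical single-session values:

```python
# Scale single-session inputs for N interleaved sessions (rule of thumb only).
n_sessions           = 4
kv_cache_gb_single   = 1.1   # hypothetical single-session KV estimate
activation_mb_single = 0.5   # hypothetical per-token activation payload

kv_cache_gb = kv_cache_gb_single * n_sessions    # more cached context resident
transfer_mb = activation_mb_single * n_sessions  # more activations per hop
print(f"KV ≈ {kv_cache_gb:.1f} GB, activation transfer ≈ {transfer_mb:.1f} MB")
```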
8. PCIe, lanes and P2P
Each GPU has its own host link. The calculator converts PCIe generation and lane count into approximate one-direction payload bandwidth.
| Generation | Approx payload per lane per direction | x16 approximate |
|---|---|---|
| PCIe 3.0 | 0.985 GB/s | 15.75 GB/s |
| PCIe 4.0 | 1.969 GB/s | 31.50 GB/s |
| PCIe 5.0 | 3.938 GB/s | 63.01 GB/s |
| PCIe 6.0 | 7.563 GB/s | 121.01 GB/s |
PCIe matters little when only small activations move between a few GPU blocks. It matters a lot when CPU/RAM participates, P2P is disabled, activation size is large, or true tensor parallelism communicates inside every layer.
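A minimal sketch of the conversion, using the per-lane values from the table above and a hypothetical activation payload:

```python
# Approximate one-direction PCIe payload bandwidth and per-hop transfer time.
GB_S_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938, 6: 7.563}  # from table above

def transfer_ms(payload_mb: float, gen: int, lanes: int) -> float:
    bw_gb_s = GB_S_PER_LANE[gen] * lanes       # one-direction payload bandwidth
    return payload_mb / 1000 / bw_gb_s * 1000  # MB -> GB, seconds -> ms

# Hypothetical 0.5 MB activation hop over PCIe 4.0 x4:
print(f"{transfer_ms(0.5, 4, 4):.3f} ms")      # ≈ 0.063 ms
```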
9. CPU and RAM
In this calculator, CPU/RAM means spilled layers are executed by the CPU and read from system RAM. PCIe is used only when activations cross the boundary between GPU blocks and CPU blocks.
10. Reading results
- Single-session throughput: estimated tokens per second for one autoregressive session using the best estimated layer-chain hardware split.
- Latency: estimated single-token latency for one session.
- Layer Split: layer count assigned to each used device block.
- Placement Status: whether all layers fit into GPU VRAM or CPU/RAM is used.
- Box time: block processing time on that device, excluding the transfer times shown on the lines between boxes.
- BW bound / FLOPS bound: which part of the Roofline estimate dominates that block.
11. Calibration workflow
- Choose one real model, quantization and context length.
- Measure real decode tokens/s with your inference engine.
- Enter exact GPU VRAM, memory bandwidth and PCIe lane configuration.
- Adjust Effective TFLOPS and Activation Transfer until the estimate is close (a back-out sketch follows this list).
- Use the calibrated setup to compare GPU combinations, PCIe lane layouts and MoE/Dense alternatives.
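One way to seed the adjustment step, assuming the setup is bandwidth-bound: back out the bandwidth the run actually achieved, then compare it with the value you entered (hypothetical numbers below):

```python
# Bandwidth-bound decode implies: tok/s ≈ effective_bw / bytes_read_per_token.
measured_tps    = 28.0     # from your inference engine
bytes_per_token = 23.4e9   # active weight + KV bytes per token (estimated)

effective_bw_gb = measured_tps * bytes_per_token / 1e9
print(f"achieved ≈ {effective_bw_gb:.0f} GB/s")  # compare with entered value
# The same back-out works for Effective TFLOPS when a block is FLOPS bound.
```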
Estimate too fast?
Lower Effective TFLOPS, increase Activation Transfer, increase KV Cache, or account for framework overhead.
Estimate too slow?
Raise Effective TFLOPS, reduce Activation Transfer, verify P2P mode, or check quantization bytes.
12. Limits and expectations
This is a compact theoretical estimator. It does not fully simulate kernel launch overhead, CUDA graph behavior, NUMA, page faults, exact KV-cache kernels, tensor-parallel collectives, speculative decoding, continuous batching, prompt prefill, driver behavior, or engine-specific optimizations.
Good use cases
Hardware iteration, rough bottleneck discovery, checking if a model fits, comparing PCIe lane configurations, estimating whether an extra GPU helps, and explaining why a setup is bandwidth-bound or compute-bound.
Bad use cases
Claiming exact production throughput, modeling high-batch server inference without calibration, predicting prompt prefill speed, or treating tensor split over PCIe as identical to layer split.