1. Mental model
This calculator estimates autoregressive decode speed for local LLM inference. Decode means the model generates one new token at a time; each token's activation passes through every transformer layer. If layers are distributed across GPUs, the activation moves through a GPU chain.
2. How to fill the calculator
Use the calculator as a hardware planning tool. Start with real model values, then real GPU specifications, then calibrate uncertain values such as effective TFLOPS and activation transfer size.
| Input | What to enter | Where to find it |
|---|---|---|
| Total Parameters | Total stored parameters. For Dense, this is the same number you normally see in the model name. For MoE, this includes all experts, even inactive ones. | Model card, paper, config file, architecture report, Hugging Face page. |
| Active Parameters / Token | For Dense, use total parameters. For MoE, use the active parameter count per token. If unknown, estimate from active experts plus shared dense parts. | Look for "active parameters", "experts per token", "top-k", "routed experts", "shared experts". |
| Bytes per Parameter | Approximate quantized storage size. FP16/BF16 = 2, FP8/Q8 = 1, Q4 = 0.5, Q3 = 0.375 before metadata overhead. | Quantization format, model filename, runtime loader output. |
| KV Cache Read + Storage | A compact estimate that combines KV cache storage pressure and per-token KV read pressure. Increase it with context length, batch size and parallel sessions. | Runtime memory report, engine logs, or estimate from layers, context, KV heads and dtype (see the sketch after this table). |
| FLOPs per Active Parameter | Use 2 as a practical first approximation for multiply-add dominated inference. Increase if you want to include additional overhead. | Keep default first, then calibrate against measured tokens/s. |
| GPU VRAM | Physical VRAM minus any reservation you want to exclude. GPU 1 typically also loses some VRAM to the OS display buffer. | GPU specs, nvidia-smi, GPU-Z, HWiNFO. |
| GPU Memory Bandwidth | VRAM bandwidth, not PCIe bandwidth. | Search "GPU name memory bandwidth". |
| Effective TFLOPS | Realistic compute throughput for your inference workload. Usually lower than peak spec-sheet TFLOPS. | Start from specs, reduce conservatively, then calibrate with benchmark. |
| PCIe Gen and Lanes | Electrical PCIe generation and lane count for each GPU. Physical slot size can differ from electrical lane count. | Motherboard manual, BIOS, GPU-Z, HWiNFO, lspci, nvidia-smi topo -m. |
| Activation Transfer | Approximate activation payload crossing device boundaries per generated token. Increase for batching or multiple sessions. | Use default first. Calibrate if PCIe transfer seems too low or too high. |
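If the runtime does not report KV usage, here is a minimal sketch of the standard estimate, with hypothetical model values (read the real ones from the config file):

```python
# Rough KV cache size for a transformer with grouped-query attention.
layers      = 32      # transformer layers
kv_heads    = 8       # KV heads (GQA), not query heads
head_dim    = 128     # dimension per head
context     = 8192    # tokens held in cache
dtype_bytes = 2       # FP16/BF16 cache; 1 for FP8
batch       = 1       # parallel sequences

# K and V each store layers * context * kv_heads * head_dim elements.
kv_bytes = 2 * layers * context * kv_heads * head_dim * dtype_bytes * batch
print(f"KV cache ≈ {kv_bytes / 1e9:.2f} GB")  # ≈ 1.07 GB for these values
```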
3. Dense vs MoE
Dense model
Nearly every parameter participates for every token. A 70B dense model behaves roughly like 70B active parameters per generated token.
MoE model
The model stores many experts, but the router activates only selected experts for each token. Storage follows total parameters. Compute and active reads follow active parameters.
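A minimal sketch of the split, using approximate Mixtral-8x7B-style figures (2 of 8 experts active) as example values:

```python
# Storage follows total parameters; per-token reads and compute follow active.
total_params  = 46.7e9   # all experts stored (approximate example)
active_params = 12.9e9   # routed experts + shared parts per token
bytes_per_w   = 0.5      # Q4 quantization

storage_gb  = total_params * bytes_per_w / 1e9
read_gb_tok = active_params * bytes_per_w / 1e9   # GB read per token
flops_tok   = 2 * active_params                   # 2 FLOPs per active weight
print(f"storage ≈ {storage_gb:.1f} GB, reads ≈ {read_gb_tok:.1f} GB/token, "
      f"compute ≈ {flops_tok / 1e9:.1f} GFLOPs/token")
```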
4. Roofline timing
Each device block has a memory time and a compute time. The larger one wins. This is the simplified Roofline rule used by the calculator.
If the box says BW bound, memory bandwidth dominates. If it says FLOPS bound, compute dominates. Effective TFLOPS should normally be lower than theoretical peak because real inference kernels are not ideal peak benchmarks.
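A minimal sketch of the rule, with hypothetical per-block values (the bandwidth is RTX 3090-class; the effective TFLOPS is a calibrated guess, not a peak number):

```python
# Simplified Roofline: block time is the max of memory time and compute time.
bytes_read = 6.45e9   # weight + KV bytes this block reads per token
flops      = 25.8e9   # FLOPs this block performs per token
mem_bw     = 936e9    # B/s, e.g. an RTX 3090-class GPU
eff_flops  = 40e12    # FLOP/s, calibrated effective throughput (not peak)

mem_time     = bytes_read / mem_bw   # seconds
compute_time = flops / eff_flops     # seconds
block_time   = max(mem_time, compute_time)
bound        = "BW bound" if mem_time >= compute_time else "FLOPS bound"
print(f"block ≈ {block_time * 1e3:.2f} ms ({bound})")  # ≈ 6.89 ms, BW bound
```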
5. Layer split chain
The calculator models contiguous layer blocks. GPU 1 may receive the first part of the network, GPU 2 the next part, GPU 3 the next part, and so on. For each generated token, the activation travels through the chain.
This is why the visual boxes show processing time per block and the lines show transfer time. If everything fits in VRAM, PCIe usually carries activations, not full model weights.
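A minimal sketch of the per-token latency for such a chain, with hypothetical block and transfer times:

```python
# Per-token latency in a layer-split chain: block times plus transfer hops.
block_times_ms    = [6.9, 6.9, 14.0]  # hypothetical GPU 1, GPU 2, CPU block
transfer_times_ms = [0.1, 0.3]        # activation hops between adjacent blocks

latency_ms = sum(block_times_ms) + sum(transfer_times_ms)
print(f"≈ {latency_ms:.1f} ms/token, ≈ {1000 / latency_ms:.1f} tok/s")
```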
6. Tensor split and tensor parallelism
Tensor split means a single layer's tensors are divided across multiple GPUs. Instead of GPU 1 owning layers 0 to 20 and GPU 2 owning layers 21 to 40, both GPUs participate inside the same layer. This can combine compute and memory capacity, but it also introduces frequent communication inside every layer.
Layer split
Communication happens mainly between layer blocks. It is usually one activation transfer per boundary per token. This calculator models this mode directly.
Tensor split
Communication can happen inside every layer, often through all-reduce, all-gather or reduce-scatter style operations. PCIe can become much more important.
How to use this calculator for tensor split: do not treat the result as exact. Treat it as a rough upper bound that is only approached when the interconnect is very fast and scaling is efficient. For consumer GPUs connected through PCIe, tensor split is often communication-limited, so real speed can be lower than a layer-split estimate.
| Situation | How to approximate |
|---|---|
| Runtime uses layer/block split | Use the calculator normally. Enter each GPU separately. |
| Runtime uses tensor split across PCIe GPUs | Use normal GPU values, but expect extra communication not fully modeled. Increase Activation Transfer and reduce Effective TFLOPS to calibrate. |
| Runtime uses tensor split over NVLink/high-speed interconnect | The model can be closer after calibration. Effective TFLOPS may be higher because all GPUs cooperate inside layers. |
| You want a quick virtual-GPU approximation | Sum VRAM, estimate combined TFLOPS and bandwidth with an efficiency factor, then calibrate (see the sketch below). The calculator keeps physical GPUs because it is primarily a layer-chain estimator. |
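A minimal sketch of that last approximation, with an assumed efficiency factor (the 0.7 is a guess to be tuned against a measured run):

```python
# Collapse two physical GPUs into one "virtual GPU" for a rough tensor-split
# estimate. All values are hypothetical; the efficiency factor is an assumption.
vram_gb    = [24, 24]      # per-GPU VRAM
tflops     = [40, 40]      # per-GPU effective TFLOPS
bw_gbs     = [936, 936]    # per-GPU memory bandwidth (GB/s)
efficiency = 0.7           # communication/imbalance penalty, assumed

virtual = {
    "vram_gb": sum(vram_gb),              # capacity genuinely adds up
    "tflops":  sum(tflops) * efficiency,  # compute scales imperfectly
    "bw_gbs":  sum(bw_gbs) * efficiency,  # so does usable bandwidth
}
print(virtual)  # enter these as a single GPU, then calibrate
```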
7. Parallel multi-session inference
A single chat session sends one token through the full chain, then the next token. Multiple sessions can be interleaved or batched, so several tokens from different sessions can occupy different stages of the GPU chain at the same time.
Latency estimate
For one interactive session, use the calculator normally. The token must pass through all blocks, so latency is the sum of block and transfer times.
Throughput estimate
For many sessions, steady-state throughput can become limited by the slowest stage rather than the sum of all stages, but only if batching/pipelining is actually used.
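A minimal sketch of the two regimes, with hypothetical stage times; the pipelined ceiling only applies if the engine actually overlaps stages:

```python
# Single-session latency sums all stages; pipelined throughput is limited by
# the slowest stage. Stage times below are hypothetical per-block values (ms).
stage_times_ms = [6.9, 6.9, 14.0]

single_session_tps = 1000 / sum(stage_times_ms)  # token walks every stage
pipelined_ceiling  = 1000 / max(stage_times_ms)  # stages kept busy in parallel
print(f"1 session ≈ {single_session_tps:.1f} tok/s, "
      f"pipelined ceiling ≈ {pipelined_ceiling:.1f} tok/s")
```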
How to fill the calculator for multiple sessions (a sketch follows the list):
- Multiply KV Cache by the number of active sessions or by the effective batch size.
- Increase Activation Transfer roughly with batch size because more activations cross the GPU chain.
- Raise Effective TFLOPS only if your engine really becomes more compute-efficient with batching.
- Interpret the top tokens/s metric as single-session decode speed unless you manually calibrate for batching.
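A minimal sketch of those adjustments, with hypothetical single-session values:

```python
# Scale single-session inputs for N interleaved sessions (rule of thumb only).
n_sessions           = 4
kv_cache_gb_single   = 1.1   # hypothetical single-session KV estimate
activation_mb_single = 0.5   # hypothetical per-token activation payload

kv_cache_gb = kv_cache_gb_single * n_sessions    # more cached context resident
transfer_mb = activation_mb_single * n_sessions  # more activations per hop
print(f"KV ≈ {kv_cache_gb:.1f} GB, activation transfer ≈ {transfer_mb:.1f} MB")
```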
8. PCIe, lanes and P2P
Each GPU has its own host link. The calculator converts PCIe generation and lane count into approximate one-direction payload bandwidth.
| Generation | Approx payload per lane per direction | x16 approximate |
|---|---|---|
| PCIe 3.0 | 0.985 GB/s | 15.75 GB/s |
| PCIe 4.0 | 1.969 GB/s | 31.50 GB/s |
| PCIe 5.0 | 3.938 GB/s | 63.01 GB/s |
| PCIe 6.0 | 7.563 GB/s | 121.01 GB/s |
PCIe matters little when only small activations move between a few GPU blocks. It matters a lot when CPU/RAM participates, P2P is disabled, activation size is large, or true tensor parallelism communicates inside every layer.
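A minimal sketch of the conversion, using the per-lane values from the table above and a hypothetical activation payload:

```python
# Approximate one-direction PCIe payload bandwidth and per-hop transfer time.
GB_S_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938, 6: 7.563}  # from table above

def transfer_ms(payload_mb: float, gen: int, lanes: int) -> float:
    bw_gb_s = GB_S_PER_LANE[gen] * lanes       # one-direction payload bandwidth
    return payload_mb / 1000 / bw_gb_s * 1000  # MB -> GB, seconds -> ms

# Hypothetical 0.5 MB activation hop over PCIe 4.0 x4:
print(f"{transfer_ms(0.5, 4, 4):.3f} ms")      # ≈ 0.063 ms
```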
9. CPU and RAM
In this calculator, CPU/RAM means spilled layers are executed by the CPU and read from system RAM. PCIe is used only when activations cross the boundary between GPU blocks and CPU blocks.
10. Reading results
- Single-session throughput: estimated tokens per second for one autoregressive session using the best estimated layer-chain hardware split.
- Latency: estimated single-token latency for one session.
- Layer Split: layer count assigned to each used device block.
- Placement Status: whether all layers fit into GPU VRAM or CPU/RAM is used.
- Box time: block processing time on that device, excluding the transfer times shown on the lines between boxes.
- BW bound / FLOPS bound: which part of the Roofline estimate dominates that block.
11. Calibration workflow
- Choose one real model, quantization and context length.
- Measure real decode tokens/s with your inference engine.
- Enter exact GPU VRAM, memory bandwidth and PCIe lane configuration.
- Adjust Effective TFLOPS and Activation Transfer until the estimate is close (a back-out sketch follows this list).
- Use the calibrated setup to compare GPU combinations, PCIe lane layouts and MoE/Dense alternatives.
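One way to seed the adjustment step, assuming the setup is bandwidth-bound: back out the bandwidth the run actually achieved, then compare it with the value you entered (hypothetical numbers below):

```python
# Bandwidth-bound decode implies: tok/s ≈ effective_bw / bytes_read_per_token.
measured_tps    = 28.0     # from your inference engine
bytes_per_token = 23.4e9   # active weight + KV bytes per token (estimated)

effective_bw_gb = measured_tps * bytes_per_token / 1e9
print(f"achieved ≈ {effective_bw_gb:.0f} GB/s")  # compare with entered value
# The same back-out works for Effective TFLOPS when a block is FLOPS bound.
```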
Estimate too fast?
Lower Effective TFLOPS, increase Activation Transfer, increase KV Cache, or account for framework overhead.
Estimate too slow?
Raise Effective TFLOPS, reduce Activation Transfer, verify P2P mode, or check quantization bytes.
12. Limits and expectations
This is a compact theoretical estimator. It does not fully simulate kernel launch overhead, CUDA graph behavior, NUMA, page faults, exact KV-cache kernels, tensor-parallel collectives, speculative decoding, continuous batching, prompt prefill, driver behavior, or engine-specific optimizations.
Good use cases
Hardware iteration, rough bottleneck discovery, checking if a model fits, comparing PCIe lane configurations, estimating whether an extra GPU helps, and explaining why a setup is bandwidth-bound or compute-bound.
Bad use cases
Claiming exact production throughput, modeling high-batch server inference without calibration, predicting prompt prefill speed, or treating tensor split over PCIe as identical to layer split.