
Tensor Parallelism

Tensor parallelism is a distributed training technique that splits individual layer computations across multiple GPUs, typically within a single node where high-bandwidth interconnects such as NVLink make frequent cross-GPU communication economical. The standard approach, introduced in NVIDIA's Megatron-LM paper (2019), shards the attention heads and MLP weight matrices across the tensor-parallel ranks; by splitting the first MLP matrix column-wise and the second row-wise, each layer needs only one all-reduce per forward and backward pass. Tensor parallelism complements pipeline parallelism (which splits the model across layers) and data parallelism (which replicates the model), and at frontier scale all three are combined in 3D parallelism configurations. The technique is also particularly important for inference of very large models: vLLM, TensorRT-LLM, and other serving frameworks use tensor parallelism to fit models larger than a single GPU's memory while keeping latency low. Typical tensor-parallel sizes are 2, 4, or 8 GPUs, matching a single NVLink-connected node. AI governance teams encounter tensor parallelism in both training and inference infrastructure documentation. The technique requires careful attention to all-reduce communication patterns and is sensitive to the underlying network topology.
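To make the Megatron-style split concrete, here is a minimal sketch of a tensor-parallel MLP shard in PyTorch. It is illustrative only: it assumes a torch.distributed process group has already been initialized (for example via torchrun), and the class name TensorParallelMLP and its parameters are hypothetical, not taken from Megatron-LM or any other library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist


class TensorParallelMLP(nn.Module):
    """One tensor-parallel rank's shard of a Megatron-style MLP.

    The first projection is split column-wise (each rank owns a slice of
    the FFN activations) and the second row-wise (each rank produces a
    partial sum of the output), so a single all-reduce per forward pass
    restores the full result.
    """

    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0, "FFN dim must divide the TP size"
        shard = ffn_size // world_size
        # Column-parallel: each rank holds a (shard, hidden_size) slice.
        self.fc1 = nn.Linear(hidden_size, shard)
        # Row-parallel: each rank holds a (hidden_size, shard) slice.
        # Bias is omitted so it is not added once per rank before reduction.
        self.fc2 = nn.Linear(shard, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes a partial output from its weight shards ...
        partial = self.fc2(F.gelu(self.fc1(x)))
        # ... and one all-reduce sums the partials across the TP group.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

On the serving side the same idea is usually exposed as a single knob; in vLLM, for example, passing tensor_parallel_size=4 to the LLM constructor shards the model across four GPUs.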

Tensor-parallel models through Centralpoint: Centralpoint sits above whatever serving infrastructure runs your models — vLLM with tensor parallelism, TensorRT-LLM, hosted APIs — in a model-agnostic stack. Tokens are metered per skill, prompts stay local, and chatbots deploy through one line of JavaScript on any portal.

