ZeRO
ZeRO, short for Zero Redundancy Optimizer, is a memory optimization technique introduced by Microsoft Research in 2019 that shards optimizer states, gradients, and (optionally) model parameters across data-parallel GPU ranks, eliminating the memory redundancy of traditional data parallelism, in which every rank holds a full copy of all three. ZeRO comes in three stages: Stage 1 shards optimizer states (the largest memory savings for the least implementation cost), Stage 2 additionally shards gradients, and Stage 3 additionally shards parameters (equivalent to
FSDP, PyTorch's Fully Sharded Data Parallel). DeepSpeed's ZeRO implementation underpinned the training of Microsoft's 17-billion-parameter Turing-NLG and, in combination with Megatron-style tensor and pipeline parallelism, the 530-billion-parameter Megatron-Turing NLG; the ZeRO paper's memory analysis projects that full parameter sharding (Stage 3) can scale to trillion-parameter models. ZeRO-Offload extends the technique by moving optimizer states (and the optimizer update) to CPU memory, and ZeRO-Infinity adds NVMe storage as a third tier. The ZeRO family of techniques transformed large-scale training by making frontier-scale models trainable on commodity GPU clusters rather than requiring custom hardware. AI governance teams document the ZeRO stage and offloading configuration as part of their training infrastructure lineage.
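In practice, the stage and offloading choices that governance teams record live in the training stack's configuration. Below is a minimal sketch assuming DeepSpeed: the key names follow DeepSpeed's documented ZeRO config schema ("zero_optimization", "offload_optimizer", "offload_param"), while the toy model, learning rate, and NVMe path are illustrative, and a real run would be started with the deepspeed launcher across multiple ranks.

```python
# Sketch: selecting a ZeRO stage plus CPU/NVMe offload in a DeepSpeed config.
# Key names follow DeepSpeed's ZeRO config schema; model and paths are illustrative.
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,  # 1 = optimizer states, 2 = + gradients, 3 = + parameters
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload: optimizer states on CPU
        "offload_param": {  # ZeRO-Infinity: parameters spill to NVMe (Stage 3 only)
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
}

# Wraps the model in DeepSpeed's ZeRO-sharded data-parallel engine.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Because the ZeRO stage and offload targets are captured in a single config object, checking that file into the training run's lineage record is a straightforward way to satisfy the documentation practice described above.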
ZeRO-trained models in Centralpoint: Centralpoint sits above whatever distributed training stack produced your models (DeepSpeed ZeRO, PyTorch FSDP, Megatron) and meters usage consistently across the LLM fleet. The model-agnostic platform keeps prompts local and deploys chatbots through one line of JavaScript, with audit-ready governance.
Related Keywords:
ZeRO