GPTQ

GPTQ is a one-shot post-training quantization technique introduced by Frantar et al. in 2022 that produces 3-bit or 4-bit quantized LLMs with minimal accuracy loss by using approximate second-order information (a Hessian approximation derived from calibration data). It quantizes the model layer by layer, adjusting each layer's remaining weights to minimize the squared error of that layer's outputs on the calibration set rather than the error of the weights themselves. The resulting models are typically within 1-2 perplexity points of their FP16 baselines on common language modeling benchmarks while using only 25-30% of the storage. AutoGPTQ and GPTQ-for-LLaMa are the standard open-source implementations, integrated with vLLM, TensorRT-LLM, Hugging Face Transformers, and ExLlamaV2 (a high-performance GPTQ inference engine). GPTQ was the dominant quantization approach in early-to-mid 2023, before AWQ emerged as a faster and slightly higher-quality alternative for most use cases; both techniques remain in production use depending on hardware support, available tooling, and target model architectures. AI governance teams document the quantization method and calibration dataset as part of compliance and model lineage records.
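Because GPTQ optimizes layer outputs against calibration data, producing a GPTQ checkpoint means handing a calibration dataset to the quantizer. The sketch below shows one common workflow through the Hugging Face Transformers integration mentioned above (which relies on the Optimum and AutoGPTQ packages under the hood); the model ID, calibration dataset choice, and output directory are illustrative assumptions, not values tied to any particular deployment.

```python
# Minimal sketch: quantize a causal LM to 4-bit GPTQ via Hugging Face Transformers.
# Requires the optimum and auto-gptq packages and a CUDA GPU; the model ID and
# output path below are hypothetical examples chosen only for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model used here purely as an example
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The calibration dataset drives the layer-output error that GPTQ minimizes;
# "c4" is one of the built-in calibration options accepted by GPTQConfig.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with a quantization_config runs GPTQ layer by layer during load.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Persist the quantized weights and tokenizer for later inference
# (e.g. with vLLM, ExLlamaV2, or Transformers itself).
model.save_pretrained("opt-125m-gptq-4bit")
tokenizer.save_pretrained("opt-125m-gptq-4bit")
```

For governance purposes, the bit width, group size, and calibration dataset used in this step are exactly the details teams typically record for lineage.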

GPTQ-quantized models with Centralpoint: Centralpoint supports GPTQ-quantized models alongside AWQ, GGUF, and full-precision variants in one model-agnostic stack. Tokens are metered per skill, prompts stay local, and the platform supports generative and embedding models and deploys chatbots through one line of JavaScript with audit-ready governance.

