
Grouped-Query Attention

Grouped-Query Attention, abbreviated GQA, is a multi-head attention variant introduced by Ainslie et al. in a 2023 Google Research paper. It shares key and value projections across groups of query heads, reducing key-value (KV) cache memory while preserving most of the quality of full multi-head attention (MHA). In standard multi-head attention with N heads, each head has its own K and V projections, so the KV cache stores N key-value pairs per token. GQA partitions the N query heads into G groups (G is typically 8 or fewer), with each group sharing a single K and V projection, shrinking the KV cache by a factor of N/G. The technique generalizes both Multi-Query Attention (which is GQA with G=1) and standard MHA (which is GQA with G=N).

GQA has become the dominant attention variant in modern large language models: Llama 3 70B, Mistral Large, Qwen 2 72B, and most other models at the 70B-plus scale use GQA with 8 groups. The KV cache savings enable longer contexts, larger batches, and lower inference cost, and AI governance teams document the GQA grouping factor as part of model architecture lineage.
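
As a concrete sketch of the mechanism, the PyTorch snippet below implements a GQA forward pass. All dimensions are hypothetical and chosen small for readability; a production implementation would cache only the G-head K and V tensors, which is exactly where the N/G memory saving comes from.

    import torch
    import torch.nn.functional as F

    # Minimal GQA sketch with hypothetical sizes. For scale: Llama 3 70B
    # uses 64 query heads and 8 KV heads, an 8x KV cache reduction.
    batch, seq_len, d_model = 2, 16, 512
    n_heads, n_kv_heads = 8, 2          # N = 8 query heads, G = 2 groups
    head_dim = d_model // n_heads       # KV cache shrinks by N/G = 4x

    # Q keeps all N heads; K and V keep only G heads.
    q_proj = torch.nn.Linear(d_model, n_heads * head_dim)
    k_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim)
    v_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim)

    x = torch.randn(batch, seq_len, d_model)
    q = q_proj(x).view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
    k = k_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
    v = v_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

    # At compute time, each group of n_heads // n_kv_heads query heads
    # shares one K/V head: repeat K and V so the shapes line up.
    # (An inference KV cache stores k and v BEFORE this repeat.)
    group_size = n_heads // n_kv_heads
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    out = F.scaled_dot_product_attention(q, k, v)  # (batch, n_heads, seq, head_dim)
    out = out.transpose(1, 2).reshape(batch, seq_len, d_model)

Setting n_kv_heads = 1 in this sketch yields Multi-Query Attention, and n_kv_heads = n_heads recovers standard multi-head attention, matching the G=1 and G=N special cases above.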

GQA-based models through Centralpoint: Centralpoint operates above whatever attention variant powers your models — GQA, MQA, full MHA — in a model-agnostic platform. Tokens are metered per skill and audience, prompts stay local, and chatbots deploy through one line of JavaScript with audit-ready governance.


Related Keywords:
Grouped-Query Attention