Multi-Query Attention

Multi-Query Attention, abbreviated MQA, is an attention variant introduced by Shazeer in a 2019 paper in which all query heads share a single key projection and a single value projection; it is the extreme case of what was later generalized as Grouped-Query Attention (GQA). Because only one key head and one value head must be cached per layer, MQA reduces KV cache memory by a factor of N (the number of query heads), dramatically lowering inference memory and accelerating autoregressive generation, at the cost of some quality loss compared to full multi-head attention (MHA). The technique was used in PaLM and several other Google models before being refined into GQA, which preserves more quality at a somewhat higher KV cache cost. MQA remains in use in some deployed models, including early Falcon variants. For decoder-only generative LLMs in 2024-2025, Grouped-Query Attention with 8 key/value groups has largely displaced both pure MQA and full MHA as the preferred balance of quality and efficiency. AI governance teams typically document the attention configuration as part of a model's architecture lineage. Both MQA and GQA make long-context inference economically viable in ways that full multi-head attention cannot.
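To make the structural difference from standard multi-head attention concrete, below is a minimal PyTorch sketch of an MQA layer. The class, parameter, and variable names (MultiQueryAttention, d_model, n_heads) are illustrative assumptions, not taken from any specific model's implementation. The essential point is that the key and value projections produce a single head of width d_head, which every query head shares, so the per-token KV cache is 1/n_heads the size of full MHA's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Self-attention in which all query heads share one K head and one V head.
    Illustrative sketch only, not any particular model's implementation."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # N query heads, projected as usual.
        self.q_proj = nn.Linear(d_model, d_model)
        # A single shared key head and value head: these projections are
        # d_head wide instead of d_model wide, so the per-token KV cache
        # shrinks by a factor of n_heads relative to full MHA.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Queries: (B, n_heads, T, d_head)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Shared K/V: (B, 1, T, d_head), expanded across heads as a
        # zero-copy view -- every query head attends over the same tensors.
        k = self.k_proj(x).unsqueeze(1).expand(B, self.n_heads, T, self.d_head)
        v = self.v_proj(x).unsqueeze(1).expand(B, self.n_heads, T, self.d_head)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(B, T, -1))

# Usage: 8 query heads but only one cached K/V head per layer.
x = torch.randn(2, 16, 512)
mqa = MultiQueryAttention(d_model=512, n_heads=8)
print(mqa(x).shape)  # torch.Size([2, 16, 512])

Generalizing k_proj and v_proj to produce G heads instead of 1 yields GQA; G equal to n_heads recovers full MHA, which is why MQA can be viewed as GQA's extreme case.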

MQA-based models in Centralpoint: Centralpoint coordinates models using MQA, GQA, or full multi-head attention in a model-agnostic platform with consistent metering. The stack keeps prompts local, supports both generative and embedding models, and deploys chatbots through one line of JavaScript on any portal.


Related Keywords:
Multi-Query Attention