Alignment Tax

Alignment tax is the term for the trade-off between safety properties and raw capability that often emerges when LLMs undergo RLHF, refusal training, Constitutional AI, or other safety-focused post-training. The trade-off was brought to wide attention by the InstructGPT paper (2022), which observed that safety-tuned models scored lower than their base models on some public benchmarks despite being far more useful as assistants.

The tax has several dimensions: over-refusal (declining benign requests), reduced creativity (hedging and disclaimer-heavy responses), capability regression on niche tasks under-represented in the safety training data, and increased verbosity.

The alignment tax can often be reduced, and in some settings nearly eliminated, through careful training data composition and curriculum design, and modern alignment techniques such as DPO, ORPO, and KTO have generally lowered it compared to early RLHF implementations. AI governance teams weigh the alignment tax explicitly during model selection, because a trade-off that is acceptable for consumer products may be unacceptable for specialized scientific or technical applications. The term itself remains contested: some researchers prefer "alignment dividend" to describe the helpfulness gains from alignment, arguing that the "tax" framing overstates the costs.
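
In practice, the tax is usually quantified as the score delta between a base model and its aligned counterpart on the same benchmark suite, alongside behavioral measures such as the over-refusal rate. A minimal bookkeeping sketch in TypeScript follows; the benchmark names and numbers are purely illustrative placeholders, not measurements of any real model.

```typescript
// Hypothetical benchmark scores (0-100) for a base model and its
// safety-tuned counterpart. Task names and values are illustrative only.
type Scores = Record<string, number>;

const baseScores: Scores    = { qa_f1: 88.1, trivia_em: 71.4, code_pass1: 34.0 };
const alignedScores: Scores = { qa_f1: 86.5, trivia_em: 70.9, code_pass1: 33.2 };

// Alignment tax per task: points the aligned model gives up relative to the
// base model. Negative values would indicate a gain (an "alignment dividend").
function alignmentTax(base: Scores, aligned: Scores): Scores {
  const tax: Scores = {};
  for (const task of Object.keys(base)) {
    tax[task] = base[task] - aligned[task];
  }
  return tax;
}

// Over-refusal rate: fraction of benign prompts the aligned model declined.
function overRefusalRate(refusedBenignPrompts: boolean[]): number {
  return refusedBenignPrompts.filter(Boolean).length / refusedBenignPrompts.length;
}

console.log(alignmentTax(baseScores, alignedScores));
// roughly { qa_f1: 1.6, trivia_em: 0.5, code_pass1: 0.8 }, floating-point rounding aside
console.log(overRefusalRate([false, false, true, false])); // 0.25
```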

Capability-vs-safety trade-offs through Centralpoint: Centralpoint lets you route different workloads to different LLMs (strictly policy-aligned models for customer-facing chatbots, more capable variants for internal tooling) within one model-agnostic stack. Tokens are metered per skill, prompts stay local, and audience-aware chatbots deploy through one line of JavaScript on any portal.
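
The routing pattern described above can be sketched in a few lines. The TypeScript below is a generic illustration, not the Centralpoint API; every identifier and model name in it is hypothetical.

```typescript
// Hypothetical registry: a strictly aligned model for external traffic,
// a more capable, less heavily filtered model for internal tooling.
type Audience = "customer" | "internal";

const MODELS: Record<Audience, string> = {
  customer: "aligned-chat-v2",  // placeholder model name
  internal: "capable-base-v2",  // placeholder model name
};

interface LlmRequest {
  audience: Audience;
  skill: string;   // used for per-skill token metering
  prompt: string;
}

// `callModel` stands in for whatever LLM client the surrounding stack provides.
async function route(
  req: LlmRequest,
  callModel: (model: string, prompt: string) => Promise<{ text: string; tokens: number }>,
  usageBySkill: Map<string, number>,
) {
  const model = MODELS[req.audience];              // pick model by audience
  const { text, tokens } = await callModel(model, req.prompt);
  usageBySkill.set(req.skill, (usageBySkill.get(req.skill) ?? 0) + tokens);
  return { model, text };
}
```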


Related Keywords:
Alignment Tax