Vocabulary

The vocabulary of an LLM is the fixed set of tokens the model can encode and emit, typically about 30,000 to 256,000 entries, produced by training a tokenization algorithm such as BPE or WordPiece on a text corpus. Vocabulary choice is a foundational architectural decision. It sets the embedding matrix size (vocab_size × hidden_dim, often the largest single weight matrix in the model: in Llama 3 70B, a 128K vocabulary × 8192 hidden dimension gives more than 1B input-embedding parameters alone), governs cross-lingual efficiency (a small vocabulary forces non-English text into many tokens), and determines which content is represented natively and which must be reconstructed from byte fragments.

Vocabulary sizes have grown over time: BERT (2018) used 30K WordPiece tokens, GPT-2 (2019) used about 50K byte-level BPE tokens, GPT-3 (2020) kept the same 50K vocabulary, GPT-4 moved to about 100K (cl100k_base), GPT-4o jumped to about 200K (o200k_base), and Gemma went further to 256K. The trade-off is straightforward: a larger vocabulary means more parameters in the embedding layer and output projection (cost) but fewer tokens per piece of text (efficiency at inference). Multilingual models benefit disproportionately from large vocabularies because they need coverage across many scripts.

A vocabulary cannot be easily extended after training: adding new tokens requires retraining, or careful initialization of the new embedding rows plus continued pretraining. This is why specialized vocabularies for medical text (PubMedBERT; BioBERT, by contrast, kept BERT's original vocabulary), legal text, code (StarCoder and DeepSeek-Coder train their own tokenizers on code-heavy corpora, while CodeLlama largely reuses the Llama 2 tokenizer), or specific languages typically ship with separately trained models. A practical inspection, assuming access to the gated repository: from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B'); print(tok.vocab_size, len(tok)); print(list(tok.get_vocab().items())[:20]). Here tok.vocab_size reports the base vocabulary, while len(tok) also counts added tokens. AI governance teams record the vocabulary version alongside the model in the model registry because a tokenizer-model mismatch silently corrupts behavior in ways that are hard to debug.
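The embedding-size arithmetic above (vocab_size × hidden_dim) can be checked with a few lines of Python. This is a minimal sketch: the function name embedding_params and the tied flag are illustrative assumptions, and 128,256 is used as the full Llama 3 tokenizer length behind the rounded "128K" figure.

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool = True) -> int:
    """Count parameters in the token embedding matrix.

    If the output projection is not tied to the input embedding,
    the model carries a second vocab_size x hidden_dim matrix.
    """
    matrix = vocab_size * hidden_dim
    return matrix if tied else 2 * matrix


# Assumed sizes for illustration: a 128,256-entry vocabulary and a
# hidden dimension of 8192, compared with a 32,000-entry vocabulary.
large = embedding_params(128_256, 8192)
small = embedding_params(32_000, 8192)

print(f"128K vocab: {large:,} parameters")
print(f"32K vocab:  {small:,} parameters")
print(f"ratio:      {large / small:.2f}x")
```

The same hidden dimension with a 32K vocabulary needs roughly a quarter of the embedding parameters, which is the cost side of the trade-off described above.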
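To make "careful initialization" concrete: one common heuristic, offered here as an assumption rather than a prescribed method, is to start each new token's embedding at the mean of the existing rows before continued pretraining. The sketch below uses plain Python lists so it runs without any ML library; extend_embeddings is a hypothetical helper name.

```python
from typing import List


def extend_embeddings(embeddings: List[List[float]], n_new: int) -> List[List[float]]:
    """Return a new table with n_new rows appended.

    Each new row is initialized to the column-wise mean of the
    existing rows; the input table is not modified.
    """
    if not embeddings:
        raise ValueError("need at least one existing embedding row")
    dim = len(embeddings[0])
    count = len(embeddings)
    mean = [sum(row[j] for row in embeddings) / count for j in range(dim)]
    # Copy existing rows and append independent copies of the mean row.
    return [row[:] for row in embeddings] + [mean[:] for _ in range(n_new)]


# Toy table: 2 tokens, hidden dimension 2 (illustrative values only).
table = [[1.0, 2.0], [3.0, 4.0]]
extended = extend_embeddings(table, n_new=2)
print(len(extended), extended[-1])
```

The new rows still carry no token-specific meaning, so this only gives continued pretraining a reasonable starting point; it does not replace it.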
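One way a registry could record a "vocabulary version" is a content hash of the token-to-id mapping, checked before a model is served. The following is a sketch under stated assumptions: vocab_fingerprint and matches_registry are hypothetical helpers, and the toy dictionaries stand in for the mapping a real tokenizer would return.

```python
import hashlib
import json
from typing import Dict


def vocab_fingerprint(vocab: Dict[str, int]) -> str:
    """SHA-256 over a canonical JSON dump of the token -> id mapping."""
    # sort_keys makes the dump independent of dictionary insertion order.
    canonical = json.dumps(vocab, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_registry(vocab: Dict[str, int], recorded: str) -> bool:
    """True if the tokenizer vocabulary matches the recorded fingerprint."""
    return vocab_fingerprint(vocab) == recorded


# Toy vocabularies for illustration only.
trained_with = {"<pad>": 0, "hello": 1, "world": 2}
deployed_with = {"<pad>": 0, "world": 1, "hello": 2}  # same tokens, swapped ids

recorded = vocab_fingerprint(trained_with)
print(matches_registry(trained_with, recorded))
print(matches_registry(deployed_with, recorded))
```

Two vocabularies with the same tokens but different ids produce different fingerprints, which is exactly the kind of mismatch that would otherwise pass unnoticed.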

Vocabulary management from 25 years of controlled-vocabulary work: Centralpoint has managed controlled vocabularies (taxonomies, audience tags, regulatory codes, classification schemes) for 25 years of enterprise content. Tokenizer vocabularies extend that discipline to model artifacts. Vocabularies are version-controlled on-premise, tokens are metered per skill, and vocabulary-aware chatbots deploy through one line of JavaScript.

Related Keywords:
Vocabulary, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager
