
Subword Tokenization

Subword tokenization is the approach of splitting text into units smaller than whole words but larger than individual characters; it is the dominant tokenization strategy in modern LLMs. Common subword algorithms include Byte-Pair Encoding (BPE), WordPiece, the Unigram language model (as implemented in SentencePiece), and byte-level BPE, each producing slightly different segmentations. Subword tokenization solves several problems that whole-word tokenization cannot: it handles out-of-vocabulary words gracefully by decomposing them into known fragments, keeps vocabulary size manageable (typically 30,000 to 200,000 tokens), and adapts naturally to morphologically rich languages where stems and affixes recur. The trade-off is that single semantic units sometimes split unintuitively: for example, "OpenAI" might tokenize as ["Open", "AI"] or ["Open", "A", "I"] depending on the tokenizer. AI governance teams pay attention to subword tokenization for multilingual fairness, because non-English languages often produce more tokens per character, inflating cost and reducing effective context length. The choice of tokenizer is fixed before training, and AI compliance documentation captures it alongside training data and model architecture.
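To make the decomposition concrete, the short Python sketch below uses the tiktoken library (one of the tokenizers mentioned in this entry) to split words into subword pieces. The "cl100k_base" encoding is a real tiktoken vocabulary, but the exact splits are tokenizer-specific and may differ from the bracketed examples above.

    # A minimal sketch using tiktoken (pip install tiktoken).
    # "cl100k_base" is one of tiktoken's byte-level BPE vocabularies;
    # other tokenizers will split the same words differently.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["tokenization", "OpenAI", "antidisestablishmentarianism"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        # Each word maps to one or more subword token IDs; rare words
        # decompose into several known fragments instead of failing as
        # out-of-vocabulary.
        print(f"{word!r} -> {len(ids)} tokens: {pieces}")

Running the loop prints each word's token count and the recovered subword pieces, showing how longer or rarer words decompose into more fragments.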

Subword tokenization across Centralpoint models: Centralpoint handles per-model subword tokenization transparently (tiktoken for OpenAI, Anthropic's proprietary tokenizer for Claude, SentencePiece for Gemini, custom BPE for Llama) and meters tokens uniformly so cost reporting stays consistent. The model-agnostic platform keeps prompts local and deploys chatbots across portals with a single line of JavaScript.
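As an illustration of why uniform metering matters, here is a hypothetical Python sketch of per-model tokenizer dispatch. The count_tokens helper and the model-prefix routing are illustrative assumptions, not Centralpoint's actual API; only the tiktoken calls are real library functions.

    import tiktoken

    # Hypothetical per-model token counter behind a single metering
    # interface. The same text can yield different counts under
    # different tokenizers, which is why a uniform counting layer
    # keeps cost reports comparable across providers.
    def count_tokens(model: str, text: str) -> int:
        if model.startswith("gpt-"):
            enc = tiktoken.encoding_for_model(model)  # real tiktoken call
            return len(enc.encode(text))
        # Other providers (Claude, Gemini, Llama) would dispatch to
        # their own tokenizers here in a full implementation.
        raise NotImplementedError(f"no tokenizer registered for {model}")

For example, count_tokens("gpt-4", "Subword tokenization") returns the tiktoken count for that model, while a Gemini branch would route the same text through SentencePiece and could return a different number.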

