WordPiece
WordPiece is a subword tokenization algorithm introduced by Google in 2012 and popularized by BERT in 2018. It is similar to BPE but uses a different merge criterion: candidate pairs are scored by how much they increase the likelihood of the training corpus rather than by raw frequency. WordPiece marks word-internal continuations with a special prefix (typically ##) so that subword sequences can be joined back into the original words unambiguously; for example, "unbelievable" might tokenize as ["un", "##believ", "##able"]. BERT, DistilBERT, ELECTRA, and many other transformer encoder models use WordPiece tokenizers, typically with vocabularies of around 30,000 tokens. (RoBERTa, despite its BERT lineage, uses byte-level BPE instead.) WordPiece is less common in modern decoder-only generative LLMs, which have largely adopted byte-level BPE or SentencePiece, but it remains widely deployed in embedding models, including the Sentence-BERT family. AI governance teams documenting embedding pipelines track the tokenizer because changing it invalidates stored vectors and requires retraining or re-indexing every downstream component. WordPiece's deterministic behavior and the ## continuation convention also make tokenized text easy to inspect during AI compliance audits.
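The mechanics are easy to demonstrate. The sketch below is a toy, not any library's implementation: the six-entry vocabulary and the names merge_score, wordpiece_tokenize, and detokenize are hypothetical. It shows the likelihood-based merge score that distinguishes WordPiece training from BPE, and the greedy longest-match-first inference step that produces ## continuation tokens and makes reversal trivial.

```python
# Toy WordPiece sketch; vocabulary and function names are hypothetical.

# Training side: unlike BPE, which merges the most frequent adjacent pair,
# WordPiece merges the pair that most increases corpus likelihood, which
# reduces to this score (pair count normalized by its parts' counts).
def merge_score(pair_count: int, left_count: int, right_count: int) -> float:
    return pair_count / (left_count * right_count)

# Inference side: greedy longest-match-first segmentation against a fixed
# vocabulary, marking word-internal continuations with '##'.
VOCAB = {"un", "##believ", "##able"}

def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:  # shrink the candidate until it is in the vocab
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # word cannot be covered by the vocabulary
        tokens.append(piece)
        start = end
    return tokens

def detokenize(tokens: list[str]) -> str:
    # The '##' convention makes reversal trivial: strip prefixes and join.
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)

print(wordpiece_tokenize("unbelievable", VOCAB))  # ['un', '##believ', '##able']
print(detokenize(["un", "##believ", "##able"]))   # unbelievable
```

Running this prints the segmentation and then the reconstructed word, mirroring the example above; a production pipeline would use a maintained implementation such as the Hugging Face tokenizers library rather than a toy like this.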
WordPiece in Centralpoint embedding pipelines: Centralpoint supports WordPiece-tokenized BERT-family embedding models alongside BPE- and SentencePiece-based models in one model-agnostic stack. The platform meters tokens, keeps prompts local, and deploys embedding-aware chatbots through a single line of JavaScript, with full audit logs for AI compliance.
Related Keywords:
WordPiece