WordPiece

WordPiece is a subword tokenization algorithm introduced by Google in 2012 and popularized by BERT in 2018. It is similar to BPE but uses a different merge criterion: candidate merges are scored by how much they increase the likelihood of the training corpus under a language model, rather than by raw pair frequency. WordPiece marks subword continuations with a special prefix (typically ##) so that word-internal pieces can be rejoined into words during detokenization — for example, "unbelievable" might tokenize as ["un", "##believ", "##able"]. BERT, DistilBERT, ELECTRA, and many other transformer encoder models use WordPiece tokenizers, typically with vocabularies of around 30,000 tokens. WordPiece is less common in modern decoder-only generative LLMs, which have largely adopted byte-level BPE or SentencePiece, but it remains widely deployed in embedding models, including the Sentence-BERT family. AI governance teams documenting embedding pipelines track the tokenizer because changing it requires retraining or revalidating every downstream component. WordPiece's deterministic behavior and the ## continuation convention make tokenized text easy to inspect during AI compliance audits.
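At inference time, WordPiece tokenizes each word by greedy longest-match-first lookup against its vocabulary: it takes the longest prefix that is in the vocabulary, marks the remainder's pieces with ##, and falls back to an unknown token when no piece matches. The following is a minimal sketch of that matching loop, using a tiny hypothetical vocabulary for illustration (real tokenizers also handle pre-tokenization, casing, and punctuation splitting, which are omitted here):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring from the right until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece of the word matched; the whole word maps to [UNK].
            return [unk]
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary (illustrative only, not BERT's actual 30k-token vocabulary)
vocab = {"un", "##believ", "##able", "believe"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```

Because matching is greedy and the vocabulary is fixed, the same input always yields the same token sequence, which is the deterministic behavior auditors rely on when inspecting tokenized text.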

WordPiece in Centralpoint embedding pipelines: Centralpoint supports WordPiece-tokenized BERT-family embedding models alongside BPE- and SentencePiece-based models, all in one model-agnostic stack. The platform meters tokens, keeps prompts local, and deploys embedding-aware chatbots through one line of JavaScript, with full audit logs for AI compliance.


Related Keywords:
WordPiece