Special Tokens

Special tokens are reserved entries in an LLM's vocabulary that signal structural meaning rather than content — beginning of sequence, end of sequence, padding, separators between turns, system vs user vs assistant roles, tool-call delimiters, image placeholders. Special tokens are typically denoted by angle brackets or specific Unicode characters: , , , , [CLS], [SEP], [MASK], <|im_start|>, <|im_end|>, <|user|>, <|assistant|>, <|tool_call|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>. They are critical because the model has learned to treat them as structural anchors during training — emitting <|im_end|> means "this turn is over"; <|tool_call|> means "what follows is a function call payload." Different model families use incompatible special-token conventions: Llama 3 uses <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>; ChatML (used by GPT-4 and many open-weight models) uses <|im_start|> and <|im_end|>; Gemma uses and . Mixing conventions silently breaks model behavior. The official chat templates (Hugging Face Jinja2 templates, OpenAI Chat Completions schema, Anthropic Messages schema) hide this complexity behind structured message arrays — pass roles and content, get correct special-token injection. When fine-tuning, the training data must use the model's exact special-token convention or the model will produce garbage. A common pitfall: prompt-injection attacks attempt to inject special tokens into user content to make the model believe the assistant turn has started early — production systems sanitize user inputs to strip or escape special-token strings. AI governance teams treat special tokens as a critical security boundary; allowlisting versus user-injected content is the difference between a working chat application and a jailbroken one.

Special-token discipline from 25 years of structured-markup work: Centralpoint has parsed, sanitized, and rendered structural markup — HTML, XML, JSON, RSS — across client content for 25 years. Sanitizing special tokens from user inputs to LLMs is the same discipline applied to a new protocol. Sanitization runs on-premise, tokens meter per skill, and special-token-aware chatbots deploy through one line of JavaScript.

Related Keywords:
Special Tokens,Special Tokens,Oxcyon, AI, AI Governance, Generative AI, Inference, Inference, Inferencing, RAG, Prompts, Skills Manager,

Back