Chunking

Chunking is the deceptively important preprocessing step in RAG where source documents are split into smaller passages that fit within an embedding model's context window and produce coherent retrieval units. The simplest approach, fixed-size chunking, splits documents every N tokens (typically 256, 512, or 1024) with some overlap (usually 10-20%) to avoid losing context at boundaries. More sophisticated approaches include recursive character splitting (LangChain's RecursiveCharacterTextSplitter tries a prioritized list of separators, by default \n\n, then \n, then a space, then individual characters, until each piece fits the size limit), semantic chunking (splits where embedding similarity drops between consecutive sentences), and structural chunking (splits at markdown headers, HTML tags, or function definitions in code).

For PDFs, the document parser matters as much as the splitter: tools such as Unstructured.io, LlamaParse, Reducto, and Docling preserve tables, lists, and headers better than naive PDF extractors. A recipe that works well in practice: use markdown-aware splitting for technical docs, set chunk size to 512 tokens with a 64-token overlap, attach metadata (source URL, section title, page number, last-modified date) to every chunk, and version the chunking strategy so reindexing is reproducible.

Chunking strategy directly determines retrieval quality: chunks that are too small lose context, while chunks that are too large dilute relevance and waste tokens. AI governance teams version chunking strategies because changing the splitter silently shifts what the LLM "knows" and breaks downstream evaluation baselines.
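The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: it approximates tokens with whitespace-separated words (a real pipeline would count tokens with the embedding model's own tokenizer), it treats any line starting with # as a header (so it would mis-split fenced code containing # comments), and the names Chunk, chunk_markdown, and CHUNKER_VERSION are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical version label; bump it whenever the splitting logic changes
# so that a reindex can be traced back to the exact strategy used.
CHUNKER_VERSION = "md-fixed-v1"


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_by_headers(markdown: str) -> list[tuple[str, str]]:
    """Return (section_title, body) pairs, splitting at markdown headers."""
    sections: list[tuple[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


def window(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size windows where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the section
    return windows


def chunk_markdown(markdown: str, source_url: str, size: int = 512,
                   overlap: int = 64, extra: dict | None = None) -> list[Chunk]:
    """Header-aware splitting, then fixed-size windows within each section."""
    chunks = []
    for title, body in split_by_headers(markdown):
        tokens = body.split()  # assumption: whitespace words stand in for tokens
        for index, piece in enumerate(window(tokens, size, overlap)):
            metadata = {
                "source_url": source_url,
                "section_title": title,
                "chunk_index": index,
                "chunker_version": CHUNKER_VERSION,
            }
            # Page number, last-modified date, etc. can be passed via `extra`.
            metadata.update(extra or {})
            chunks.append(Chunk(text=" ".join(piece), metadata=metadata))
    return chunks
```

Calling chunk_markdown(doc_text, "https://example.com/doc") returns a list of chunks that never cross a header boundary, each carrying its section title and the strategy version. Storing the version alongside every chunk makes it possible to tell which index entries were produced by which splitter when evaluation results change.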

Chunking informed by 25 years of content parsing: Centralpoint inherited its document parsing logic from Oxcyon's 25-year history of ingesting Word, Excel, PDF, HTML, and database content for CMS clients, so it can chunk a regulatory filing, a clinical guideline, or a Congressional record without losing structure. Chunks stay on-premises, token usage is metered per skill, and chunk-grounded chatbots deploy across portals through one line of JavaScript.

Related Keywords:
Chunking, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager
