Jaccard Similarity
Jaccard similarity measures the overlap between two sets as the size of their intersection divided by the size of their union, producing a value between 0 (disjoint sets) and 1 (identical sets). The metric is a natural choice for comparing sparse representations such as keyword sets, n-gram shingles, and bag-of-words document signatures. It is not typically applied directly to dense neural embeddings, which are compared with cosine, dot-product, or Euclidean distance, but it remains foundational in document de-duplication, plagiarism detection, and sparse retrieval systems, including parts of SPLADE. Jaccard similarity also underlies MinHash locality-sensitive hashing (LSH), which approximates the metric with small fixed-size signatures that scale to billion-document corpora. AI governance teams use Jaccard-based deduplication in training-data curation pipelines to identify and remove near-duplicates that would otherwise distort model training and bias evaluation. Modern RAG systems often combine sparse Jaccard-style retrieval with dense vector retrieval through Reciprocal Rank Fusion for best-of-both-worlds hybrid search.
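As a concrete illustration, the sketch below computes exact Jaccard similarity over word-level shingle sets and then estimates the same value from fixed-size MinHash signatures. The shingle size, number of hash functions, and use of seeded BLAKE2 hashes are illustrative choices, not a specific production implementation.

```python
import hashlib
import re

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

def shingles(text: str, k: int = 3) -> set:
    """Word-level k-gram shingles of a document (k = 3 is an example choice)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value seen over the set's elements."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big")
            for item in items))
    return sig

def minhash_estimate(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two near-duplicate sentences differing in one word share most of their shingles, so both the exact metric and the MinHash estimate land well above zero but below one; the estimate's error shrinks as the signature length grows.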
Jaccard + Centralpoint for hybrid retrieval: Centralpoint supports hybrid retrieval combining Jaccard-style sparse search with dense vector search through Reciprocal Rank Fusion, all governed under one model-agnostic platform. Tokens are metered per skill, prompts stay local, and hybrid-search chatbots embed across portals with one line of JavaScript.
Related Keywords:
Jaccard Similarity