MinHash

MinHash is a locality-sensitive hashing technique introduced by Andrei Broder in 1997 and used at AltaVista to detect near-duplicate web pages. It remains the workhorse of large-scale near-duplicate detection in deduplication, plagiarism detection, recommendation systems, and similarity search across collections too large for pairwise comparison.

The core insight: represent each document as a set of shingles (overlapping k-character or k-word n-grams), hash each shingle, and keep only the minimum hash value across the set (the "MinHash"). For a well-mixed hash function, the probability that two documents share the same MinHash equals the Jaccard similarity of their shingle sets (the size of the intersection divided by the size of the union). By keeping the minimum under K different hash functions (or, equivalently, K random permutations of one hash function's output), each document gets a signature of K integers, and the fraction of signature positions on which two documents agree estimates their Jaccard similarity. The standard error of that estimate shrinks as 1/sqrt(K); with K = 128 it is at most about 0.044.

The dramatic value: comparing two documents reduces from O(set sizes) to O(K). Indexing the signatures with locality-sensitive hashing banding, which splits each signature into bands and treats documents that agree on any whole band as candidates, makes million-document near-duplicate detection feasible on a single machine without comparing every pair.

Production implementations include datasketch (a widely used Python library that supports MinHash, MinHashLSH, MinHashLSHForest, HyperLogLog, and other sketches) and Spark MLlib's MinHashLSH. SimHash, used by Google and others at web scale, is a related but distinct locality-sensitive hash that approximates cosine similarity rather than Jaccard similarity.

A practical recipe with datasketch: run pip install datasketch and import MinHash and MinHashLSH from datasketch. Define get_minhash(text) to create m = MinHash(num_perm=128), call m.update(shingle.encode()) for every shingle from get_shingles(text, k=5) (a helper you supply), and return m. Create lsh = MinHashLSH(threshold=0.7, num_perm=128) and call lsh.insert(doc_id, get_minhash(text)) for each (doc_id, text) pair in documents. Then duplicates = lsh.query(get_minhash(query_text)) returns the keys of indexed documents whose Jaccard similarity to the query is likely to exceed the threshold; the result is approximate, so verify candidates before discarding content.
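
To make the signature idea concrete without any third-party dependency, the following is a minimal pure-Python sketch. It is an illustration under stated assumptions, not how datasketch is implemented: the helper names (get_shingles, minhash_signature, estimate_jaccard, exact_jaccard) are hypothetical, shingles are 5-character windows, and the K hash functions are simulated by salting BLAKE2b with K different seeds.

```python
import hashlib


def get_shingles(text, k=5):
    """Return the set of overlapping k-character shingles of `text`."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash64(shingle, seed):
    """Seeded 64-bit hash of a shingle (assumption: BLAKE2b salted with the seed)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_perm=128):
    """For each of num_perm seeded hashes, keep the minimum value over the set."""
    if not shingles:
        raise ValueError("cannot sketch an empty shingle set")
    return [min(_hash64(s, seed) for s in shingles) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for checking the estimate on small inputs."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    a = get_shingles("the quick brown fox jumps over the lazy dog")
    b = get_shingles("the quick brown fox jumped over the lazy dog")
    print("exact:", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Computing K full hashes per shingle is slow for large corpora; it is done here only to keep the correspondence with the definition obvious.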
MinHash has more recently been applied to LLM training-data deduplication at billion-document scale (MinHash-based near-duplicate removal has been applied in work on corpora such as C4, RedPajama, and Dolma), as well as to legal eDiscovery, plagiarism detection in academic submissions, and content-distribution-network deduplication. For Digital Experience Platforms, MinHash enables real-time near-duplicate detection on uploaded content, so the experience does not surface five lightly edited versions of the same press release.
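
The banding step that keeps lookups tractable at these scales can be sketched in a few lines. This is a hedged illustration, not the internals of any particular library: the function names (candidate_probability, lsh_candidate_pairs) are hypothetical, and signatures are assumed to be plain lists of integers of length bands * rows, keyed by document id.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(similarity, bands, rows):
    """Chance that two documents with this Jaccard similarity agree on
    every row of at least one band: 1 - (1 - s**rows)**bands."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of document ids whose signatures match in at least one band.

    `signatures` maps a document id to a list of bands * rows integers.
    """
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # Key on the band index too, so different bands never share a bucket.
            buckets[(band, chunk)].add(doc_id)

    pairs = set()
    for doc_ids in buckets.values():
        for a, b in combinations(sorted(doc_ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # How sharply a 16-band x 8-row split of a 128-value signature separates pairs.
    for s in (0.3, 0.5, 0.7, 0.9):
        print(s, round(candidate_probability(s, bands=16, rows=8), 4))
```

With 16 bands of 8 rows, the similarity at which a pair becomes more likely than not to be a candidate sits near (1/16)^(1/8), roughly 0.71, which is why such a split suits a 0.7 threshold. More bands with fewer rows lowers that point and admits more candidates; libraries typically choose the split for you from the requested threshold.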

MinHash as the 25-year-old dedup engine of a Magic Quadrant DXP: Centralpoint deduplicates client content using MinHash-style sketching, eliminating near-duplicates before they degrade the served experience. Gartner now rewards this 25-year discipline in the Magic Quadrant for Digital Experience Platforms. MinHash runs on-premises, lineage is audit-grade, and dedup-clean experiences deploy through one line of JavaScript.


Related Keywords:
MinHash, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager