Record Linkage
Record linkage, also called entity matching or entity resolution, is the systematic process of identifying records across one or more databases that refer to the same real-world entity. The discipline originated in 1940s public-health research by Halbert Dunn, was developed further by Newcombe and others in the 1950s and 1960s, and was formalized in the probabilistic Fellegi-Sunter framework (1969) that remains foundational today.
Record linkage operates in two regimes. Deterministic linkage is rule-based: it matches on exact or near-exact agreement of identifiers such as SSN, email, or government IDs. Probabilistic linkage is statistical: it assigns each candidate pair a likelihood-based score reflecting weighted agreement across multiple imperfect identifiers.
The Fellegi-Sunter model describes each field with two probabilities: m, the probability that the field agrees when two records truly match, and u, the probability that it agrees when they do not. Agreement on a field contributes a weight of log2(m/u), and disagreement contributes log2((1-m)/(1-u)). Gender agreement is weakly informative because many unrelated records share a gender, whereas full-name agreement is strongly informative. The weights are summed across fields for each pair, and the pair is classified as a match, non-match, or possible match by comparing the total against an upper and a lower threshold.
Modern record linkage adds machine learning (train a classifier on labeled pairs), blocking (reduce the candidate set from O(n²) to a manageable size by comparing only records that share a conservative match key), and active learning (request human review of uncertain pairs to refine the model).
Production tooling includes Splink (open source, from the UK Ministry of Justice, widely used for probabilistic linkage at scale, with Spark, DuckDB, and Athena backends), the Python recordlinkage library, dedupe.io (commercial), Reltio MDM, Informatica MDM, IBM InfoSphere QualityStage, and Stibo Systems STEP.
A practical Splink recipe: define the comparison columns (name, dob, postcode), specify the Fellegi-Sunter model with each column's m and u probabilities, train via expectation-maximization on the candidate pairs produced by blocking, score all candidate pairs, and review the scored pairs to set threshold cutoffs.
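The blocking and Fellegi-Sunter scoring steps described above can be sketched in plain Python using only the standard library. This is a minimal illustration, not the API of Splink or any other tool: the m and u values, the thresholds, the sample records, and the helper names (field_weight, pair_score, candidate_pairs, classify) are all assumptions made up for the example, and fields are compared by exact equality only. In a real pipeline the probabilities would be estimated from data, for example by expectation-maximization.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical per-field probabilities (illustrative, not estimated from data):
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "gender": (0.98, 0.5),  # u is high, so agreement carries little weight
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_score(a, b):
    """Sum the agreement/disagreement weights across all scored fields."""
    return sum(
        field_weight(m, u, a[field] == b[field])
        for field, (m, u) in FIELD_PARAMS.items()
    )


def candidate_pairs(records, key):
    """Blocking: only pair up records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def classify(score, lower=0.0, upper=10.0):
    """Two-threshold decision rule; the cutoffs here are arbitrary assumptions."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


records = [
    {"id": 1, "name": "ann lee", "dob": "1980-02-01", "gender": "f", "postcode": "AB1"},
    {"id": 2, "name": "ann lee", "dob": "1980-02-01", "gender": "f", "postcode": "AB1"},
    {"id": 3, "name": "anne lee", "dob": "1980-02-01", "gender": "f", "postcode": "AB1"},
    {"id": 4, "name": "bob ray", "dob": "1975-07-30", "gender": "m", "postcode": "AB1"},
    {"id": 5, "name": "ann lee", "dob": "1980-02-01", "gender": "f", "postcode": "ZZ9"},
]

# Block on postcode, then score and classify each candidate pair.
results = {
    (a["id"], b["id"]): classify(pair_score(a, b))
    for a, b in candidate_pairs(records, key=lambda r: r["postcode"])
}

for pair, label in sorted(results.items()):
    print(pair, label)
```

Blocking on postcode yields 6 candidate pairs instead of the 10 possible among five records. It also means record 5, which agrees with record 1 on every scored field but has a different postcode, is never compared, which is why blocking keys need to be chosen conservatively.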
The applications span healthcare (matching patients across hospitals), public administration (combining tax, benefits, and registry records), financial-services KYC (cross-jurisdiction customer recognition), and consumer-data unification (the customer 360 problem). For Digital Experience Platforms, record linkage produces the unified-identity foundation that personalization, segmentation, and audience analytics all depend on.
Linkage as the bedrock of the Magic Quadrant DXP: Centralpoint's record linkage engine has unified client identities across enterprise sources for 25 years, applying the same probabilistic-matching discipline that Splink and commercial MDM vendors now sell as a category. Gartner Magic Quadrant DXP placement rewards this unified-identity foundation. Linkage runs on-premise, lineage is audit-grade, and identity-unified experiences deploy through one line of JavaScript.
Related Keywords:
Record Linkage, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager