Building a record linkage engine for socio-demographic data
64th ISI World Statistics Congress - Ottawa, Canada
Format: CPS Abstract
Keywords: administrative data, probabilistic linkage, register
Session: CPS 71 - Aspects of official statistics IV
Wednesday 19 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)
Résil is the name of the French program aimed at building statistical registers of individuals and housing. Built by combining a set of administrative data sources, the registers will contain tens of millions of individuals.
Once the registers are built, it is planned to offer French official statisticians a data enrichment service. A record linkage engine is being developed as part of this service. When a file is submitted, the goal is to match each of its individuals to an individual from the register in order to add the desired information.
A record linkage process depends heavily on the data to be linked. Building a standardized record linkage engine for very diverse data sources (in terms of both available variables and data quality) therefore raises many challenges.
This paper goes through these challenges and the responses to them.
The adopted solution uses Elasticsearch, a search engine that was not originally designed for record linkage but nevertheless performs well on this task, especially for very large datasets. The engine works alongside a pair-labeling interface for clerical review, which is used to assess linkage quality and adapt model parameters accordingly.
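As an illustration only, the kind of candidate search a search engine makes possible can be sketched as an Elasticsearch query body. The abstract does not describe the actual queries used by the engine; the field names (last_name, first_name, birth_date), the boost value and the function name below are hypothetical, and the sketch only builds the request body without contacting a cluster.

```python
from typing import Any


def build_candidate_query(record: dict[str, str], size: int = 10) -> dict[str, Any]:
    """Build a search body that retrieves register candidates for one record.

    Hypothetical sketch: field names and the boost are assumptions,
    not the parameters of the engine described in the abstract.
    """
    should: list[dict[str, Any]] = []

    # Names tolerate typos through fuzzy matching (edit distance chosen by "AUTO").
    for field in ("last_name", "first_name"):
        value = record.get(field)
        if value:
            should.append({"match": {field: {"query": value, "fuzziness": "AUTO"}}})

    # An exact birth date agreement is weighted more heavily than a name match.
    birth_date = record.get("birth_date")
    if birth_date:
        should.append({"term": {"birth_date": {"value": birth_date, "boost": 2.0}}})

    if not should:
        raise ValueError("record has no usable identifying field")

    # Missing variables are simply skipped, so files with different
    # available variables can go through the same query builder.
    return {
        "size": size,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


if __name__ == "__main__":
    body = build_candidate_query(
        {"last_name": "Dupont", "first_name": "Marie", "birth_date": "1980-05-17"}
    )
    print(body)
```

The candidates returned by such a query, ranked by relevance score, would then be the pairs submitted to a decision rule or to clerical review; how scores are turned into match decisions is outside what this sketch covers.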