64th ISI World Statistics Congress - Ottawa, Canada

Building a record linkage engine for socio-demographic data


Keywords: administrative data, probabilistic linkage, register

Résil is the name of the French program aimed at building statistical registers of individuals and housing. Built by combining a set of administrative data sources, the registers will contain tens of millions of individuals.

Once the registers are built, it is planned to offer French official statisticians a data enrichment service. A record linkage engine is being developed as part of this service. When a file is submitted, the goal is to match each of its individuals to an individual from the register in order to add the desired information.

A record linkage process depends heavily on the data to be linked. Building a standardized record linkage engine for very diverse data sources (in terms of both available variables and data quality) therefore raises many challenges.

This paper goes through these challenges and the responses to them.

The adopted solution uses Elasticsearch, a search engine that was not originally designed for record linkage but nevertheless performs well on this task, especially for very large datasets. The engine works alongside a pair-labeling interface for clerical review, which is used to assess linkage quality and adapt model parameters accordingly.
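
To illustrate how a search engine can serve record linkage, the sketch below builds an Elasticsearch query body that retrieves candidate register entries for one submitted record. It is only an illustration, not the engine's actual configuration: the field names (`last_name`, `first_name`, `birth_date`, `birth_place_code`), the function name and the choice of fuzzy versus exact clauses are all assumptions. Only standard query DSL constructs (`bool`, `should`, `match` with `fuzziness`, `term`) are used.

```python
from typing import Any

# Hypothetical field names; the real register schema is not described here.
FUZZY_FIELDS = ("last_name", "first_name")
EXACT_FIELDS = ("birth_date", "birth_place_code")


def build_candidate_query(record: dict[str, str], size: int = 10) -> dict[str, Any]:
    """Build a search body that retrieves candidate matches for one record.

    Only the variables present in the submitted record contribute a clause,
    so the same function can serve files with different available variables.
    """
    should: list[dict[str, Any]] = []

    # Fuzzy full-text clauses tolerate spelling variations in names.
    for field in FUZZY_FIELDS:
        if record.get(field):
            should.append(
                {"match": {field: {"query": record[field], "fuzziness": "AUTO"}}}
            )

    # Exact clauses on structured variables.
    for field in EXACT_FIELDS:
        if record.get(field):
            should.append({"term": {field: record[field]}})

    return {
        "size": size,  # number of candidates returned, ranked by relevance score
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


query = build_candidate_query(
    {"last_name": "Dupont", "first_name": "Marie", "birth_date": "1980-05-17"}
)
```

The resulting dictionary could be sent to the `_search` endpoint of an index holding the register. The ranked candidates could then be compared with the submitted record, and a sample of the resulting pairs passed to clerical review.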