64th ISI World Statistics Congress - Ottawa, Canada

Fuzzy matching on big-data: an illustration with scanner and crowd-sourced nutritional datasets

Author

Lino Galiana

Co-author

  • Milena Suarez Castillo

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: CPS Abstract

Keywords: deep_learning, nlp, probabilisticlinkage

Session: CPS 72 - Statistics and health III

Wednesday 19 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)

Abstract

Food retailers' scanner data offer unprecedented detail on local consumption, as long as product identifiers can be linked to features of interest, such as nutritional information.

In this paper, we enrich a large retailer dataset with nutritional information extracted from crowd-sourced and administrative nutritional datasets. To compensate for imperfect matching on barcodes, we develop a methodology to efficiently match short textual descriptions. After a preprocessing step that normalizes the short labels, we perform fuzzy matching based on several tokenizers (including n-grams) by querying a customized Elasticsearch index. We then validate the returned candidates as matches using a Levenshtein edit distance and an embedding-based similarity measure derived from a Siamese neural network model. The pipeline consists of several steps that successively relax constraints to find relevant matching candidates.
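As an illustration only, the following Python sketch mimics the shape of such a pipeline on a toy scale: label normalization, candidate retrieval by character n-gram overlap, validation by Levenshtein edit distance, and successive relaxation of the thresholds. It is not the authors' implementation. The in-memory Jaccard overlap stands in for the Elasticsearch query, the Siamese embedding similarity is omitted, and the function names, the thresholds in `STEPS` and the product labels are all hypothetical.

```python
import re
import unicodedata


def normalize(label: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of an already normalized label."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of n-grams common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


# Hypothetical thresholds, loosened step by step:
# (minimum n-gram overlap, minimum edit similarity)
STEPS = [(0.8, 0.9), (0.5, 0.75), (0.3, 0.6)]


def match(label, reference, steps=STEPS):
    """Return (reference label, step index) for the first step that
    yields a validated candidate, or None if every step fails."""
    query = normalize(label)
    query_grams = char_ngrams(query)
    normalized = [(ref, normalize(ref)) for ref in reference]
    for step, (min_overlap, min_similarity) in enumerate(steps):
        # Candidate retrieval (stand-in for the search-index query).
        candidates = [(ref, norm) for ref, norm in normalized
                      if jaccard(query_grams, char_ngrams(norm)) >= min_overlap]
        # Validation with the edit-distance similarity.
        scored = [(edit_similarity(query, norm), ref) for ref, norm in candidates]
        scored = [pair for pair in scored if pair[0] >= min_similarity]
        if scored:
            return max(scored)[1], step
    return None


# Invented labels, for illustration only.
reference = ["Yaourt nature 4x125g", "Crème brûlée 2x100g", "Jus d'orange 1L"]
print(match("YAOURT NATURE 4X125G", reference))
print(match("yaourt natur 4x125", reference))
print(match("chocolat noir", reference))
```

In this toy setting, a label that is identical after normalization is accepted at the strictest step, a truncated label is only accepted once the constraints are relaxed, and an unrelated label is left unmatched. Recording the step index alongside the match makes it possible to assess match quality separately for each level of relaxation.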

Figures/Tables

header

violin_step

barplot_siamese_coicop

wordcloud_relevanc_start

wordcloud_relevanc_clean