Using Natural Language Processing to Classify Administrative Data of Purchased Products
64th ISI World Statistics Congress - Ottawa, Canada
Format: CPS Paper
Keywords: data science, machine learning, statistics-based-on-data-science, supervised learning
Session: CPS 71 - Aspects of official statistics IV
Wednesday 19 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)
In cooperation with the National Tax and Customs Administration and the Institute of Agricultural Economics Non-profit Ltd., the Hungarian Central Statistical Office is developing new statistical processes to improve the quality of household consumption expenditure estimates. This project draws on multiple data sources, most importantly data from the Online Cash Registers (OCR) and the Online Invoice System (OIS). Based on these data sources, we aim to build a machine-learning-based methodology that assigns purchased items to the appropriate COICOP and CN categories.
Unlike scanner data, our sources do not include bar code identifiers, so our current experiments rely heavily on Natural Language Processing (NLP) techniques applied to the item names recorded in the purchases. These item names are assigned by the retailers themselves and are often, especially in the OCR data, mere abbreviations of the product name, which makes even manual coding difficult. In the NLP literature, the usual approaches to short texts were developed for Twitter data, where word vectors and TF-IDF representations are the most common. Our data are an exception again: the texts tend to be even shorter (about 5-50 characters long) and usually lack the coherence of a sentence. We therefore chose the simplest approaches to preprocess the texts, e.g. one-hot encoding and frequency-based bag-of-words.
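To make the two preprocessing options concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes whitespace tokenization and reads "one-hot encoding" as a binary token-presence vector; the item names and the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`, `one_hot`) are hypothetical and invented for illustration.

```python
from collections import Counter


def tokenize(item_name):
    """Lowercase and split on whitespace; abbreviations are kept as they are."""
    return item_name.lower().split()


def build_vocabulary(item_names):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for name in item_names for tok in tokenize(name)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(item_name, vocabulary):
    """Frequency-based vector: how often each vocabulary token occurs."""
    counts = Counter(tokenize(item_name))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen in the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


def one_hot(item_name, vocabulary):
    """Binary vector: 1 if the token occurs at all, 0 otherwise."""
    return [1 if c > 0 else 0 for c in bag_of_words(item_name, vocabulary)]


# Hypothetical item names (assumption: not taken from the OCR or OIS data).
names = ["milk 1l uht", "choc bar 100g", "milk choc milk"]
vocab = build_vocabulary(names)

counts_vec = bag_of_words("milk choc milk", vocab)
binary_vec = one_hot("milk choc milk", vocab)
```

The two vectors differ only where a token repeats within an item name, which is rare in texts this short; that is one reason why such simple representations can be adequate here.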
So far we have experimented with several ML techniques to see which could deliver the depth and precision required for the classifications. Experiments with unsupervised models, especially Latent Dirichlet Allocation, show that they provide neither the depth nor the precision needed for a general solution, although they perform reasonably well for some categories when the text data are detailed enough. Supervised learning, however, provides a set of tools that performed surprisingly well on our test samples, even though the training sets were small and unbalanced across categories. The three best-performing model types are logistic regression, random forests and multilayer perceptrons. The paper elaborates on these methods and on the results of the research.
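As an illustration of the supervised setting, the sketch below trains a multinomial logistic regression on bag-of-words features using only the Python standard library. It is a toy example under stated assumptions, not the models or data used in the project: the item names, the category labels ("dairy", "confectionery"), the learning rate and the epoch count are all hypothetical. Random forests and multilayer perceptrons are not shown.

```python
import math


def featurize(name, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in name.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(x, W, b):
    """Linear score of each class for feature vector x."""
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]


def train(X, y, n_classes, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on cross-entropy."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(class_scores(x, W, b))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x):
                    W[k][j] -= lr * err * v
                b[k] -= lr * err
    return W, b


# Hypothetical, deliberately small and unbalanced training set.
train_items = [
    ("milk 1l uht", "dairy"),
    ("yoghurt strawb", "dairy"),
    ("choc bar 100g", "confectionery"),
    ("gummy bears", "confectionery"),
    ("milk 2.8%", "dairy"),
]
labels = sorted({lab for _, lab in train_items})
tokens = sorted({t for name, _ in train_items for t in name.lower().split()})
vocab = {t: i for i, t in enumerate(tokens)}

X = [featurize(name, vocab) for name, _ in train_items]
y = [labels.index(lab) for _, lab in train_items]
W, b = train(X, y, n_classes=len(labels))


def predict(name):
    """Return the most probable label for an item name."""
    scores = class_scores(featurize(name, vocab), W, b)
    return labels[max(range(len(scores)), key=scores.__getitem__)]


train_predictions = [predict(name) for name, _ in train_items]
new_prediction = predict("uht milk")
```

With real data, the categories would be COICOP or CN codes and performance would be measured on held-out samples; the toy set here is only large enough to show the mechanics.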