Using Active Learning to Improve Quality of Machine Learning Models for the Canadian Census
64th ISI World Statistics Congress - Ottawa, Canada
Format: CPS Abstract
Keywords: machine learning, nlp
Session: CPS 29 - Census II
Monday 17 July 5:30 p.m. - 6:30 p.m. (Canada/Eastern)
In 2021, Statistics Canada used a natural language processing algorithm to code a significant portion of its Census data. Like any supervised machine learning algorithm, this required a large amount of labelled data. However, some variables underwent significant changes to the approved label set between Census cycles, which run five years apart at Statistics Canada. As a result, only a small portion of the data labelled for the previous Census still carried valid labels, leaving a large amount of effectively unlabelled data that was no longer suitable for model training.
Prior to production, we faced a decision: train a model using only the records that remained labelled, or attempt to relabel the now-unlabelled data to fit the new set of approved labels. Relabelling is extremely labour-intensive and cost-prohibitive; however, a model trained only on the small portion of data with valid labels would ultimately perform poorly compared with a model trained on all the data. Due to time constraints and resource limitations, the simpler model using less data was implemented for most of our variables in 2021.
For the next Census in 2026, we endeavour to find a better solution that will increase model quality while respecting the fact that relabelling data comes at a cost. To this end, we investigate the use of active learning, an iterative method that begins with a simple model and intelligently selects records to label based on the results of that model. These records are then labelled and added to the training data, the model is retrained, and the process repeats. The goal is to select for relabelling the records that will achieve the greatest increase in quality per record labelled. We consider different selection methods, as well as different numbers of records selected per iteration, for two of our Census variables. This presentation will discuss the situations in which Statistics Canada stands to gain from active learning, the methods evaluated, and our results.
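To make the iterative loop concrete, the following is a minimal sketch of pool-based active learning. It is not the method evaluated in this work: the nearest-centroid classifier, the margin-based selection rule, and the names `fit_centroids`, `margin`, `predict`, `active_learning`, and `oracle` are all assumptions chosen for illustration. Records are assumed to be numeric feature vectors (in practice, text responses would first be converted to features), and `oracle` is a hypothetical stand-in for a human coder who supplies the approved label.

```python
import math


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))


def margin(centroids, x):
    """Gap between the two closest centroids; a small gap means an uncertain record."""
    d = sorted(math.dist(x, c) for c in centroids.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def active_learning(X_lab, y_lab, X_pool, oracle, batch_size, n_rounds):
    """Repeatedly label the most uncertain pool records and retrain.

    oracle: hypothetical callable that returns the label for a record
            (stands in for manual coding).
    batch_size: number of records selected per iteration.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_pool)
    model = fit_centroids(X_lab, y_lab)  # simple starting model
    for _ in range(n_rounds):
        if not pool:
            break
        # Rank unlabelled records from most to least uncertain.
        pool.sort(key=lambda x: margin(model, x))
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Label the selected records and add them to the training data.
        for x in batch:
            X_lab.append(x)
            y_lab.append(oracle(x))
        model = fit_centroids(X_lab, y_lab)  # retrain on the enlarged set
    return model, X_lab, y_lab
```

In this sketch, the selection method corresponds to the sort key (here, the margin) and the number of records selected per iteration corresponds to `batch_size`; swapping in a different key or batch size is how alternative configurations could be compared under these assumptions.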