64th ISI World Statistics Congress - Ottawa, Canada

Developing a comparison metric for survey question compliance: A case in utilizing open-source software and methodologies within the US Census Bureau


Dr Sheldon Waugh



Format: CPS Abstract

Keywords: machine learning

Session: CPS 38 - Survey statistics I

Tuesday 18 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)


The U.S. Census Bureau collects extensive survey paradata, including recordings of the questions and interactions between field representatives/interviewers (FRs) and survey respondents. Computer Audio-Recorded Interview (CARI) is a technique for recording portions of interviews. The CARI Interactive Data Access (CIDA) System provides a way to listen to these recordings and view the associated files. The Survey of Income and Program Participation (SIPP) is currently the only survey within CIDA with respondents' informed consent. Combined with machine learning, data engineering, and software development, open-source algorithms and novel methodologies can enable audio transcription, converting recorded questions and answers into actionable data. Providing an in-house technical solution within the federal government, without external contracts, maximizes the return on government investment in Machine Learning Operations (MLOps).
The U.S. Census Bureau's Field Quality Monitoring (FQM) team in the Office of Survey and Census Analytics (OSCA) created a program to monitor, in near-real time, FRs working on multiple surveys. FRs were investigated if they were the prime actor in a collection area with anomalous data in any of a select set of metrics. A significant issue for FQM is the ability to confirm and track the effective and correct dictation of survey questions. FRs must dictate the survey questions as written to ensure clear, effective, and consistent comprehension among survey respondents, which increases consistency and sheds light on poorly worded questions or requests.
We have developed a machine learning pipeline that uses Linux servers, Application Programming Interfaces (APIs), and open-source Natural Language Processing (NLP) models to provide audio transcription and relational scoring. The pipeline matches transcribed audio with survey questions, using distance metrics and fuzzy matching to create a comparison metric. We developed an API with a REpresentational State Transfer (REST) architecture to house the NLP models that transcribe the incoming audio files obtained from CIDA. The open-source model, from the Python package Transformers, is a large NLP model fine-tuned on 960 hours of LibriSpeech speech audio sampled at 16 kHz. The resulting transcribed text is compared to the corresponding survey question using Levenshtein distance, a metric of how far apart two sequences of words are. It measures the minimum number of edits needed to change one word sequence into another. These edits can be insertions, deletions, or substitutions.
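The word-level comparison described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names (`tokenize`, `levenshtein`, `compliance_score`), the tokenization rule, the sample question, and the choice to normalize the distance by the longer sequence length are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation.

    This normalization is an assumption; the abstract does not describe one.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def levenshtein(a, b):
    """Return the minimum number of insertions, deletions, or substitutions
    needed to turn sequence a into sequence b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a yields the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def compliance_score(transcript, question):
    """Hypothetical comparison metric: 1.0 means the transcript matches the
    question word for word; lower values mean more word-level edits."""
    t, q = tokenize(transcript), tokenize(question)
    longest = max(len(t), len(q))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(t, q) / longest


if __name__ == "__main__":
    # Invented example texts, not taken from any survey instrument.
    question = "Did you receive any income from a job last month?"
    transcript = "did you get any income from a job last month"
    print(levenshtein(tokenize(transcript), tokenize(question)))
    print(compliance_score(transcript, question))
```

In this sketch one substituted word out of ten gives a distance of 1. Dividing by the longer sequence length keeps the score between 0 and 1, so questions of different lengths can be compared on the same scale; other normalizations would be equally defensible.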
This project summary details the progress and pitfalls of our attempt to produce a sustainable NLP pipeline using open-source programs and methodologies.
The goal is to create a separate environment within a contained server area and run the API there, so that audio converted to arrays can be quickly transcribed and returned. This environment will be incorporated within OSCA's current server and data processing architecture. This pipeline represents a novel foray into developing a complete in-house effort to execute open-source machine learning and complete data automation. Additionally, the ML pipeline increases return on investment by lowering cost, because staff need not be hired to listen to and transcribe recordings.