
Sample surveys in the era of Big Data and Machine Learning

Author

Maria del Mar Rueda

Co-author

Ramón Ferri
Beatriz Cobo
Jorge Rueda

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: IPS Abstract

Keywords: kernel methods, propensity score adjustment, selection bias

Abstract

The rise of new survey methods and large-volume datasets for estimating population parameters has led to increased use of nonprobability samples. These samples usually entail selection bias, which can arise from substantial differences between the potentially covered population and the target population. The methods proposed in the literature to deal with selection bias can be grouped into design-based adjustments, which estimate the unknown inclusion probabilities, and model-based adjustments, which predict the unknown values of the target variable in a probability sample drawn from the same population. The former might be more suitable when there are multiple target variables, which is often the case in official statistics. In this work, we compare two design-based methods: Propensity Score Adjustment, which estimates the inclusion probability using predictive models such as logistic regression or machine learning classifiers, and Kernel Weighting, which combines the estimation of inclusion probabilities with sample matching. We also consider weight smoothing for estimation in multipurpose surveys, where the covariates used in the adjustments may be more suitable for some target variables than for others.
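As an illustration of the Propensity Score Adjustment idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a toy dataset with a single binary covariate, a reference probability sample with equal design weights, a plain logistic regression fitted by Newton-Raphson, and odds-type weights (1 - p) / p combined in a ratio-type weighted mean. Other weighting formulas exist, and the abstract does not say which one is used. All function and variable names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, zs, iters=25):
    """Newton-Raphson fit of P(z=1 | x) = sigmoid(b0 + b1*x), one covariate."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = sigmoid(b0 + b1 * x)
            r, v = z - p, p * (1.0 - p)
            g0 += r                      # gradient of the log-likelihood
            g1 += r * x
            h00 += v                     # entries of the 2x2 information matrix
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def psa_weights(x_nonprob, x_ref):
    """Stack both samples, model membership in the nonprobability sample,
    and return odds-type weights (1 - p) / p (an assumed choice)."""
    xs = list(x_nonprob) + list(x_ref)
    zs = [1] * len(x_nonprob) + [0] * len(x_ref)
    b0, b1 = fit_logistic(xs, zs)
    props = [sigmoid(b0 + b1 * x) for x in x_nonprob]
    return [(1.0 - p) / p for p in props]

def weighted_mean(ys, ws):
    return sum(w * y for y, w in zip(ys, ws)) / sum(ws)

# Toy data (invented for illustration): units with x = 1 are over-represented
# in the nonprobability sample relative to the reference sample.
x_nonprob = [1] * 80 + [0] * 20
y_nonprob = [1] * 60 + [0] * 20 + [1] * 5 + [0] * 15
x_ref = [1] * 50 + [0] * 50          # reference probability sample, equal weights

weights = psa_weights(x_nonprob, x_ref)
naive_mean = sum(y_nonprob) / len(y_nonprob)
psa_mean = weighted_mean(y_nonprob, weights)
print(naive_mean, psa_mean)
```

In this toy setting the unadjusted mean is 0.65, while the adjusted mean is approximately 0.50, which matches the covariate mix of the reference sample. With real data the covariates would be multivariate, the reference sample would carry its own design weights, and the logistic model could be replaced by a machine learning classifier, as the abstract notes.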