64th ISI World Statistics Congress - Ottawa, Canada

Sample surveys in the era of Big Data and Machine Learning

Organiser

Maria Giovanna Ranalli

Participants

  • Dr Maria Giovanna Ranalli
    (Chair)

  • Dr Paolo Righi
    (Presenter/Speaker)
    New Data sources for improving Official Statistics

  • Dr Jae-kwang Kim
    (Presenter/Speaker)
    Multiple bias calibration for valid statistical inference with selection bias

  • Li-Chun Zhang
    (Presenter/Speaker)
    Sampling for network function learning

  • Dr Ramón Ferri García
    (Presenter/Speaker)
    Estimation in nonprobability samples with Propensity Score Adjustment and Kernel Weighting

  • Dr Andrea Diniz da Silva
    (Discussant)

  • Category: International Association of Survey Statisticians (IASS)

    Abstract

    Estimation in nonprobability samples with Propensity Score Adjustment and Kernel Weighting
    Maria del Mar Rueda, University of Granada, Spain

    Nonprobability samples usually entail selection biases, which may arise from substantial differences between the potentially covered population and the target population. In this work, we compare two design-based methods: Propensity Score Adjustment, which estimates the propensities using predictive models such as logistic regression or machine learning classifiers, and Kernel Weighting, which combines the estimation of propensities with sample matching. We also consider weight smoothing for estimation in multipurpose surveys, where the covariates used in the adjustments may be more suitable for some target variables than for others.
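
    As a rough illustration of the Propensity Score Adjustment idea (not the estimators compared in the talk), the sketch below pools a volunteer sample with a weighted reference probability sample and estimates participation propensities within the cells of a single categorical covariate, which is what a saturated logistic model would return. The function names, the toy data, and the weight formula (1 - p) / p are assumptions made for the example; other weight formulas are also used in practice.

```python
from collections import defaultdict


def psa_weights(nonprob_x, ref_x, ref_design_weights):
    """Weights (1 - p) / p for volunteers, with p estimated per covariate cell.

    Assumption: one categorical covariate, so the propensity in each cell is
    the volunteer count divided by (volunteer count + weighted reference count).
    """
    n_vol = defaultdict(float)
    n_ref = defaultdict(float)
    for x in nonprob_x:
        n_vol[x] += 1.0
    for x, d in zip(ref_x, ref_design_weights):
        n_ref[x] += d
    weights = []
    for x in nonprob_x:
        p = n_vol[x] / (n_vol[x] + n_ref[x])  # estimated propensity in cell x
        weights.append((1.0 - p) / p)
    return weights


def weighted_mean(y, w):
    """Weighted (Hajek-type) mean of y with weights w."""
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)


# Hypothetical toy data: young people are over-represented among volunteers.
nonprob_x = ["young"] * 6 + ["old"] * 2
nonprob_y = [1, 1, 1, 0, 0, 0, 0, 0]
ref_x = ["young", "young", "old", "old"]
ref_d = [25.0, 25.0, 25.0, 25.0]  # design weights of the reference sample

w = psa_weights(nonprob_x, ref_x, ref_d)
naive = sum(nonprob_y) / len(nonprob_y)
adjusted = weighted_mean(nonprob_y, w)
print(naive, adjusted)
```

    In this toy case the adjustment down-weights the over-represented cell, so the adjusted mean moves from the naive volunteer mean towards the value implied by the reference sample's age distribution.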

    New Data sources for improving Official Statistics
    Paolo Righi, Istat

    The talk focuses on the Data Integration approach, which combines multiple sources (surveys with probabilistic samples, administrative data and Big Data) and considers two classes of estimators. The first class consists of design-based estimators that use Big Data as auxiliary information; the second class uses the probabilistic sample as a source of auxiliary information, and its estimators make model-based inference using Big Data. In the latter case, the probabilistic sample is useful for dealing with the selection bias of the non-probabilistic sample and for correcting the measurement error when the Big Data does not collect the target variable accurately. The two classes of estimators are applied to real survey data and Big Data.
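
    To make the first class concrete, here is a minimal sketch of a design-based difference estimator that uses a Big Data proxy as auxiliary information. It is not taken from the talk. It assumes, for illustration only, that the Big Data source supplies a proxy value for every population unit and can be linked to the probability sample; all names and numbers are hypothetical.

```python
def difference_estimator(big_data_proxy, sample_ids, sample_y, design_weights):
    """Estimate the population total of y.

    Total of the Big Data proxy over the population, plus a design-weighted
    correction for the proxy's error observed in the probability sample.
    """
    proxy_total = sum(big_data_proxy.values())
    correction = sum(
        d * (y - big_data_proxy[i])
        for i, y, d in zip(sample_ids, sample_y, design_weights)
    )
    return proxy_total + correction


# Hypothetical population of 6 units with an error-prone Big Data proxy.
big_data_proxy = {0: 10.0, 1: 12.0, 2: 8.0, 3: 15.0, 4: 9.0, 5: 11.0}
sample_ids = [1, 4]            # units selected in the probability sample
sample_y = [13.0, 9.5]         # accurately measured target variable
design_weights = [3.0, 3.0]    # N / n under simple random sampling

estimate = difference_estimator(big_data_proxy, sample_ids, sample_y, design_weights)
print(estimate)
```

    The correction term is where the probability sample does its work: it estimates the total measurement error of the proxy, so the estimator stays design-unbiased even when the proxy is inaccurate.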

    Propensity score weighting for handling selection bias in voluntary samples
    Jae Kwang Kim, Iowa State University

    Propensity score weighting is widely used to improve representativeness and to correct selection bias in a sample. In this talk, we consider an alternative approach that estimates the inverse of the propensity scores using the density ratio function. The smoothed density ratio function is obtained as the solution to the information projection onto the space satisfying the moment conditions on the balancing scores. The proposed approach is applicable to nonignorable selection models under some identifiability conditions.
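
    The information projection under a moment condition can be illustrated with one-dimensional exponential tilting: among all weightings of the sample, the one closest to uniform in Kullback-Leibler divergence whose weighted mean of a balancing score matches a known population mean has weights proportional to exp(lambda * b). The sketch below solves for lambda by Newton's method. It is a simplified stand-in, not the estimator proposed in the talk; the single scalar balancing score, the function name, and the toy numbers are assumptions.

```python
import math


def exponential_tilting_weights(b, target_mean, tol=1e-10, max_iter=100):
    """Normalised weights proportional to exp(lam * b_i) with weighted mean of b
    equal to target_mean. Assumes target_mean lies strictly inside the range of b."""
    lam = 0.0
    for _ in range(max_iter):
        e = [math.exp(lam * bi) for bi in b]
        total = sum(e)
        mean = sum(ei * bi for ei, bi in zip(e, b)) / total
        var = sum(ei * (bi - mean) ** 2 for ei, bi in zip(e, b)) / total
        gap = mean - target_mean
        if abs(gap) < tol:
            break
        if var == 0.0:
            raise ValueError("balancing score has no variation in the sample")
        lam -= gap / var  # Newton step: d(mean)/d(lam) equals the weighted variance
    e = [math.exp(lam * bi) for bi in b]
    total = sum(e)
    return [ei / total for ei in e]


# Hypothetical sample: the score equals 1 for a quarter of the sample,
# while its known population mean is 0.5.
b = [0.0, 0.0, 0.0, 1.0]
weights = exponential_tilting_weights(b, target_mean=0.5)
print(weights)
```

    Multiplying these normalised weights by the population size gives values that play the role of estimated inverse propensities, since the density ratio between population and sample is proportional to the inverse of the selection propensity.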

    Sampling for network function learning
    Li-Chun Zhang

    To define what we call network functions, let us first envisage a valued graph, where the nodes represent the units and the edges the connections among them, and both the nodes and the edges may additionally be associated with values. Any network function for a given unit must then be defined in terms of both the corresponding node and the nodes connected to it, as well as the associated values. A basic difficulty for learning such network functions arises when the edges of the graph are unknown to start with, even when the entire set of nodes is known, such that the edges can only be partly observed by sampling from the collection of nodes and edges, i.e. the graph. In this talk, we consider the feasibility of a graph sampling approach to network function learning, as well as the corresponding learning methods based on sample graphs.
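
    The partial observation of edges can be illustrated with induced subgraph sampling: draw a simple random sample of nodes, observe only the edges whose two endpoints are both sampled, and scale up with a Horvitz-Thompson estimator. The sketch below estimates the total number of edges, which is a much simpler target than the network functions of the talk and is used here only to show the sampling mechanism. The toy graph and function names are hypothetical, and the sample is assumed to contain at least two nodes.

```python
from itertools import combinations


def induced_edges(edges, sample_nodes):
    """Edges observed when only links between two sampled nodes are seen."""
    s = set(sample_nodes)
    return [(u, v) for (u, v) in edges if u in s and v in s]


def ht_edge_count(edges, sample_nodes, population_size):
    """Horvitz-Thompson estimate of the number of edges under simple random
    sampling of n nodes without replacement from population_size nodes."""
    n = len(set(sample_nodes))
    N = population_size
    pi_edge = n * (n - 1) / (N * (N - 1))  # probability both endpoints are sampled
    return len(induced_edges(edges, sample_nodes)) / pi_edge


# Hypothetical graph with 5 nodes and 6 edges.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]

# Average the estimate over every possible sample of 3 nodes.
estimates = [ht_edge_count(edges, s, len(nodes)) for s in combinations(nodes, 3)]
average_estimate = sum(estimates) / len(estimates)
print(average_estimate, len(edges))
```

    Averaging over all possible samples recovers the true edge count, which is the design-unbiasedness property that makes such sample graphs usable as a basis for inference.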