64th ISI World Statistics Congress - Ottawa, Canada

IPS 287 - Sample surveys in the era of Big Data and Machine Learning

Category: IPS
Monday 17 July 10 a.m. - noon (Canada/Eastern), Room 209

New data sources and data science tools are profoundly changing the way statistical knowledge is produced. This session brings together examples of cutting-edge theoretical and applied research that highlight the role of probability samples as the cornerstone of successful statistical data analysis and/or integration. The session consists of four talks and one discussion. In all talks, a (possibly small) probability sample is used to adjust for the bias arising from self-selected samples or Big Data sources and/or to extract information effectively from extremely large, structured data.

The talk by Maria del Mar Rueda (University of Granada, Spain) provides an up-to-date treatment of inference from non-probability samples, including the use of machine learning methods, and the talk by Paolo Righi (Italian National Statistical Office) presents an application of these methods leading to the production of Experimental Official Statistics in Italy. The talk by Jae Kwang Kim (Iowa State University, US) looks at data integration that can handle a missing-not-at-random situation using information from a small representative sample, while the contribution of Li-Chun Zhang (Statistics Norway, University of Southampton) shows how sampling is necessary to obtain information on parameters defined on extremely large, complex and dynamic network structures. Andrea Diniz da Silva (Instituto Brasileiro de Geografia e Estatística), who recently provided an overview of a consultation with National Statistical Offices in Latin America and the Caribbean on the use of Big Data, will act as discussant.

Estimation in nonprobability samples with Propensity Score Adjustment and Kernel Weighting
Maria del Mar Rueda, University of Granada, Spain

Nonprobability samples usually entail selection biases that may arise from substantial differences between the potentially covered population and the target population. In this work, we compare two design-based methods: Propensity Score Adjustment, which estimates the propensities using predictive models such as logistic regression or machine learning classifiers, and Kernel Weighting, which combines propensity estimation with sample matching. We also consider weight smoothing for estimation in multipurpose surveys, where the covariates used in the adjustments may be more suitable for some target variables than for others.
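
As a rough illustration of Propensity Score Adjustment, the sketch below stacks a nonprobability sample with a reference probability sample, fits a weighted logistic model for membership in the nonprobability sample, and weights each volunteer by the estimated odds (1 - p) / p. The toy data, the function names, the single binary covariate and the choice of odds weights (one of several weightings in use) are assumptions made for illustration; this is not the speaker's implementation.

```python
import math


def fit_weighted_logit(x, z, w, n_iter=25):
    """Newton-Raphson fit of P(z = 1 | x) = expit(b0 + b1 * x) with case weights w."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi, wi in zip(x, z, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (zi - p)        # weighted score contribution
            v = wi * p * (1.0 - p)   # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 information matrix H
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def psa_mean(y_np, x_np, x_ref, d_ref):
    """Weighted mean of y in the nonprobability sample using odds weights."""
    # Stack both samples: z = 1 for volunteers (weight 1),
    # z = 0 for reference units (design weight d).
    x = list(x_np) + list(x_ref)
    z = [1] * len(x_np) + [0] * len(x_ref)
    w = [1.0] * len(x_np) + list(d_ref)
    b0, b1 = fit_weighted_logit(x, z, w)
    # (1 - p) / p equals exp(-(b0 + b1 * x)) under the logistic model
    weights = [math.exp(-(b0 + b1 * xi)) for xi in x_np]
    return sum(wi * yi for wi, yi in zip(weights, y_np)) / sum(weights)


# Hypothetical toy data: volunteers over-represent the group with x = 1.
y_np = [10.0] * 8 + [2.0] * 2
x_np = [1] * 8 + [0] * 2
x_ref = [1] * 5 + [0] * 5    # reference sample, balanced on x
d_ref = [100.0] * 10         # design weights of the reference sample

naive = sum(y_np) / len(y_np)
adjusted = psa_mean(y_np, x_np, x_ref, d_ref)
print(naive, adjusted)
```

On this toy data the unadjusted mean is 8.4, while the adjusted mean is close to 6, the value implied by the balanced reference sample.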

New Data sources for improving Official Statistics
Paolo Righi, Istat

The talk focuses on the data integration approach, which combines multiple sources (probability sample surveys, administrative data and Big Data), and considers two classes of estimators. The first class consists of design-based estimators that use Big Data as auxiliary information; the second class uses the probability sample as a source of auxiliary information, and its estimators make model-based inference from Big Data. In the latter case, the probability sample helps to deal with the selection bias of the non-probability sample and to correct the measurement error when the Big Data source does not measure the target variable accurately. Both classes of estimators are applied to real survey data and Big Data.
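
One simple member of the first class is the classical difference estimator, sketched below as an illustration only. It assumes, hypothetically, that a Big Data proxy x is available for every population unit (so its total is known) and that the probability sample observes both the target variable y and the proxy x. All names and numbers are made up, and the estimators presented in the talk may differ.

```python
def difference_estimator(y_s, x_s, d_s, x_total):
    """Estimate the population total of y as X_total + sum_s d_i * (y_i - x_i)."""
    # The design-weighted sample residuals correct the known proxy total.
    correction = sum(d * (y - x) for y, x, d in zip(y_s, x_s, d_s))
    return x_total + correction


# Hypothetical inputs
x_total = 1000.0              # proxy total from the Big Data source
y_s = [12.0, 9.0, 11.0]       # target variable observed in the sample
x_s = [11.0, 10.0, 10.0]      # proxy values for the same sampled units
d_s = [10.0, 10.0, 10.0]      # design weights

total_hat = difference_estimator(y_s, x_s, d_s, x_total)
print(total_hat)
```

The closer the proxy tracks the target variable, the smaller the residuals and hence the variance of the correction term.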

Propensity score weighting for handling selection bias in voluntary samples
Jae Kwang Kim, Iowa State University

Propensity score weighting is widely used to improve the representativeness of a sample and to correct its selection bias. In this talk, we consider an alternative approach that estimates the inverse of the propensity scores using the density ratio function. The smoothed density ratio function is obtained as the solution to the information projection onto the space satisfying the moment conditions on the balancing scores. The proposed approach is applicable to nonignorable selection models under some identifiability conditions.
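
A much-simplified sketch of the information projection idea is given below. Starting from uniform weights on the voluntary sample, it finds the closest weights in Kullback-Leibler divergence that satisfy a single moment condition, namely that the weighted mean of a covariate x matches a known population mean. The solution has the exponential tilting form w_i proportional to exp(lambda * x_i). The data, the function names, the single covariate and the bisection solver are illustrative assumptions, and the sketch does not cover the nonignorable case treated in the talk.

```python
import math


def tilted_weights(x, lam):
    """Weights proportional to exp(lam * x_i), normalised to sum to one."""
    m = max(lam * xi for xi in x)                 # subtract max for numerical stability
    e = [math.exp(lam * xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]


def solve_tilt(x, target_mean, lo=-20.0, hi=20.0, n_iter=200):
    """Bisection on lam: the tilted mean of x is increasing in lam."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        w = tilted_weights(x, mid)
        mean = sum(wi * xi for wi, xi in zip(w, x))
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical voluntary sample that over-represents large x
x = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
y = [3.0, 5.0, 4.0, 7.0, 6.0, 8.0, 9.0, 8.0, 10.0, 9.0]
target_mean_x = 2.5   # assumed known population mean of x

lam = solve_tilt(x, target_mean_x)
w = tilted_weights(x, lam)
balanced_mean_x = sum(wi * xi for wi, xi in zip(w, x))
y_hat = sum(wi * yi for wi, yi in zip(w, y))
print(lam, balanced_mean_x, y_hat)
```

Because the sample mean of x (3.0) exceeds the target (2.5), the solved lambda is negative, which downweights units with large x.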

Sampling for network function learning
Li-Chun Zhang

To define what we call network functions, let us first envisage a valued graph, where the nodes represent the units and the edges the connections among them, and where both nodes and edges may additionally carry values. Any network function for a given unit must then be defined in terms of both the corresponding node and the nodes connected to it, as well as the associated values. A basic difficulty in learning such network functions arises when the edges of the graph are unknown to start with, even when the entire set of nodes is known, so that the edges can only be partly observed by sampling from the collection of nodes and edges, i.e. the graph. In this talk, we consider the feasibility of a graph sampling approach to network function learning, as well as the corresponding learning methods based on sample graphs.
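
To make the idea of partly observing edges through sampling concrete, the sketch below uses induced subgraph sampling: a simple random sample of nodes is drawn, and an edge is observed only if both of its endpoints are sampled. A Horvitz-Thompson estimator of the total number of edges then divides the observed count by the edge inclusion probability n(n - 1) / (N(N - 1)). The tiny graph and the function name are hypothetical, and this is a basic graph-total example rather than the network function learning method of the talk.

```python
from itertools import combinations


def ht_edge_total(edges, sample, n_nodes):
    """Horvitz-Thompson estimate of the number of edges from an induced subgraph."""
    s = set(sample)
    n = len(s)
    # An edge is observed only when both endpoints are in the node sample.
    observed = sum(1 for u, v in edges if u in s and v in s)
    # Probability that two given nodes are both in a simple random sample of size n
    pi_edge = n * (n - 1) / (n_nodes * (n_nodes - 1))
    return observed / pi_edge


# Hypothetical graph on 6 nodes with 6 edges
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
n_nodes = 6

# Enumerate every possible node sample of size 3 to check design-unbiasedness.
estimates = [ht_edge_total(edges, s, n_nodes)
             for s in combinations(range(n_nodes), 3)]
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate, len(edges))
```

Averaged over all possible node samples, the estimator recovers the true edge count, although any single sample may observe few or no edges.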

Organiser: Dr Maria Giovanna Ranalli

Chair: Dr Maria Giovanna Ranalli

Speaker: Ramón Ferri  

Speaker: Dr Paolo Righi

Speaker: Dr Jae Kwang Kim

Speaker: Li-Chun Zhang

Discussant: Dr Andrea Diniz da Silva
