IPS 421 - Data Science in Statistics: methodological and applied issues

Category: IPS

Thursday 20 July 10 a.m. - noon (Canada/Eastern) Room 105

Data science is of great and growing importance in several branches of statistics that use large data sets and new data sources, e.g., administrative registers, satellites and aircraft, webcams, data voluntarily provided by internet users, data harvested from the web, and so on. Analysing and processing these kinds of data requires data science methods and tools alongside “traditional” statistical methods. The applications of data science tools range from earth observation to official statistics, and the discussion of the advantages, disadvantages, limitations, and requirements of using alternative data sources integrated with probability sample surveys is informing the debate in national and international statistical systems all over the world.

This Invited Paper Session (IPS) focuses on the most relevant methodological and applied issues of data science: interpretability of machine learning tools, potential bias, integration of new data sources with sample surveys for improving official statistics, and analysis of huge amounts of meteorological and remote sensing data.

This IPS is proposed by the vice-chair and chair-elect of the ISI Special Interest Group on Data Science. It discusses methodological and applied issues and is balanced in terms of geography and gender.

Tree-based statistical learning techniques and explicative tools

Speaker: Rosanna Verde, Professor of Statistics - Università della Campania "Luigi Vanvitelli", Italy

Abstract. Machine learning tools are very popular in supervised classification when the number of observations and the number of variables are too large to predict a priori classes. However, the classification process is highly automated, which poses a challenge for widely established techniques. A useful contribution would be to provide interpretative and descriptive tools that, in addition to the accuracy of the prediction, allow us to understand the discriminating power of the descriptors selected as the most competitive in the construction of trees. For this reason, a criterion for recognising the predictors that contribute most to the separation of the a priori groups should be combined with an embedding procedure that seeks multiple solutions and a final compromise. Aids to the interpretation of tree-based functional classifiers are still an open frontier. Some contributions are advanced on the choice of the best transformation of functional data to capture the differences between the classes to be predicted in terms of slope or rates of change. Applications to real data, in the medical and environmental fields, make it possible to validate the proposed interpretative tools for tree-based classification methods.

Mining Text for Bias in Written Comments of Student Evaluations of Teaching

Speaker: Daniel Jeske, University of California, Riverside (USA)

Philip Kass, University of California, Davis

Herbie Lee, University of California, Santa Cruz

Dylan Friel, University of California, Riverside

Yunzhe Li, University of California, Santa Cruz

Abstract. We discuss alternative predictive models that efficiently scan written course comments and determine the proportions of comments that reflect student satisfaction levels that are positive, mixed, or negative. We use the predictive model to investigate the degree of potential bias in written comments with respect to the gender, ethnicity, and rank of the instructor, and compare the findings to parallel bias studies of the corresponding numerical scores.

Evolving Official Statistics: The Increasingly Varied Role of Data Science

Speaker: Linda J. Young, Chief Mathematical Statistician and Director, Research and Development Division, USDA NASS

Abstract. Sample surveys have been the foundation of official statistics produced by the US Department of Agriculture’s National Agricultural Statistics Service (NASS) and other National Statistical Institutes for more than half a century. Increasingly, information from diverse sources, such as administrative, weather, and remotely sensed data, is available and can be used to improve fully survey-based estimates. In addition, new products that inform official statistics can be developed, such as new metrics or maps of the scope and intensity of natural disasters. In this presentation, data science approaches that are being used in the production of official statistics are highlighted. Estimates of the propensity of response from a sampled unit have been incorporated in the sampling and data collection phases of surveys. Predictions of what crops will be grown where can inform editing processes. Survey and non-survey data have been combined through modeling to produce improved official statistics. The progress that has been made and important research questions that remain are discussed.

Spatio-temporal modelling of the Brazilian wildfires: The influence of human and meteorological variables

Speaker: Paulo Canas Rodrigues, Department of Statistics, Federal University of Bahia, Salvador, BA, Brazil

Abstract. Wildfires are one of the most common natural disasters in many world regions and actively impact quality of life. These events have become more frequent with the increasing effect of climate change, as well as local policies and human behaviour. This study considers the historical data with the geographical locations of all the “fire spots” detected by the reference satellites that cover the whole Brazilian territory between January 2011 and December 2020, comprising more than 1.8 million fire spots. These data were modelled with a spatial econometric model using meteorological variables (precipitation, air temperature, humidity, and wind speed) and a human variable (land-use transition and occupation) as covariates. We find that the change in land use from forest and green areas to farming has a significant positive impact on the number of fire spots for all six Brazilian biomes. (Joint work with Jonatha Pimentel and Rodrigo Bulhões)

Statistical Modelling alternatives to Machine Learning in complex survey data analysis

Speaker: Ross Darnell, Data61 CSIRO

Murray Aitkin, School of Mathematics and Statistics, University of Melbourne

Discussant: Elisabetta Carfagna, University of Bologna, Department of Statistical Sciences, Italy

Organiser: Prof. Elisabetta Carfagna

Chair: Prof. Elisabetta Carfagna

Speaker: Dr Paulo Canas Rodrigues

Speaker: Prof. Rosanna Verde

Speaker: Dr Daniel Jeske

Speaker: Ross Darnell

Speaker: Linda Young

Discussant: Elisabetta Carfagna