64th ISI World Statistics Congress - Ottawa, Canada

Big Data and Data Science in the Colombian National Statistical Office – DANE

Author

Leonardo Trujillo Oyola

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: IPS Abstract

Keywords: satellite data, discrimination, informal economy, machine learning, nighttime light, NLP, official statistics

Session: IPS 85 - Big Data in National Statistical Offices in Latin America and the Caribbean

Tuesday 18 July 2 p.m. - 3:40 p.m. (Canada/Eastern)

Abstract

Recent innovation with Big Data and Data Science methods at the Colombian National Statistical Office (DANE) falls into four main areas: (i) the application of machine learning techniques, (ii) the automation of statistical processes under the GSBPM model, (iii) the use of natural language processing (NLP) methods and (iv) the use of nighttime lights and satellite imagery.

In the first area, machine learning techniques were applied at DANE to produce more granular maps of the multidimensional poverty index (MPI). Traditionally, DANE has measured the MPI annually at the department level through household surveys, and every 10 years at the municipality level using census data. The goal is to measure the MPI annually at the municipality level using a Bayesian generalized linear mixed geostatistical model with geospatial covariates such as nighttime lights, a vegetation index and road accessibility to towns and cities, with the results published in interactive cartographic viewers. In addition, during the pandemic, the switch to telephone data collection made it impossible to measure the informal employment rate in the labour force survey. Informal employment refers to individuals who engage in productive activities that are not taxed or registered by the government. To fill the resulting gaps in the time series, a binary informal employment variable was imputed with a random forest classification model using administrative registers from the Colombian Ministry of Health.
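As a rough illustration of the imputation idea described above, the following Python sketch trains a random forest classifier on records where the informality flag is observed and predicts it where it is missing. It is a minimal sketch under stated assumptions, not DANE's implementation: the function `impute_informal`, the feature names and the synthetic data are all hypothetical, and scikit-learn and NumPy are assumed to be installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def impute_informal(features, informal, n_trees=200, seed=0):
    """Fill missing (NaN) entries of a 0/1 informality flag.

    A random forest is fitted on the rows where the flag is observed and
    used to predict the rows where it is missing. Observed values are
    returned unchanged.
    """
    features = np.asarray(features, dtype=float)
    informal = np.asarray(informal, dtype=float)
    observed = ~np.isnan(informal)

    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(features[observed], informal[observed].astype(int))

    completed = informal.copy()
    if (~observed).any():
        completed[~observed] = model.predict(features[~observed])
    return completed.astype(int)


# Synthetic stand-in data (hypothetical; not real survey or register data).
rng = np.random.default_rng(42)
n = 500
contributes_to_health = rng.integers(0, 2, size=n)  # hypothetical register flag
age = rng.integers(18, 65, size=n)
features = np.column_stack([contributes_to_health, age])

# Synthetic rule for the demo only: non-contributors are labelled informal.
informal = (contributes_to_health == 0).astype(float)

# Blank out roughly 30% of the flags to mimic the gap in the series.
missing = rng.random(n) < 0.3
informal_with_gaps = informal.copy()
informal_with_gaps[missing] = np.nan

completed = impute_informal(features, informal_with_gaps)
accuracy_on_missing = (completed[missing] == informal[missing]).mean()
```

In practice the predictors would come from linked administrative registers, and the imputation quality would be checked on periods where the flag was actually collected.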

Regarding the second area, a pilot to automate the GSBPM processes of an entire survey has started with the Monthly Production Inquiry, in order to reduce human error and repeated tasks across different areas.

In the third area, DANE calculated the Sustainable Development Goal 10 indicator on the proportion of the population reporting having personally felt discriminated against or harassed in the previous 12 months on the basis of a ground of discrimination prohibited under international human rights law. The calculation used public information from Facebook between June and October 2021, applying NLP zero-shot models to a probabilistic sample of Colombian Facebook users with discriminatory comments. DANE is also currently carrying out another NLP analysis of the words, sentences and context in around 30,000 customer support queries received in 2022. The results of these two exercises will be presented at the conference.

Regarding the fourth area, we measured changes in nighttime top-of-atmosphere (TOA) radiance between a date before the COVID-19 outbreak and a date during lockdown. Using an econometric model, we correlated these changes with shifts in the economic variables of the Monthly Production Inquiry for Colombian industries and manufacturers. In particular, we used nighttime light data from NASA's VNP46A1 product for 7 February 2020 and 27 April 2020.
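The nighttime light comparison can be sketched in a few lines of Python: compute the relative change in mean radiance per area between the two dates, then correlate it with the change in an economic variable. This is only an illustrative sketch. The area names and all numbers are made up, and a plain Pearson correlation stands in for the econometric model mentioned above, whose specification is not given here.

```python
from math import sqrt


def relative_change(before, during):
    """Relative change in mean radiance between two dates (negative = dimmer)."""
    mean_before = sum(before) / len(before)
    mean_during = sum(during) / len(during)
    return (mean_during - mean_before) / mean_before


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical pixel radiances per area on the two dates, plus a hypothetical
# relative change in industrial production for the same area.
areas = {
    "area_a": ([40.0, 42.0, 38.0], [30.0, 31.0, 29.0], -0.28),
    "area_b": ([20.0, 20.0, 20.0], [18.0, 18.0, 18.0], -0.12),
    "area_c": ([10.0, 12.0, 8.0], [10.0, 11.0, 9.0], -0.02),
    "area_d": ([50.0, 50.0], [30.0, 30.0], -0.35),
}

light_changes = [relative_change(b, d) for b, d, _ in areas.values()]
production_changes = [p for _, _, p in areas.values()]
r = pearson(light_changes, production_changes)
```

With real data, the radiance values would be read from the gridded product for each date and aggregated over the footprint of each industrial area before any comparison is made.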