Big Data applied for automatic coding of censuses and surveys

Author

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: IPS Abstract

Keywords: artificial intelligence

Abstract

Artificial Intelligence gives a computer program the ability to think and learn on its own. It simulates human intelligence in machines so that they can perform tasks that humans do, with a significant reduction in errors and risks. With this in mind, the Coding System described here can automatically assign codes to variables based on their descriptors. The aim is to improve decision making in future coding rounds and to increase data processing capacity, which accelerates the production of statistical data and so shows the country’s reality within a short time span.
To describe how the system works, keep in mind that it builds on previously coded phases, so that when a case recurs, it is resolved in the same way as before. The system implements a tool called the Tabla Maestra (master table). This table contains a list of previously coded descriptors, together with a list of words (synonyms, abbreviations, acronyms and misspellings) that take part in the coding process.
To create the Tabla Maestra, conjunctions, articles, punctuation marks, quotation marks, periods and any other non-letter symbols are deleted. Then the selection and coding of field descriptors begins, covering branch of economic activity, occupation, university degree, ethnicity and language; these data are obtained from the Census and Household Surveys. Once all the data have been obtained, the dictionaries are created. This process includes selecting words with misspellings, abbreviations and acronyms, as well as compound words and synonyms of the descriptors that were obtained from field work.
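The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize` and the stop-word list are hypothetical, and the list assumes Spanish-language descriptors, which the abstract does not state explicitly.

```python
from typing import List

# Hypothetical list of articles and conjunctions to drop (assumes Spanish text).
STOP_WORDS = {"el", "la", "los", "las", "un", "una", "unos", "unas",
              "y", "e", "o", "u", "ni"}


def normalize(descriptor: str) -> List[str]:
    """Lower-case a raw descriptor, drop non-letter symbols and stop words."""
    lowered = descriptor.lower()
    # Replace every character that is not a letter (punctuation, quotes,
    # digits, ...) with a space; accented letters count as letters and stay.
    letters_only = "".join(ch if ch.isalpha() else " " for ch in lowered)
    return [word for word in letters_only.split() if word not in STOP_WORDS]


print(normalize('Venta de "frutas" y verduras.'))
```

The result is a list of words, which is a convenient form for the word-by-word comparison used later in the coding step.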
The captured data are then coded by their lexemes, using one of two methods: Equal or Similar. If all the words match those of a coded counterpart, the method is Equal; if the words are similar, are synonyms, or appear in a different order, the method is Similar.
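A minimal sketch of the two methods, assuming the descriptor has already been split into cleaned words. The table contents, the codes `A01` and `B02`, the names `MASTER`, `VARIANTS` and `code_descriptor`, and the `"Uncoded"` fallback are all hypothetical. The Similar method is approximated here by replacing known variants with their canonical word and ignoring word order; the actual system may measure similarity differently.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical master table: previously coded descriptors -> made-up codes.
MASTER: Dict[Tuple[str, ...], str] = {
    ("venta", "frutas"): "A01",
    ("profesor", "primaria"): "B02",
}

# Hypothetical dictionary of synonyms, abbreviations and misspellings.
VARIANTS: Dict[str, str] = {
    "vta": "venta",         # abbreviation
    "frutaz": "frutas",     # misspelling
    "maestro": "profesor",  # synonym
}


def canonical(words) -> Tuple[str, ...]:
    """Map variants to canonical words and sort, so word order is ignored."""
    return tuple(sorted(VARIANTS.get(w, w) for w in words))


# Index of the master table keyed by the order-independent canonical form.
SIMILAR_INDEX = {canonical(key): code for key, code in MASTER.items()}


def code_descriptor(words: List[str]) -> Tuple[Optional[str], str]:
    """Return (code, method); method is 'Equal', 'Similar' or 'Uncoded'."""
    key = tuple(words)
    if key in MASTER:                      # every word matches, same order
        return MASTER[key], "Equal"
    similar_key = canonical(words)
    if similar_key in SIMILAR_INDEX:       # variants or different word order
        return SIMILAR_INDEX[similar_key], "Similar"
    return None, "Uncoded"                 # assumption: left for manual review


print(code_descriptor(["venta", "frutas"]))
print(code_descriptor(["frutas", "vta"]))
print(code_descriptor(["piloto"]))
```

Because a repeated descriptor always reaches the same table entry, it always receives the same code, which is the consistency property the system relies on.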
The system is implemented on a relational database using Spark SQL and MLlib, a machine learning library usable from Python that provides learning algorithms such as classification, regression, clustering and collaborative filtering for data handling and statistics.
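In relational terms, the exact-match lookup amounts to a join between the captured descriptors and the Tabla Maestra. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL so that it runs without a cluster; the table names, column names, sample rows and codes are hypothetical and do not reflect the system's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table of coded descriptors, one of captured data.
conn.execute("CREATE TABLE tabla_maestra (descriptor TEXT PRIMARY KEY, code TEXT)")
conn.execute("CREATE TABLE captured (record_id TEXT, descriptor TEXT)")

conn.executemany(
    "INSERT INTO tabla_maestra VALUES (?, ?)",
    [("venta frutas", "A01"), ("profesor primaria", "B02")],
)
conn.executemany(
    "INSERT INTO captured VALUES (?, ?)",
    [("r1", "venta frutas"), ("r2", "piloto"), ("r3", "profesor primaria")],
)

# LEFT JOIN keeps every captured record; unmatched ones get a NULL code.
rows = conn.execute(
    """
    SELECT c.record_id, c.descriptor, m.code
    FROM captured AS c
    LEFT JOIN tabla_maestra AS m ON m.descriptor = c.descriptor
    ORDER BY c.record_id
    """
).fetchall()

for row in rows:
    print(row)
```

Records whose code comes back as `None` are the ones that an exact match could not resolve.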
During development, prototypes were built. These included relational database designs in the early stages, as well as comparisons of the system's results against those obtained through manual coding, in order to verify the accuracy of the coded information.
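One simple way to run such a comparison is to compute the share of records on which the automatic and manual codes agree, and to list the records where they differ. The sketch below assumes both results are available as mappings from a record identifier to a code; the function name `agreement` and the sample data are hypothetical, and the abstract does not say which accuracy measure was actually used.

```python
from typing import Dict, List, Tuple


def agreement(auto: Dict[str, str], manual: Dict[str, str]) -> Tuple[float, List[str]]:
    """Return the agreement rate and the record ids coded differently."""
    # Only records present in both codings can be compared.
    shared = sorted(set(auto) & set(manual))
    disagreements = [rid for rid in shared if auto[rid] != manual[rid]]
    if not shared:
        return 0.0, disagreements
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Made-up example data, for illustration only.
auto_codes = {"r1": "A01", "r2": "B02", "r3": "A01", "r4": "C03"}
manual_codes = {"r1": "A01", "r2": "B02", "r3": "B02", "r4": "C03"}

rate, differing = agreement(auto_codes, manual_codes)
print(rate, differing)
```

The list of differing records is often more useful than the rate itself, since each disagreement can be reviewed to decide whether the automatic or the manual code was wrong.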
After the deployment and application of this Automatic Coding System, both manual coding time and the error rate were reduced significantly. With manual coding, not all similar cases were resolved or catalogued in the same way; the application of Artificial Intelligence established a standard. This standardized Tabla Maestra is also applicable to future surveys and censuses that are coded in a similar manner.