64th ISI World Statistics Congress - Ottawa, Canada

Deep Learning for Mexican Industry and Occupation coding

Author

Jael Perez

Co-author

  • Alejandro Ruíz
  • Alejandro Pimentel

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: IPS Abstract

Session: IPS 200 - Challenges of Natural Language Processing techniques in official statistics

Tuesday 18 July 10 a.m. - noon (Canada/Eastern)

Abstract

A common approach to partially automating coding is a rule scheme that maps text to codes: if a response contains words or terms associated with an occupation in a predefined dictionary, the corresponding code is assigned to the text. Automatic methods reduce costs, but fully automatic coding has not been achieved. In partial coding, simple responses are coded automatically, while difficult responses are coded manually. A confidence metric produced by the automatic system is used to distinguish between the two (Gweon et al. 2017b, 2017a).
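The dictionary rule and the confidence-based split between automatic and manual coding can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the keyword dictionary, the placeholder codes, the function names, and the 0.9 threshold are hypothetical and are not taken from INEGI's catalogs or production system.

```python
from typing import Optional, Tuple

# Hypothetical dictionary mapping keywords to placeholder occupation codes.
# These are NOT codes from any real classification catalog.
KEYWORD_TO_CODE = {
    "teacher": "OCC-01",
    "nurse": "OCC-02",
    "driver": "OCC-03",
}


def rule_based_code(text: str) -> Optional[str]:
    """Return the code of the first dictionary keyword found in the text, else None."""
    for token in text.lower().split():
        # Strip trailing punctuation so "teacher." still matches "teacher".
        code = KEYWORD_TO_CODE.get(token.strip(".,;:"))
        if code is not None:
            return code
    return None


def route(predicted_code: str, confidence: float,
          threshold: float = 0.9) -> Tuple[str, Optional[str]]:
    """Partial coding: keep the automatic code only when confidence reaches the threshold.

    The threshold value is an assumed example, not a recommended setting.
    """
    if confidence >= threshold:
        return ("automatic", predicted_code)
    # Low-confidence responses are sent to human coders without a code.
    return ("manual", None)
```

For example, `rule_based_code("Primary school teacher.")` matches the keyword `teacher`, while a response with no dictionary keyword returns `None` and would need another coding route. In `route`, raising the threshold sends more records to manual coding in exchange for fewer automatic errors.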

Given the frequency, volume, and resources required for these tasks, statistical offices,
including the Mexican institute of statistics (INEGI), develop and improve automatic and
manual coding processes. Recently, building on technological and methodological
innovations such as artificial intelligence (AI) algorithms, statistical offices have designed
processes that incorporate these advances, allowing greater automation. The relevance of
this work, not only for saving resources but also for maintaining quality in coding processes,
has led INEGI to continue its research on the subject and to develop a comprehensive
operational proposal that can be adapted to current production processes.

In this document we present new developments, proposals, and analyses for implementing
AI algorithms in production processes. The quality of the resulting coding is superior to
that achieved so far; this result was achieved thanks to advances in three key
components:

• Design and deployment of AI algorithms for Natural Language Processing.
• More efficient use of the certainty metric.
• Evaluation by experts in the task.

The document is divided into 9 sections.
In Section 2 we outline the coding catalogs that are used for the classification of industries
and occupations. Section 3 covers related work, showing the strategies followed by
statistical offices in other countries. In Section 4 we explain the coding process currently in
place at the Mexican Statistical Institute, which we intend to improve with the proposals
described throughout this article. In Section 5 we introduce the new coding framework,
which integrates the additional stage corresponding to the Deep Learning (DL) algorithm,
and present its overall results. In Section 6 we design and evaluate two strategies for
applying the certainty metric as a mechanism to filter and select records to be coded by the
DL algorithm. In Section 7 we attempt to estimate the savings that these new
methodologies will bring to the institute. In Section 8 we present an evaluation exercise
carried out by experts in the field on those records where the code assigned by the DL
algorithm does not match the one assigned by the hired personnel; this exercise helps us
understand the quality differences between the two coding processes.
In Section 9 we conclude.