64th ISI World Statistics Congress - Ottawa, Canada

IPS 96 - Computing in the Modern Statistical Office

Category: IPS
Tuesday 18 July, 10 a.m. - noon (Canada/Eastern), Room 105

The statistical production process relies heavily on (statistical) computing throughout all process steps. Historically, ‘official’ software tools were developed and maintained by development units within IT departments, while much processing was automated ‘under the radar’ by users, for example via the programmability of office tools and batch scripts. This user-driven automation took off further with the introduction of capable workstation statistical software packages such as SAS and SPSS, and is now changing even more rapidly with the use of powerful open-source languages such as R and Python.

Nowadays, these open-source tools are not limited to basic statistical tasks. They live in a rich ecosystem with professional software engineering features, including libraries, (unit) testing and documentation standards, CI facilities, and so on. Current open-source statistical tools go far beyond simple, stand-alone scripts: they offer the ability to build services, web applications, or standalone applications that integrate with a modern IT infrastructure, and can thereby play an important role in making production systems more cost-efficient.
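To make the unit-testing point concrete, here is a minimal sketch in R using the testthat package; the function and its imputation rule are hypothetical illustrations, not part of any system presented in this session.

    # A minimal sketch of unit testing a data-editing function in R.
    # 'impute_turnover' and its rule are hypothetical illustrations.
    library(testthat)

    # Deliberately simple rule: replace missing turnover values with zero.
    impute_turnover <- function(x) {
      stopifnot(is.numeric(x))
      x[is.na(x)] <- 0
      x
    }

    test_that("impute_turnover replaces NA with 0 and keeps other values", {
      expect_equal(impute_turnover(c(1, NA, 3)), c(1, 0, 3))
      expect_equal(impute_turnover(numeric(0)), numeric(0))
      expect_error(impute_turnover("not numeric"))
    })

Such tests can run automatically on every code change via the CI facilities mentioned above, which is what lifts a stand-alone script towards maintainable production software.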

As a consequence, decentralized script writing, which has typically been tolerated rather than officially sanctioned and supported in statistical organizations, is rapidly transforming into professional software engineering. However, statistical organizations are still coming to terms with this new reality. Decentralized software engineering means that merely allowing script writing is not enough: support is needed in the areas of HR management, IT infrastructure, and the planning and organization of statistical production processes.

In this session, we cover the topic of computing in the statistical office from four different perspectives, each in a 10-minute presentation. The discussant introduces the topic (5 min) and, after the presentations, chairs a panel discussion with input from the audience.

Culture transformation and democratization of programming (Kate Burnett-Isaacs)

Statistics Canada is using data science and modern analytics as key methods to provide more value in the products and services it delivers to Canadians. Statistics Canada has been on a journey of adopting R and Python into its mainstream statistical processes since 2020. Fully integrating these open-source technologies and associated modern methods into the day-to-day production pipelines of official statistics is about more than technology: it requires a cultural transformation. Statistics Canada approached open-source adoption by providing tools, support, and resources to programmers from a variety of backgrounds and disciplines that extend beyond traditional roles. This cultural shift is being driven intentionally through intersecting pillars that include new governance models, partnerships, and employee empowerment. This presentation outlines the critical role that democratizing programming has played in incorporating open source into production; the contributions of governance, partnerships, people, and investment to effecting this transformation; the progress achieved so far; and the lessons learned along the way.

Computational competencies and skills of the modern statistician (Mark van der Loo)

Computing with data is at the heart of the statistical office. Yet the area of technical computing often falls between the two stools of data analysts and IT developers. In this presentation I demonstrate the importance of various computational skills across the Generic Statistical Business Process Model. As we will see, many of these skills are as specific to working with data as they are uncommon in typical IT development teams. However, I argue that there is also common ground between the worlds of computing with data and IT development: the field of software engineering, the range of techniques and methods aimed at ensuring that software meets quality standards such as maintainability, usability, and performance. I argue that statistical offices have a natural need for the role of ‘Research Software Engineer’ (RSE): a person who develops software based on an understanding of the research goals. An RSE combines an understanding of data-processing goals, methods, and technologies with competence in software engineering, thereby filling the gap between the typical competence areas of statistical analysts and IT developers.

DevOps for the statistical production process? (Alexander Kowarik)

To modernize statistical production within an organization producing official statistics, lessons can be learned from so-called DevOps practices in systems development. Several principles translate easily from a traditional software development project, for example ‘Automate Everything’. Others need adaptation: ‘Testing’, for instance, can be extended with steps specific to official statistics production, such as ‘Observing Quality Dimensions’. Full reproducibility is an important goal for official statistics, and it imposes requirements not only on code versioning but also on data management, so that data can be accessed in a versioned manner. An implementation at Statistics Austria of a continuous integration / continuous delivery (CI/CD) pipeline based on R is presented.
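To illustrate what an automated ‘Observing Quality Dimensions’ step might look like in such a pipeline, here is a small R sketch of a quality gate that a CI/CD runner could execute; the data, column names, and thresholds are invented for illustration and are not the Statistics Austria implementation.

    # check_quality.R: sketch of a quality gate for a CI/CD pipeline step.
    # The toy data, column names, and thresholds are hypothetical; a real
    # job would load versioned production data instead.
    dat <- data.frame(
      turnover  = c(120, 80, 95, 300),
      employees = c(5, 12, 3, 40)
    )

    checks <- c(
      completeness = mean(!is.na(dat$turnover)) >= 0.95,  # at most 5% missing
      validity     = all(dat$employees >= 0),             # no negative counts
      plausibility = all(dat$turnover < 1e6)              # crude outlier guard
    )

    print(checks)
    if (!all(checks)) {
      # stop() yields a non-zero exit status, which fails the CI job.
      stop("Quality checks failed: ",
           paste(names(checks)[!checks], collapse = ", "))
    }

Run as ‘Rscript check_quality.R’ in a pipeline job: a failing check halts the pipeline before flawed data reaches dissemination, mirroring how failing unit tests block a software release.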

Redesign of a statistical production process according to modern architecture principles (Mauro Bruno)

The rise of Big Data and, more generally, of new data sources (social media, satellite images, sensor data, etc.) is rapidly changing the context of official statistics. National Statistical Institutes (NSIs) are obliged to change the way they perform their core business, i.e., the production of relevant, timely, and high-quality statistical outputs.

In traditional surveys, domain experts, researchers, and IT experts work on tools and methods tailored to a specific statistical domain, so it is very difficult, if not impossible, to share code and methods. In order to remain relevant and attractive to young researchers, NSIs should perform a ‘mind shift’ both at the organizational and the technological level. Domain experts, researchers, and IT experts should start working closely on methods shared across NSIs or, even better, at the international level. Furthermore, they should align with modern architecture principles that foster cross-domain interaction, e.g., GitHub for code versioning, microservices to access algorithms through APIs, continuous integration and continuous delivery (CI/CD) tools to automate code releases to production environments, and cloud architecture for installation and deployment.

Within this context, Istat launched a set of ‘Experimental Statistics’ a few years ago, aimed at experimenting with the use of new data sources and the application of innovative methods in producing data. In this paper we focus on ‘Cosmopolitics’, a new experimental statistic designed and implemented according to the architecture principles highlighted above.
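As a purely illustrative sketch of the ‘algorithms accessible through APIs’ principle mentioned above, the snippet below uses the R plumber package to expose a simple outlier-detection rule as an HTTP microservice; the endpoint, parameter, and detection rule are invented for this example.

    # api.R: sketch of exposing a statistical method as a microservice
    # with the R 'plumber' package. Endpoint and rule are hypothetical.

    #* Flag outliers in a comma-separated numeric series using a simple
    #* median/MAD rule.
    #* @param x Comma-separated numeric values, e.g. "10,12,11,400"
    #* @get /outliers
    function(x = "") {
      v <- as.numeric(strsplit(x, ",", fixed = TRUE)[[1]])
      score <- abs(v - median(v)) / mad(v)  # robust z-score
      list(values = v, outlier = score > 3.5)
    }

    # Launch the service (e.g. inside a container):
    #   plumber::plumb("api.R")$run(host = "0.0.0.0", port = 8000)

Once deployed, any domain application can call GET /outliers?x=10,12,11,400 without caring which language implements the method, which is exactly the cross-domain sharing these architecture principles aim for.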

Organiser: Dr Alexander Kowarik 

Chair: Dr Magdalena Six 

Speaker: Kate Burnett-Isaacs 

Speaker: Dr Mark van der Loo 

Speaker: Dr Alexander Kowarik 

Speaker: Dr Mauro Bruno 

Discussant: Dr Anders Holmberg
