64th ISI World Statistics Congress - Ottawa, Canada

Clustering for utility assessment of anonymized data

Author

Maria Eugenia Ferrao

Co-author

  • Paulo Fazendeiro
  • Paula Prata

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: CPS Abstract

Keywords: anonymization, cluster-validation, differential privacy

Abstract

Following a previous study (Ferrão et al., 2022a, 2022b) on clustering as an auxiliary tool for identifying groups of special interest, this work applies the methodology to larger datasets, increasing both cardinality and dimensionality. Cluster solutions obtained under several anonymization scenarios are compared with the cluster solution of the original data. Differential privacy is used as the data anonymization process.
We propose a data utility assessment based on the agreement between the original data structure and the anonymized structures. Data utility is quantified by standard metrics, characteristics of the groups obtained, and other statistics.
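As an illustration of what agreement between two cluster solutions can mean, the sketch below computes the adjusted Rand index between the labels obtained on the original data and those obtained on an anonymized version. The abstract does not name the agreement metrics used, so the choice of this index, the function name `adjusted_rand_index`, and the toy labels are assumptions made for the example only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same records.

    1.0 means identical partitions (up to relabelling); values near 0
    mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two records: nothing to disagree on

    # Contingency counts: records in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of record pairs placed together in both / in A / in B.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


# Toy labels (hypothetical, not taken from the study).
original_labels = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]         # same structure, different names
anonymized_labels = [0, 0, 1, 1, 1, 1]  # one record changed cluster

ari_same = adjusted_rand_index(original_labels, relabelled)
ari_anon = adjusted_rand_index(original_labels, anonymized_labels)
```

Because the index is invariant to cluster relabelling, it can compare solutions computed independently on the original and the anonymized data, where cluster identifiers have no shared meaning.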
We use partitional and hierarchical clustering algorithms as the gold standard. Several clustering validity indices are analyzed to assess the extent to which the data structure is preserved.
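To make the notion of a clustering validity index concrete, here is a minimal sketch of one common internal index, the mean silhouette width, using Euclidean distance. The abstract does not say which indices are analyzed; the silhouette, the function name `mean_silhouette`, and the toy points are assumptions chosen for illustration.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette width of a labelled dataset (Euclidean distance).

    Values close to 1 indicate compact, well-separated clusters.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy points (hypothetical): two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
poor_labels = [0, 1, 0, 1, 0, 1]  # labels that ignore the structure

s_good = mean_silhouette(points, good_labels)
s_poor = mean_silhouette(points, poor_labels)
```

Computing such an index on the original data and on each anonymized version, and comparing the values, is one possible way to gauge how much of the cluster structure survives anonymization.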
This is work in progress. Results so far are promising: the preliminary findings suggest that the cluster structure of datasets with more records is more resilient to anonymization.

Ferrão, M. E., Prata, P., & Fazendeiro, P. (2022a). Anonymized higher education data for “Utility-driven assessment of anonymized data via clustering.” osf.io/9vgeh
Ferrão, M. E., Prata, P., & Fazendeiro, P. (2022b). Utility-driven assessment of anonymized data via clustering. Scientific Data, 9(456), 1–11. https://doi.org/10.1038/s41597-022-01561-6