Clustering for utility assessment of anonymized data
Conference
64th ISI World Statistics Congress - Ottawa, Canada
Format: CPS Abstract
Keywords: anonymization, cluster-validation, differential privacy
Abstract
Following a previous study (Ferrão et al., 2022a, 2022b) on clustering as an auxiliary tool to identify groups of special interest, this work illustrates the application of the methodology to large datasets, increasing both cardinality and dimensionality. The cluster solutions obtained under several anonymization scenarios are compared with the cluster solution of the original data. Differential privacy is applied as the data anonymization process.
We propose a data utility assessment based on the agreement between the original data structure and the anonymized structures. Data utility is quantified by standard metrics, characteristics of the groups obtained, and other statistics.
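As a minimal sketch of this agreement-based assessment, the Python example below clusters a dataset before and after perturbation and compares the two partitions with the adjusted Rand index (ARI). Everything specific here is an assumption made for illustration and is not taken from the study: the synthetic data, the per-feature Laplace noise used as a stand-in for the differential privacy mechanism, the epsilon values, the use of k-means with three clusters, and the helper names `cluster_labels` and `laplace_perturb`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the original data: three well-separated groups.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)
# Rescale each feature to [0, 1] so that its range (sensitivity) is 1.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))


def cluster_labels(data, k=3):
    """Hypothetical helper: partition the data with k-means."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)


def laplace_perturb(data, epsilon, sensitivity=1.0):
    """Hypothetical helper: add Laplace noise with scale sensitivity/epsilon
    to every value (assumed anonymization step, epsilon applied per feature)."""
    scale = sensitivity / epsilon
    return data + rng.laplace(loc=0.0, scale=scale, size=data.shape)


labels_original = cluster_labels(X)

# Agreement between the original partition and each anonymized partition.
agreement = {}
for epsilon in (0.5, 5.0, 50.0):
    labels_anon = cluster_labels(laplace_perturb(X, epsilon))
    agreement[epsilon] = adjusted_rand_score(labels_original, labels_anon)
    print(f"epsilon={epsilon:>5}: ARI={agreement[epsilon]:.3f}")
```

An ARI close to 1 indicates that the anonymized data yield nearly the same groups as the original data, while values near 0 indicate agreement no better than chance. The ARI is invariant to label permutations, so the cluster numbering of the two solutions does not need to match.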
We use partitional and hierarchical clustering algorithms as the gold standard. Several clustering validity indices are analyzed to understand the extent to which the data structure is preserved.
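The comparison of validity indices across algorithms and datasets could look like the Python sketch below. The abstract does not name the indices or algorithms used, so the choices here are assumptions: silhouette, Calinski-Harabasz, and Davies-Bouldin as internal indices, k-means as the partitional algorithm, Ward-linkage agglomerative clustering as the hierarchical one, and synthetic data with added Laplace noise standing in for the anonymized version. The helper name `validity_indices` is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(1)

# Synthetic original data and a noisy copy standing in for anonymized data.
X_original, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                           cluster_std=1.0, random_state=1)
X_anonymized = X_original + rng.laplace(scale=2.0, size=X_original.shape)

# One partitional and one hierarchical algorithm (assumed choices).
algorithms = {
    "kmeans": lambda: KMeans(n_clusters=3, n_init=10, random_state=0),
    "ward": lambda: AgglomerativeClustering(n_clusters=3, linkage="ward"),
}


def validity_indices(data, labels):
    """Hypothetical helper: internal validity indices for one partition."""
    return {
        "silhouette": silhouette_score(data, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(data, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(data, labels),        # lower is better
    }


results = {}
for data_name, data in (("original", X_original), ("anonymized", X_anonymized)):
    for algo_name, make_model in algorithms.items():
        labels = make_model().fit_predict(data)
        results[(data_name, algo_name)] = validity_indices(data, labels)

for key, indices in results.items():
    print(key, {name: round(value, 3) for name, value in indices.items()})
```

Comparing each index between the original and anonymized rows, per algorithm, gives one view of how much cluster compactness and separation are lost to anonymization.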
This is work in progress, and the results obtained so far are promising. The preliminary findings suggest that datasets with more records preserve their structure better under anonymization.
Ferrão, M. E., Prata, P., & Fazendeiro, P. (2022a). Anonymized higher education data for “Utility-driven assessment of anonymized data via clustering.” osf.io/9vgeh
Ferrão, M. E., Prata, P., & Fazendeiro, P. (2022b). Utility-driven assessment of anonymized data via clustering. Scientific Data, 9(456), 1–11. https://doi.org/10.1038/s41597-022-01561-6