64th ISI World Statistics Congress - Ottawa, Canada

Application of PCA and K-Means clustering to detect Autistic Spectrum Disorder


Mohsen Farid


Sarwat Qureshi


Format: CPS Paper

Session: CPS 06 - Clustering

Monday 17 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)


Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that manifests as difficulties with communication and social skills and as repetitive, stereotypical behaviors. Caregivers, psychiatrists, and clinicians carry out the screening and diagnosis process using gold-standard tools. These tools are often criticized for being too lengthy and time-consuming to analyze. There is a need to determine which features of these tools are most significant, so that autistic traits can be detected accurately and efficiently in the broader population.

Existing screening tools, such as the Quantitative Checklist for Autism in Toddlers (Q-CHAT) and the Autism Quotient (AQ), depend heavily on the sum score of all items to evaluate the symptoms and severity of ASD among toddlers, adolescents, and adults. The handcrafted cut-off scores used for screening are subjective and therefore open to debate. Improving the screening process for ASD and making it accessible to users is thus paramount.

This paper aims to detect ASD symptoms in toddlers robustly and quickly by applying Principal Component Analysis (PCA) and K-means clustering to Q-CHAT responses. It further provides recommendations for determining the severity of ASD in toddlers.

The study employs a relatively large dataset of 1016 children aged 16 to 36 months. The dataset contains four groups: (1) toddlers who are typically developing, (2) toddlers whose parents report ASD-specific concerns, (3) toddlers at risk for autism due to having an older sibling with ASD, and (4) toddlers with a developmental delay.

Q-CHAT consists of 25 questions (items) measuring different ASD traits, each answered on a five-point Likert scale scored [0, ..., 4]. About half of the questions are reverse-scored, i.e. [4, ..., 0].
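
The scoring scheme described here can be illustrated with a minimal sketch. It assumes answers are coded 0 to 4 and that a reversed item is scored as 4 minus the answer. The set of reversed items used below is hypothetical and is not the actual Q-CHAT scoring key.

```python
MAX_SCORE = 4  # highest value on the 0..4 Likert scale


def score_items(responses, reversed_items):
    """Convert raw answers (0..4) to item scores, flipping reversed items."""
    scored = []
    for index, answer in enumerate(responses):
        if not 0 <= answer <= MAX_SCORE:
            raise ValueError(f"answer {answer} at item {index} is outside 0..{MAX_SCORE}")
        # A reversed item maps 0 -> 4, 1 -> 3, ..., 4 -> 0.
        scored.append(MAX_SCORE - answer if index in reversed_items else answer)
    return scored


def sum_score(responses, reversed_items):
    """Total score over all items, the quantity a cut-off rule is applied to."""
    return sum(score_items(responses, reversed_items))


# Hypothetical key: every second item is reversed (an assumption for illustration).
hypothetical_reversed = set(range(0, 25, 2))
example_responses = [2] * 25  # made-up answers for one toddler
example_total = sum_score(example_responses, hypothetical_reversed)
```

With 25 items scored 0 to 4, the sum score ranges from 0 to 100, which is the single number that a cut-off rule reduces the whole questionnaire to.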

PCA identified a reduced-dimensional space of Q-CHAT questions that performs almost as well as the original questionnaire in detecting ASD. Using this reduced set of Q-CHAT items, PCA also showed no difference between groups (2), (3), and (4). Results also suggest no gender difference, as reported in some literature.
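
As an illustration of the dimensionality reduction step, the following is a minimal PCA sketch using NumPy's singular value decomposition. It is not the authors' implementation: the response matrix is synthetic random data, and the choice of five components is an arbitrary assumption made for the example.

```python
import numpy as np


def pca(data, n_components):
    """Project data onto its leading principal components via SVD."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)  # PCA requires mean-centered columns
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]  # rows are unit-length loading vectors
    variances = singular_values**2 / (data.shape[0] - 1)
    explained_ratio = variances[:n_components] / variances.sum()
    scores = centered @ components.T  # coordinates in the reduced space
    return scores, components, explained_ratio


# Synthetic stand-in for a questionnaire: 200 respondents, 25 items scored 0..4.
rng = np.random.default_rng(0)
answers = rng.integers(0, 5, size=(200, 25))

scores, components, explained_ratio = pca(answers, n_components=5)

# Items with the largest absolute loading on the first component are the ones
# that contribute most to it; here this only demonstrates the mechanics.
top_items = np.argsort(-np.abs(components[0]))[:5]
```

Inspecting the loadings in `components` is one common way to judge which questionnaire items carry most of the variance, which is the kind of reasoning behind selecting a reduced item set.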

K-means clustering was employed to separate ASD from No-ASD cases in the dataset. The findings reveal that K-means effectively distinguished toddlers who are autistic from those who are not. They also indicate that relying solely on a sum-score cut-off over all questions in the original instrument is not the best way to identify toddlers with ASD.
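
The two-cluster idea can be sketched with a plain implementation of Lloyd's algorithm, the standard K-means procedure. This is an illustrative sketch only: the points are synthetic two-dimensional values standing in for reduced-space scores, and the initialisation, seed, and iteration limit are assumptions rather than settings taken from the study.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Synthetic data: two well-separated groups of 30 points each.
data_rng = random.Random(1)
group_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]
group_b = [(data_rng.gauss(8, 1), data_rng.gauss(8, 1)) for _ in range(30)]
labels, centroids = k_means(group_a + group_b, k=2)
```

Cluster labels from K-means are arbitrary (0 or 1), so mapping a cluster to ASD or No-ASD requires comparing the clusters against known group membership or clinical information.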

This paper is one of a series of articles implementing Machine Learning and Deep Learning algorithms to detect autistic traits.