64th ISI World Statistics Congress - Ottawa, Canada

Modelling the Survival Status of Breast Cancer: A Machine Learning Approach.


Dr Serifat Adedamola Folorunso


  • Richard Kehinde


Format: CPS Abstract

Keywords: high-dimensional, survival function

Session: CPS 22 - Survival statistics

Monday 17 July 4 p.m. - 5:25 p.m. (Canada/Eastern)


The value of machine learning in clinical trials and cohort studies should not be underestimated: most of the data generated are high-dimensional, censored, heterogeneous, and frequently incomplete, which poses difficulties for conventional statistical analysis. Alternative techniques are therefore needed to model such complex data and circumvent these constraints.
This study applied machine learning techniques to a breast cancer dataset. The METABRIC dataset, obtained from Kaggle, contains records for 2,509 distinct breast cancer patients.
The study aimed to build a machine learning model to predict overall survival status, using forward filling to handle missing observations, and to visualize the survival probabilities of breast cancer patients over the months following diagnosis.
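
The two preprocessing and visualization steps named here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the missing-value step is read as forward filling (as in pandas `ffill`), the abstract does not name the survival estimator so Kaplan-Meier is assumed as the usual choice, and the function names and toy values are hypothetical.

```python
from collections import Counter


def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading None entries stay None because nothing precedes them.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each observed event time.

    durations: follow-up time in months for each patient.
    events: 1 if death was observed, 0 if the patient was censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # deaths plus censorings at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy values, not taken from METABRIC.
tumour_size = forward_fill([None, 3, None, None, 7, None])
# tumour_size is [None, 3, 3, 3, 7, 7]

months = [5, 8, 8, 12, 15, 20]
died = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(months, died):
    print(f"month {t}: S(t) = {s:.3f}")
```

Forward filling depends on row order, so it carries a value from whichever record happens to precede the gap; for unordered patient records that is a design choice worth stating alongside the results.
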
The study used supervised learning: the data were divided into training and testing sets, and two classification models, the Random Forest Classifier (RFC) and the Logistic Regression model, were built to predict breast cancer survival status.
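
The division into training and testing data can be illustrated with a small holdout split. The abstract does not give the split ratio or say whether the split was stratified, so the 80/20 ratio, the stratification by survival status, the seed, and the name `stratified_split` are all assumptions made for this sketch.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train and test sets, preserving class balance.

    labels: one class label per row (e.g. 1 = deceased, 0 = living).
    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 60 deceased and 40 living patients.
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Each classifier would then be fitted on the training rows only and scored on the test rows, so that the reported accuracy reflects data the model has not seen.
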
Age at diagnosis ranged from 21.9 to 96.3 years, with a mean of 60.4 years. A total of 2,506 patients were diagnosed with breast cancer, while three patients were diagnosed with breast sarcoma.
The most prevalent histological subtype is invasive ductal carcinoma (IDC), which accounts for 1,865 cases of breast cancer.
Logistic Regression achieved an accuracy of 0.828, compared with 1.000 for the Random Forest Classifier; the RFC therefore had the higher accuracy score and was used for prediction.
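
The comparison above rests on the accuracy score, the fraction of patients whose predicted status matches the recorded one. The sketch below shows that calculation and the selection of the higher-scoring model. The abstract does not say which split the scores come from, so evaluation on a held-out test set is assumed, and the labels and predictions are hypothetical values chosen only to make the arithmetic visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical held-out labels and model predictions (1 = deceased, 0 = living).
y_test = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
predictions = {
    "logistic_regression": [1, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    "random_forest": [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
}

scores = {name: accuracy(y_test, pred) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    # Three decimal places are usually enough when reporting accuracy.
    print(f"{name}: {score:.3f}")
print("selected:", best_model)
```
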


fig 2

fig 3



fig 8