A Comparison of Machine Learning methods for survival prediction

Author

Durjoy Dey

Co-authors

Dr. Tamanna Howlader
Srizan Chowdhury

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: CPS Abstract

Keywords: machine learning, simulation, survival analysis

Abstract

The Cox proportional hazards (Cox PH) regression model has to date been the most widely used regression model for the analysis of survival data. In recent years, however, machine learning methods have gained popularity and are increasingly being adapted for use with survival data. Little research has sought to identify the situations in which machine learning methods are preferable to the Cox PH model. This study compares the performance of three classes of machine learning methods, namely tree-based, neural network-based, and penalized regression models, with that of the Cox PH model in a Monte Carlo simulation study. The methods are DeepSurv, random survival forest, XGBoost, and Cox PH with L1 (LASSO), L2 (ridge), and combined L1 and L2 (elastic-net) penalization. The methods are compared under a variety of experimental conditions that occur in practice. These conditions are generated by varying (i) the censoring proportion (c = 0.3, 0.8) or the number of events per variable (EPV = 10), (ii) the level of collinearity among covariates (r = 0, 0.4, 0.8), and (iii) the type of risk function (linear or non-linear), and by taking various combinations of these simulation parameters. Furthermore, the methods are compared under model misspecification (omission of an interaction effect and inclusion of a noise variable) and under measurement error in covariates. A train-test split approach is used, and optimal hyper-parameters are selected via cross-validation. Performance on the test set is measured by the C-index. Results indicate that when the true risk function is linear, the Cox PH model and the penalized Cox PH regression models perform better than the neural network and tree-based methods, regardless of the censoring proportion. Cox PH also performs best under model misspecification and measurement error in covariates. When the risk function is non-linear, the random survival forest outperforms all other methods in risk prediction, even when the covariates are incorrectly specified or measured with error, irrespective of the EPV or the correlation among covariates.
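The simulation design and the C-index mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The function names `simulate_survival` and `c_index`, the two-covariate setup, the coefficients, the form of the non-linear risk, the exponential event times, and the exponential censoring mechanism are all assumptions chosen for illustration. The sketch uses only the Python standard library and scores the true risk rather than a fitted model.

```python
import math
import random


def simulate_survival(n, r, censor_rate, nonlinear=False, seed=0):
    """Simulate right-censored data with two covariates of correlation r.

    Returns (times, events, risks). Every distributional choice here is an
    assumption of this sketch, not a setting taken from the study.
    """
    rng = random.Random(seed)
    times, events, risks = [], [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = r * z1 + math.sqrt(1 - r * r) * z2  # corr(x1, x2) = r
        # Hypothetical risk functions: linear, or non-linear with an interaction.
        risk = x1 * x1 + x1 * x2 if nonlinear else x1 - x2
        # Exponential event time with hazard exp(risk), by inverse transform.
        t_event = -math.log(1.0 - rng.random()) / math.exp(risk)
        # Independent exponential censoring; a larger rate censors more subjects.
        t_censor = rng.expovariate(censor_rate)
        times.append(min(t_event, t_censor))
        events.append(1 if t_event <= t_censor else 0)
        risks.append(risk)
    return times, events, risks


def c_index(times, events, scores):
    """Harrell's C: share of comparable pairs that the score ranks correctly.

    A pair is comparable when the subject with the shorter time had an event.
    A higher score means a higher predicted risk; tied scores count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Usage: score the true linear risk on one simulated data set.
times, events, risks = simulate_survival(n=300, r=0.4, censor_rate=0.5, seed=1)
print(round(c_index(times, events, risks), 3))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In a study like the one described, the score would come from a model fitted on the training split and be evaluated on the test split, rather than from the true risk as in this sketch.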