A Comparison of Machine Learning methods for survival prediction
64th ISI World Statistics Congress - Ottawa, Canada
Format: CPS Abstract
Keywords: machine learning, simulation, survival analysis
Session: CPS 22 - Survival statistics
Monday 17 July 4 p.m. - 5:25 p.m. (Canada/Eastern)
The Cox proportional hazards (Cox PH) regression model has to date been the most widely used regression model for the analysis of survival data. In recent years, however, machine learning methods have gained popularity and are increasingly being adapted for use with survival data. Little research has examined the situations in which machine learning methods are preferable to the Cox PH model. This study compares the performance of three classes of machine learning methods, namely tree-based, neural network-based and penalized regression models, with that of the Cox PH model in a Monte Carlo simulation study. The methods are DeepSurv, random survival forest, XGBoost, and Cox PH with L1 (LASSO), L2 (ridge), and combined L1 and L2 (elastic-net) penalization. The methods are compared under a variety of experimental conditions that occur in practice. These conditions are generated by varying (i) the censoring proportion (c = 0.3, 0.8) or the number of events per variable (EPV = 10), (ii) the level of collinearity among covariates (r = 0, 0.4, 0.8) and (iii) the type of risk function (linear or non-linear), and by taking various combinations of these simulation parameters. Furthermore, the methods are compared under model misspecification (omission of an interaction effect and inclusion of a noise variable) or measurement error in covariates. A train-test split is used and optimal hyper-parameters are selected via cross-validation. Performance on the test set is measured by the C-index. Results indicate that when the true risk function is linear, the Cox PH model and the penalized Cox PH regression models perform better than the neural network and tree-based methods regardless of the censoring proportion. Cox PH also performs best under model misspecification and measurement error in covariates.
When the risk function is non-linear, the random survival forest outperforms all other methods for risk prediction, even when the covariates are incorrectly specified or measured with error, irrespective of the EPV or correlation values.
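
To make the simulation design and the evaluation metric concrete, the following is a minimal sketch, not the authors' code. It draws two covariates with a chosen correlation, generates exponential event times from either a linear or a non-linear risk function, censors a chosen proportion of observations, and computes Harrell's C-index. The function names (`simulate`, `c_index`), the two risk functions, the exponential baseline, and the censoring mechanism are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def simulate(n, rho, nonlinear, censor_rate, seed=0):
    """Return (times, events, risks) for n subjects (illustrative design only)."""
    rng = random.Random(seed)
    times, events, risks = [], [], []
    for _ in range(n):
        # Two standard-normal covariates with correlation rho.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        # Hypothetical risk functions (log-hazards); not taken from the study.
        if nonlinear:
            risk = 0.5 * x1 * x2 + 0.5 * (x1 ** 2 - 1)
        else:
            risk = 0.5 * x1 - 0.5 * x2
        # Exponential event time with hazard exp(risk).
        t_event = rng.expovariate(math.exp(risk))
        # Simplified censoring (an assumption): each subject is censored with
        # probability censor_rate, at a uniform fraction of its event time.
        if rng.random() < censor_rate:
            times.append(t_event * rng.random())
            events.append(0)
        else:
            times.append(t_event)
            events.append(1)
        risks.append(risk)
    return times, events, risks


def c_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


if __name__ == "__main__":
    t, e, r = simulate(n=500, rho=0.4, nonlinear=False, censor_rate=0.3)
    print("censored fraction:", 1 - sum(e) / len(e))
    print("C-index of the true risk score:", c_index(t, e, r))
```

In a full comparison, the true risk score passed to `c_index` would be replaced by each fitted model's predicted risk on the held-out test set. A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking.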