Multilevel categorization of continuous variables in the tasks of explaining predictive estimates of machine learning models in clinical medicine
https://doi.org/10.25881/18110193_2023_3_44
Abstract
Aim: To compare the quality of predictive models of in-hospital mortality (IHM) in patients with ST-segment elevation myocardial infarction (STEMI) after percutaneous coronary intervention (PCI), developed using predictors in continuous, dichotomous, and multilevel categorical forms.
Materials and methods: This was a single-center retrospective study analyzing data from 4677 medical records of patients with STEMI who underwent PCI and were treated at the Regional Vascular Center of Vladivostok. Two groups of patients were identified: the first consisted of 318 (6.8%) patients who died in hospital, and the second of 4359 (93.2%) patients with a favorable treatment outcome. Predictive models of IHM with continuous variables were developed using multivariate logistic regression, random forest, and stochastic gradient boosting. Predictors were dichotomized using a grid search for optimal cutoff points, centroid calculation, and Shapley additive explanations (SHAP). For multilevel categorization, it was proposed to combine the threshold values identified during dichotomization and to rank the cutoff thresholds using the weighting coefficients of multivariate logistic regression.
Results: A multistage analysis of indicators of the clinical and functional status of STEMI patients identified and validated new predictors of IHM. These predictors were categorized, and prognostic models with continuous, dichotomous, and multilevel categorical variables were developed (AUC: 0.885-0.902). Models whose predictors were identified using the multimetric categorization method were not inferior in accuracy to models with continuous variables and had higher quality metrics than algorithms with dichotomous predictors. The advantage of models with multilevel categorization of predictors was that the results of IHM prediction could be explained and clinically interpreted.
Conclusions: Multilevel categorization of predictors is a promising tool for explaining predictive scores in clinical medicine.
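The thresholding steps named in the Materials and methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the grid-search criterion, so Youden's J is assumed here; the centroid cutoff is read as the midpoint between class means; the SHAP-based threshold is omitted; and the function names and the synthetic "age" data are hypothetical.

```python
import numpy as np


def grid_search_cutoff(x, y, n_grid=200):
    """Scan candidate cutoffs and return (cutoff, J) maximizing Youden's J.

    Assumption: the rule 'x >= cutoff' predicts the positive class (death).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    # Interior grid points only, so both sides of the cutoff are non-empty.
    candidates = np.linspace(x.min(), x.max(), n_grid)[1:-1]
    best_cut, best_j = None, -np.inf
    for c in candidates:
        pred = x >= c
        sensitivity = np.mean(pred[y])
        specificity = np.mean(~pred[~y])
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cut, best_j = float(c), float(j)
    return best_cut, best_j


def centroid_cutoff(x, y):
    """Midpoint between the two class means (one simple 'centroid' reading)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    return 0.5 * (x[y].mean() + x[~y].mean())


def multilevel_categorize(x, thresholds):
    """Combine several thresholds into ordered categories 0..k."""
    edges = np.unique(np.asarray(thresholds, dtype=float))  # sorted, deduplicated
    return np.digitize(np.asarray(x, dtype=float), edges)


if __name__ == "__main__":
    # Synthetic data only; class sizes loosely mimic a rare outcome.
    rng = np.random.default_rng(0)
    age = np.concatenate([rng.normal(60, 10, 930), rng.normal(72, 10, 70)])
    died = np.concatenate([np.zeros(930, dtype=bool), np.ones(70, dtype=bool)])

    cut_grid, j = grid_search_cutoff(age, died)
    cut_centroid = centroid_cutoff(age, died)
    levels = multilevel_categorize(age, [cut_grid, cut_centroid])

    print("grid-search cutoff:", round(cut_grid, 1), "Youden J:", round(j, 3))
    print("centroid cutoff:", round(cut_centroid, 1))
    print("patients per category:", np.bincount(levels))
```

The resulting ordinal categories could then be one-hot encoded and passed to a logistic regression, whose coefficients would give one way to rank the thresholds as the abstract describes.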
About the Authors
K. I. Shakhgeldyan
Russian Federation
DSc, Associate Professor
Vladivostok
B. I. Geltser
Russian Federation
Corr. Member of the RAS, DSc, Professor
Vladivostok
N. S. Kuksin
Russian Federation
Vladivostok
I. G. Domzhalov
Russian Federation
Vladivostok
Review
For citations:
Shakhgeldyan K.I., Geltser B.I., Kuksin N.S., Domzhalov I.G. Multilevel categorization of continuous variables in the tasks of explaining predictive estimates of machine learning models in clinical medicine. Medical Doctor and Information Technologies. 2023;(3):44-57. (In Russ.) https://doi.org/10.25881/18110193_2023_3_44