Medical Doctor and Information Technologies

Errors in real-world data: a review

https://doi.org/10.25881/18110193_2024_1_28

Abstract

There is growing interest in using big data from real-world clinical practice to develop artificial intelligence systems, in particular diagnostic and prognostic models of diseases and conditions. However, the quality of such data is usually low because of data-entry errors, suboptimal architecture of information systems, lack of standardization, and other factors. This review examines criteria for the reliability of real-world data, the most common problems, and ways to address them: assessing whether the data set fits the design of the model being developed, identifying and removing duplicate records, handling missing values, detecting and handling outliers, and identifying and handling inconsistencies in the data. We conclude that methods for building data sets from real-world data need further development to improve data quality, since poor-quality data can lead to lower quality of the resulting machine learning models for diagnosis and prognosis.
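The cleaning steps listed in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the records, field names (`patient_id`, `sbp`, `pregnant`), the choice of Tukey's fences for outliers, median imputation for missing values, and the single consistency rule are all assumptions made for illustration, and a real data set would need methods chosen for its own structure and missingness mechanism.

```python
from statistics import median, quantiles

# Toy records; every field name and value here is hypothetical.
RECORDS = [
    {"patient_id": 1, "sex": "F", "age": 54, "sbp": 128, "pregnant": False},
    {"patient_id": 1, "sex": "F", "age": 54, "sbp": 128, "pregnant": False},   # exact duplicate
    {"patient_id": 2, "sex": "M", "age": 61, "sbp": None, "pregnant": False},  # missing value
    {"patient_id": 3, "sex": "M", "age": 47, "sbp": 1300, "pregnant": False},  # probable entry error
    {"patient_id": 4, "sex": "M", "age": 39, "sbp": 118, "pregnant": True},    # inconsistent fields
    {"patient_id": 5, "sex": "F", "age": 70, "sbp": 142, "pregnant": False},
    {"patient_id": 6, "sex": "F", "age": 33, "sbp": 110, "pregnant": False},
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flag_outliers(rows, field, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[field] for r in rows if r[field] is not None]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    flag = field + "_outlier"
    return [
        {**r, flag: r[field] is not None and not (low <= r[field] <= high)}
        for r in rows
    ]


def impute_median(rows, field):
    """Fill missing values with the median of observed, non-flagged values."""
    flag = field + "_outlier"
    observed = [
        r[field] for r in rows
        if r[field] is not None and not r.get(flag, False)
    ]
    fill = median(observed)
    return [
        {**r,
         field: fill if r[field] is None else r[field],
         field + "_imputed": r[field] is None}
        for r in rows
    ]


def flag_inconsistencies(rows, rule):
    """Mark records for which the given rule reports a contradiction."""
    return [{**r, "inconsistent": rule(r)} for r in rows]


def clean(rows):
    rows = drop_exact_duplicates(rows)
    rows = flag_outliers(rows, "sbp")
    rows = impute_median(rows, "sbp")
    # Assumed example rule: a record marked male and pregnant is contradictory.
    return flag_inconsistencies(rows, lambda r: r["sex"] == "M" and r["pregnant"])


cleaned = clean(RECORDS)
for row in cleaned:
    print(row)
```

The sketch flags outliers and inconsistencies rather than deleting or overwriting them, so that a reviewer can decide how each flagged record should be handled.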

About the Authors

N. A. Ermakova
Pirogov Russian National Research Medical University
Russian Federation


A. V. Gusev
Federal Research Institute for Health Organization and Informatics
Russian Federation

PhD



O. Yu. Rebrova
Pirogov Russian National Research Medical University
Russian Federation

DSc




Review

For citations:


Ermakova N.A., Gusev A.V., Rebrova O.Yu. Errors in real-world data: a review. Medical Doctor and Information Technologies. 2024;(1):28-43. (In Russ.) https://doi.org/10.25881/18110193_2024_1_28


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1811-0193 (Print)
ISSN 2413-5208 (Online)