Leveraging Electronic Health Records and Artificial Intelligence for Automated Health Outcome Analysis
Manuel C. Cossio, MMed, MEng, HE-Xperts Consulting LLC, Zürich, Switzerland, and University of Barcelona, Spain; Ramiro E. Gilardino, MD, MSc, MHS, HE-Xperts Consulting LLC, Zürich, Switzerland
Introduction
The digitization of medical data has grown substantially in recent years.1 An increasing number of diagnostic tools and medical devices are connected to patients’ records and transfer data to a network or cloud for monitoring and treatment purposes (Figure 1). Wearables and digital health apps that record patients’ vital signs at home have further increased the amount of data that can be transferred to databases.2 This large collection of medical data is a valuable resource for finding patient populations that share a particular condition. Clustering techniques can be used to group patients with the same pathology, and artificial intelligence (AI) algorithms, specifically machine learning (ML), can then be applied to automatically identify new patients.3 When new patients are admitted, diagnostic tests are conducted and the results are added to their electronic health records (EHRs). The algorithm can then analyze the available information and predict the probability that a patient has the condition.3,4 This capability makes it possible to analyze hundreds of patients swiftly across different medical institutions and to create a patient care plan organized by the severity of each patient’s condition. The models can also determine which factors most influence a patient’s classification, which helps identify the diagnostic tests that matter most for determining disease severity.5,6 Implementing these algorithms has the potential to speed up the patient care process, leading to shorter wait times and improved quality of care. The automatic analysis of data may also provide insights into other areas of healthcare, such as the level of integration between diagnostic units and treatment facilities.
Figure 1. Graphical representation of the process of building and training artificial intelligence algorithms, specifically machine learning, for automatic patient screening. 1 represents the medical data sources that can be integrated, such as EHRs and social determinants of health (SDoH); 2, 3, and 4 represent clinical processing: outcome and case-control selection; 5 and 6 represent the computational part of the process: feature extraction and model construction. Model construction includes the training and testing phases, with selection of the best hyperparameters; 7 represents the operational part of the trained model in the clinical environment.
How Are the Algorithms Trained?
To begin the process, the outcome of interest, such as renal disease, is selected. Access to a database is then obtained and all patients with the specified outcome are extracted, with anonymization applied to safeguard private information (Figure 1). Pre-existing anonymized public databases are also available, such as the Clinical Practice Research Datalink (CPRD), which contains data on more than 40 million patients. Next, the data are analyzed and cleaned to address issues such as missing values. Medical experts can manually choose the outcome-related variables, or the model can autonomously determine the most relevant variables without expert input. Control subjects without the outcome are also selected. Finally, a suitable machine learning model, such as a support vector machine, is selected and the input variables are fed into it with the outcome as the label. The model parameters are iteratively adjusted until the best accuracy is achieved (Figure 1). The trained model is then exported and tested on unseen data.3,5–7
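The training steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in any cited study: the cohort is synthetic, the two features (a creatinine-like value and age) and their distributions are invented, missing values are handled by simple mean imputation, and the support vector machine is a linear one trained by stochastic subgradient descent on the hinge loss so that the example needs only the standard library. The regularization strength `lam` plays the role of the hyperparameter that is tuned on a validation split before the final check on unseen data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def make_patient(has_outcome):
    """Synthetic record (features, label). Values are invented, not clinical."""
    creatinine = random.gauss(2.0 if has_outcome else 1.0, 0.4)
    age = random.gauss(68 if has_outcome else 55, 10)
    if random.random() < 0.1:  # simulate a missing lab value in about 10% of records
        creatinine = None
    return [creatinine, age], (1 if has_outcome else -1)

# Cases (outcome present) and controls (outcome absent), then a three-way split.
cohort = [make_patient(True) for _ in range(150)] + [make_patient(False) for _ in range(150)]
random.shuffle(cohort)
train, valid, test = cohort[:180], cohort[180:240], cohort[240:]

def column_stats(rows):
    """Per-feature mean and standard deviation, ignoring missing values."""
    stats = []
    for j in range(len(rows[0][0])):
        vals = [x[j] for x, _ in rows if x[j] is not None]
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        stats.append((mean, std))
    return stats

def prepare(rows, stats):
    """Mean-impute missing values, then standardize with training statistics."""
    return [([((m if v is None else v) - m) / s for v, (m, s) in zip(x, stats)], y)
            for x, y in rows]

def fit_linear_svm(rows, lam, lr=0.01, epochs=30, seed=1):
    """Minimize hinge loss plus an L2 penalty with stochastic subgradient steps."""
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0][0]), 0.0
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - lr * lam * wi for wi in w]  # L2 shrinkage, always applied
            if margin < 1:  # hinge term only when the margin is violated
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def accuracy(model, rows):
    w, b = model
    hits = sum(1 for x, y in rows
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1) == y)
    return hits / len(rows)

stats = column_stats(train)
train_z, valid_z, test_z = (prepare(part, stats) for part in (train, valid, test))

# Hyperparameter selection on the validation split only.
best_lam, best_model, best_valid = None, None, -1.0
for lam in (0.001, 0.01, 0.1):
    model = fit_linear_svm(train_z, lam)
    score = accuracy(model, valid_z)
    if score > best_valid:
        best_lam, best_model, best_valid = lam, model, score

# Final check on data the model has never seen.
test_accuracy = accuracy(best_model, test_z)
print(f"chosen lambda={best_lam}, validation={best_valid:.2f}, test={test_accuracy:.2f}")
```

Keeping the test split untouched until the very end is what makes the final accuracy an honest estimate; imputation and scaling statistics are likewise computed on the training split only.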
Origins of the Data
Data come from several sources. EHR systems gather patient data at the national or regional level in public and private healthcare systems (eg, centralized EHRs in some European countries).4 Because these systems are implemented in hospitals throughout a region, research databases can also be built from them. These databases offer a high level of anonymity so that the data can be used without endangering patient confidentiality. One example is the SAIL Databank in the United Kingdom,9 which has made it possible to create a wide range of ML applications using EHR data. Finally, there are closed EHR systems, such as those in the United States, that typically combine networks of hospitals run by the same company or by private health insurers. To enable the creation of digital applications, these systems have also created their own databases, although given the number of patients they serve, they are typically smaller than the others.10
Medical Specialties With Higher Development Rates
The leading medical specialties are cardiovascular medicine, psychiatry, oncology, diabetes, and neurology. The conditions most commonly studied are type 2 diabetes, suicide attempts, acute kidney injury, depression, and heart failure.5,6
The primary objective behind the development of these applications was to determine which variables within the algorithm had the greatest impact on the outcome. Variable ranking, population screening programs, and automatic diagnostic tools were among the remaining motivations for the development of early detection systems. A promising application of these algorithms is predicting complications in patients undergoing surgical procedures.7
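One common, model-agnostic way to rank variables is permutation importance: shuffle a single variable across patients and measure how much the model's accuracy drops. The sketch below is an illustration under assumptions and is not taken from the cited studies. The `predict` function is a stand-in for a trained classifier with invented weights, the data are synthetic, and the variable names are hypothetical.

```python
import random

rng = random.Random(42)
FEATURES = ["serum_creatinine", "age", "visit_weekday"]  # hypothetical variable names

def predict(x):
    """Stand-in for a trained classifier; the weights are invented for the sketch."""
    return 1 if 2.0 * x[0] + 0.5 * x[1] > 0 else 0

# Synthetic standardized data whose label follows the same rule plus noise.
X = [[rng.gauss(0, 1) for _ in FEATURES] for _ in range(500)]
y = [1 if 2.0 * x[0] + 0.5 * x[1] + rng.gauss(0, 0.5) > 0 else 0 for x in X]

def accuracy(rows, labels):
    return sum(predict(x) == t for x, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, column, repeats=10):
    """Mean drop in accuracy after shuffling one column across patients."""
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [x[column] for x in rows]
        rng.shuffle(shuffled)
        permuted = [x[:column] + [v] + x[column + 1:] for x, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels))
    return sum(drops) / repeats

importances = {name: permutation_importance(X, y, j) for j, name in enumerate(FEATURES)}
ranking = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
for name, drop in ranking:
    print(f"{name}: {drop:.3f}")
```

A variable that the model never uses (here `visit_weekday`) yields a drop of exactly zero, while the variable carrying most of the signal ranks first.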
Structure and Data Types
EHR data come in 2 forms. The first is structured data, which consists of predefined categorical and continuous variables. The second is unstructured data, which comes in the form of medical notes.11 These notes must be processed with natural language processing techniques to extract variables that algorithms can use, such as medications or patient symptoms.5 Along with clinical data, the inclusion of social determinants of health (SDoH) in EHRs is becoming increasingly important and plays a key role in the analysis of population-level findings.5,6,11
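To make the step from free text to variables concrete, the following sketch turns one invented clinical note into a structured row. It is deliberately simple: the vocabularies are hypothetical, matching is by keyword, and negation is handled with a crude same-clause rule. Real clinical text extraction relies on far more capable methods, so this should be read only as an illustration of the input and output shapes.

```python
import re

# Hypothetical vocabularies; a production system would use a clinical terminology.
MEDICATIONS = ["metformin", "lisinopril", "furosemide"]
SYMPTOMS = ["fatigue", "shortness of breath", "edema", "chest pain"]
NEGATION = re.compile(r"\b(no|denies|without)\b", re.IGNORECASE)

def find_terms(note, vocabulary):
    """Return vocabulary terms that appear in the note and are not negated
    earlier in the same clause (clauses end at '.', ';' or ',')."""
    found = []
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, note, re.IGNORECASE):
            clause_start = max(note.rfind(c, 0, match.start()) for c in ".;,") + 1
            preceding = note[clause_start:match.start()]
            if not NEGATION.search(preceding):
                found.append(term)
                break  # one non-negated mention is enough
    return found

# Invented example note.
note = ("Patient reports fatigue and shortness of breath on exertion. "
        "Denies chest pain. Continues metformin 500 mg twice daily; "
        "furosemide started for leg edema.")

structured_row = {
    "medications": find_terms(note, MEDICATIONS),
    "symptoms": find_terms(note, SYMPTOMS),
}
print(structured_row)
```

The negated mention ("Denies chest pain") is excluded, which is the kind of distinction that makes note processing harder than a plain keyword search.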
Codes for Data Classification
Two of the most widely utilized coding systems are the International Classification of Diseases (ICD) and Current Procedural Terminology (CPT) codes. The ICD codes play a critical role in defining the desired outcome and identifying comorbid conditions,12 while CPT codes provide information about a patient’s resource use.13 The significance of these codes lies in their ability to standardize data collection across international systems through their specificity.
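As an illustration of how these codes feed a model, the sketch below derives an outcome flag and comorbidity flags from ICD-10 category prefixes and summarizes resource use by counting CPT codes. The patient records are invented, the choice of prefixes for a renal outcome is an assumption made for the example, and the CPT codes are treated as opaque identifiers.

```python
from collections import Counter

# Invented patient records with ICD-10-CM-style diagnosis codes and CPT procedure codes.
patients = {
    "P001": {"icd10": ["N18.3", "E11.9", "I10"], "cpt": ["99213", "99213", "80053"]},
    "P002": {"icd10": ["I50.9", "E11.9"], "cpt": ["99214"]},
    "P003": {"icd10": ["J45.909"], "cpt": ["99213"]},
}

# ICD-10 categories: N17 acute kidney failure, N18 chronic kidney disease,
# N19 unspecified kidney failure. Using these three as "the outcome" is an assumption.
OUTCOME_PREFIXES = ("N17", "N18", "N19")
COMORBIDITY_PREFIXES = {
    "type_2_diabetes": ("E11",),
    "hypertension": ("I10",),
    "heart_failure": ("I50",),
}

def has_code(codes, prefixes):
    """True if any diagnosis code falls under one of the category prefixes."""
    return any(code.startswith(prefixes) for code in codes)

rows = []
for pid, record in patients.items():
    row = {"patient_id": pid, "renal_outcome": has_code(record["icd10"], OUTCOME_PREFIXES)}
    for name, prefixes in COMORBIDITY_PREFIXES.items():
        row[name] = has_code(record["icd10"], prefixes)
    # Resource use: how often each procedure code was billed for this patient.
    row["procedure_counts"] = dict(Counter(record["cpt"]))
    rows.append(row)

for row in rows:
    print(row)
```

Patients with the outcome flag become cases and the rest are candidate controls, while the comorbidity flags and procedure counts can serve as input variables.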
Impact of the Models on Outcomes Analysis
The utilization of AI models, specifically those incorporating ML, has many effects on healthcare outcome assessment. The initial step in the development of these models is a thorough examination of data, which enables the identification of the most crucial variables relevant to outcomes.3 This prioritization streamlines decision making and reduces the financial burden on healthcare. Additionally, the data analysis can uncover variables with missing information and trigger a probe into the reason behind these gaps. One solution to this issue is to group patients with a high number of missing values and examine their SDoH. Research has demonstrated that there is a correlation between SDoH and outcomes, revealing that Black patients in the US healthcare system tend to receive substandard care compared to White patients.14 The models can also be trained to predict the likelihood of complications postsurgery, providing value in ensuring optimal medical care for patients and lowering the cost of managing postoperative complications.7 Furthermore, the models can be trained to identify common coexisting conditions linked to diseases. This recognition allows for the anticipation of complications during treatment and the implementation of effective preventive measures, leading to a delay in the onset of comorbidities, enhancing the patient’s quality of life, and having a positive impact on the healthcare system.
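The missing-data check mentioned above can be sketched as follows. The records, the clinical variables, the use of insurance status as a stand-in SDoH variable, and the 50% cut-off are all assumptions chosen for illustration; the point is only to show how missingness can be summarized per variable and then compared across SDoH groups.

```python
from collections import defaultdict

# Invented records: None marks a missing value; "insurance" stands in for an SDoH variable.
records = [
    {"id": 1, "insurance": "private", "hba1c": 6.1, "egfr": 88, "bmi": 27.0},
    {"id": 2, "insurance": "none", "hba1c": None, "egfr": None, "bmi": 31.2},
    {"id": 3, "insurance": "public", "hba1c": 7.4, "egfr": None, "bmi": 29.5},
    {"id": 4, "insurance": "none", "hba1c": None, "egfr": None, "bmi": None},
    {"id": 5, "insurance": "private", "hba1c": 5.8, "egfr": 95, "bmi": None},
    {"id": 6, "insurance": "public", "hba1c": None, "egfr": 61, "bmi": 33.1},
]
CLINICAL = ["hba1c", "egfr", "bmi"]
THRESHOLD = 0.5  # assumed cut-off: flag patients missing more than half of the values

# Share of patients with a missing value, per clinical variable.
variable_missing = {
    var: sum(rec[var] is None for rec in records) / len(records) for var in CLINICAL
}

def missing_fraction(rec):
    """Fraction of clinical variables that are missing for one patient."""
    return sum(rec[var] is None for var in CLINICAL) / len(CLINICAL)

# Count highly incomplete records within each SDoH group.
by_group = defaultdict(lambda: {"patients": 0, "high_missing": 0})
for rec in records:
    group = by_group[rec["insurance"]]
    group["patients"] += 1
    group["high_missing"] += int(missing_fraction(rec) > THRESHOLD)

print(variable_missing)
for name, counts in by_group.items():
    print(name, counts)
```

If incomplete records concentrate in one group, as they do in this toy data, that pattern is a prompt to investigate access or data-capture problems rather than to silently impute the values.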
Promising Roles of Deployed Models
Models trained with EHR data also have a significant impact on clinical trials and on the production of real-world evidence (RWE).15 For clinical trials, models can assist in selecting crucial variables for screening participants. SDoH also play a crucial role here, because they contribute to the inclusiveness of the process and help ensure that all patients, regardless of socioeconomic status, are equally considered. For RWE, automated generation is critical for real-time data analysis and for keeping database variables up to date.16 Here, natural language processing (NLP) data extraction plays a significant role. A large portion of the information obtained from medical visits is in the form of free text, making it crucial to extract concise information that can be transformed into variable instances.5,17
Data Privacy and Identity Protection
It is imperative to consider the privacy implications when working with a substantial amount of patient information. Before a project to develop an AI model for patient data analysis begins, it must undergo review and evaluation by an ethics committee. The committee assesses the impact of the project and considers the cost-benefit tradeoff. Subsequently, the project and its prototype must be evaluated by data protection experts, who will consider questions such as where the data are stored, who has access, and where the servers are located. To further safeguard privacy, various techniques, such as homomorphic encryption, the InterPlanetary File System (IPFS), and blockchain technology, can be employed to anonymize patient data and maintain data integrity without compromising confidentiality.18 These privacy protection measures enhance patient trust and encourage the donation of data, ultimately advancing AI and ML applications in the healthcare field.
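The data-integrity idea behind blockchain technology can be illustrated with a simple hash chain, in which each log entry stores the hash of the previous one so that any later modification is detectable. The sketch below is a single-process simplification under stated assumptions: it has no distribution or consensus, it does not encrypt or anonymize anything by itself, and the pseudonym and event names are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def record_hash(payload, previous_hash):
    """SHA-256 over the entry content together with the previous entry's hash."""
    body = json.dumps({"payload": payload, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(chain, payload):
    """Add an entry that is cryptographically linked to the one before it."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev": previous_hash,
        "hash": record_hash(payload, previous_hash),
    })

def verify(chain):
    """Recompute every hash; any edited entry or broken link returns False."""
    previous_hash = GENESIS
    for entry in chain:
        expected = record_hash(entry["payload"], previous_hash)
        if entry["prev"] != previous_hash or entry["hash"] != expected:
            return False
        previous_hash = entry["hash"]
    return True

# Audit log that references patients only by an invented pseudonym.
ledger = []
append_entry(ledger, {"pseudonym": "a91f", "event": "dataset exported"})
append_entry(ledger, {"pseudonym": "a91f", "event": "model trained"})
intact = verify(ledger)

# Tampering with an earlier entry invalidates the chain.
ledger[0]["payload"]["event"] = "dataset deleted"
tampered_detected = not verify(ledger)
print(intact, tampered_detected)
```

Because the log holds only pseudonyms and event descriptions, integrity can be checked without exposing the underlying clinical data.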
References
1. Tiffin N, George A, LeFevre AE. How to use relevant data for maximal benefit with minimal risk: digital health data governance to protect vulnerable populations in low-income and middle-income countries. BMJ Glob Health. 2019;4(2):e001395.
2. Bent B, Wang K, Grzesiak E, et al. The digital biomarker discovery pipeline: an open-source software platform for the development of digital biomarkers using mHealth and wearables data. J Clin Transl Sci. 2021;5(1):e19.
3. Arbet J, Brokamp C, Meinzen-Derr J, Trinkley KE, Spratt HM. Lessons and tips for designing a machine learning study using EHR data. J Clin Transl Sci. 2020;5(1):e21.
4. Fragidis LL, Chatzoglou PD. Implementation of a nationwide electronic health record (EHR): the international experience in 13 countries. Int J Health Care Qual Assur. 2018;31(2):116-130.
5. Cossio C, Gilardino R. RWD102 electronic health records with unstructured text to predict outcomes with machine learning: a therapeutic area fingerprint. Value Health. 2022;25(12):S468.
6. Cossio C, Gilardino R. RWD6 the use of machine learning in electronic health records disease analysis: an updated perspective. Value Health. 2022;25(7):S576-S577.
7. Bronsert M, Singh AB, Henderson WG, Hammermeister K, Meguid RA, Colborn KL. Identification of postoperative complications using electronic health record data and machine learning. Am J Surg. 2020;220(1):114-119.
8. Cossio M. A perspective on the use of health digital twins in computational pathology. 2022. arXiv preprint arXiv:2212.00573.
9. Ford DV, Jones KH, Verplancke J-P, et al. The SAIL Databank: building a national architecture for e-health research and evaluation. BMC Health Serv Res. 2009;9:1-12.
10. Essén A, Scandurra I, Gerrits R, et al. Patient access to electronic health records: differences across ten countries. Health Policy Technol. 2018;7(1):44-56.
11. Wang Y, Ng K, Byrd RJ, et al. Early detection of heart failure with varying prediction windows by structured and unstructured data in electronic health records. Annu Int Conf IEEE Eng Med Biol Soc. 2015:2530-2533.
12. Germaine-Smith CS, Metcalfe A, Pringsheim T, et al. Recommendations for optimal ICD codes to study neurologic conditions: a systematic review. Neurology. 2012;79(10):1049-1055.
13. Simmons CG, Alvey NJ, Kaizer AM, et al. Benchmarking of anesthesia and surgical control times by current procedural terminology (CPT®) codes. J Med Syst. 2022;46(4):19.
14. Mayr FB, Yende S, D’Angelo G, et al. Do hospitals provide lower quality of care to black patients for pneumonia? Crit Care Med. 2010;38(3):759.
15. Estevez M, Benedum CM, Jiang C, et al. Considerations for the use of machine learning extracted real-world data to support evidence generation: a research-centric evaluation framework. Cancers. 2022;14(13):3063.
16. Silva-Tinoco R, Cuatecontzi-Xochitiotzi T, De la Torre-Saldaña V, et al. Influence of social determinants, diabetes knowledge, health behaviors, and glycemic control in type 2 diabetes: an analysis from real-world evidence. BMC Endocr Disord. 2020;20(1):1-11.
17. Datta S, Bernstam EV, Roberts K. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. J Biomed Inform. 2019;100:103301.
18. Kumar R, Kumar J, Khan AA, et al. Blockchain and homomorphic encryption-based privacy-preserving model aggregation for medical images. Comput Med Imaging Graph. 2022;102:102139.

