Points of View: Trends in medical record integration as seen in the case of COVID-19
The Office of Pharmaceutical Industry Research, Norihiro Okada, Senior Researcher
In recent years, the pharmaceutical industry has been actively considering the use of medical records and other information obtained in actual clinical practice for research and development and for epidemiological studies. When using such information, it is often desirable to integrate information collected at multiple medical institutions rather than analyze information from a single institution, both to secure enough cases to improve the accuracy of estimates and to improve the external validity of the results. On the other hand, providing patient information outside a medical institution for integrated analysis is regulated by law in many countries, for example by the Personal Information Protection Act in Japan and the Health Insurance Portability and Accountability Act (HIPAA) in the United States, so integrating patient-level medical information across medical institutions is not easy. In response to this problem, Japan has enacted the Next Generation Medical Infrastructure Act, and measures are being taken to promote the use of medical information, such as allowing personal information to be provided to authorized providers through an opt-out process. Many barriers nevertheless remain, such as the burden on medical institutions of notifying each patient and the difficulty of cooperation among authorized providers, as can be seen in the discussions of the working group1) that has been meeting since the end of last year to review the Act. Much attention has been focused on legislative measures to resolve these issues. Recently, however, various approaches to information integration have also been developed from a technological perspective, aiming to reduce the risk of personal information leaks while preserving the benefits of using medical records for research, and their practical application is being considered.
This paper describes these approaches and their practical application. Taking as a case study the integration of electronic health records (EHRs), centered on electronic medical records, for novel coronavirus infection (COVID-19), where a variety of approaches have been implemented, we discuss how medical information should be integrated.
Methods of Integrating Medical Information
Studies on the research use of databases that integrate information held by medical institutions are being conducted worldwide for a wide variety of diseases, including projects led by national governments and projects aiming at cross-border integration. Integration of information is achieved mainly through the following three methods (Figure 1).
-
(1) Aggregation of individual case information
A method in which the per-patient information held by each medical institution is made accessible from a central integrated environment, or copies of the per-patient information are consolidated into central storage, and the analysis is then performed in the integrated environment.
-
(2) Aggregation of information summarized at medical institutions
A method in which each medical institution summarizes personal information to a granularity at which regulations allow it to be provided externally, only the summarized information is consolidated in central storage, and the analysis is then performed in the integrated environment.
-
(3) Sharing of analysis parameters computed at medical institutions
A method for building prediction and classification models in which each medical institution updates the parameters of the model using the information in its possession, the parameters are consolidated in a central integrated environment, and the information obtained from the consolidated parameters is returned to each medical institution so that the parameter update can be repeated. Frameworks called federated learning and split learning have been devised to build a shared model in this coordinated way using data from all medical institutions.
Since commonly used statistical methods basically assume that all information is stored in a single location, the most suitable integration method for analysis is method (1), which consolidates information on individual cases in one place. In most cases, this method constitutes the provision of personal information to a third party, and thus requires either obtaining consent from the patients who provide the information or anonymizing the information to be provided. From the analysis perspective alone, it is desirable to obtain consent from all patients. However, as suggested by the current situation2), in which there are requests to relax even the opt-out notification requirements of the Next Generation Medical Infrastructure Act, obtaining consent places a heavy burden on medical institutions. In addition, the impact on the analysis of selection bias arising from whether or not patients consent must be considered. From the viewpoint of information management, the third party receiving the information must directly handle personal information or anonymized processed information, which increases both the risk of information leakage and the cost of information management. Anonymized processed information also has its own problems, such as the residual risk of re-identification and the loss of information caused by anonymization. Despite these multiple points of concern, method (1) is still one of the most widely used data integration methods because it is structurally the simplest and has many advantages in terms of analysis.
Alongside the aggregation of information on individual cases, another widely used method is (2), in which information is summarized within each medical institution and the summaries are consolidated in central storage. The simplest example is the fixed-point survey of seasonal influenza epidemics conducted by the Ministry of Health, Labour and Welfare. In the fixed-point survey, each medical institution tabulates the number of patients by gender and age group and submits the results to the public health center, and the number of infected patients at the prefectural level is then calculated from the institutions' totals. When the information required consists of aggregate values such as the number of patients, this method increases the burden of analysis at medical institutions, but there is no need to integrate information on individual cases, and information management is relatively easy.
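The two-stage tabulation used in the fixed-point survey can be sketched as follows. This is only a minimal illustration of method (2), not the Ministry's actual procedure; the record fields (`sex`, `age_group`), the function names, and the sample rows are all hypothetical.

```python
from collections import Counter

# Hypothetical field names; a real survey form defines its own categories.
def summarize_institution(patients):
    """Runs inside one institution: reduce patient rows to counts per (sex, age_group)."""
    return Counter((p["sex"], p["age_group"]) for p in patients)

def combine_summaries(summaries):
    """Runs centrally: add the count tables together; no patient row is received."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Made-up rows for two institutions.
clinic_a = [{"sex": "F", "age_group": "0-9"}, {"sex": "M", "age_group": "0-9"}]
clinic_b = [{"sex": "F", "age_group": "0-9"}, {"sex": "F", "age_group": "10-19"}]

regional = combine_summaries(
    [summarize_institution(clinic_a), summarize_institution(clinic_b)]
)
```

Only the `Counter` objects leave each institution, so the central side can compute regional totals without ever holding a per-patient record.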
Finally, method (3) creates prediction and classification models without sharing information on individual cases by repeatedly sharing analysis parameters. This method has been the subject of much research aimed at practical application, starting with a framework called federated learning3) announced by Google Inc. in 2016. It is a derivative of method (2), the aggregation of summarized information, and likewise does not require patient-level information to be provided to a third party. Compared to building a model at a single medical institution, it makes it possible to construct a model with higher external validity.
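The repeated exchange of parameters in method (3) can be illustrated with a small sketch of federated averaging applied to logistic regression. This is an assumed, simplified setting chosen for illustration, not the procedure of any project described in this paper; the function names, learning rate, number of rounds, and toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_update(weights, rows, labels, lr=0.1, epochs=5):
    """Runs inside one institution: refine the shared weights on local data only."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w, len(rows)  # only the weights and the sample size are sent out

def federated_average(updates):
    """Runs centrally: sample-size-weighted mean of the institutions' weights."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[j] * n for w, n in updates) / total for j in range(dim)]

def train(institutions, dim, rounds=20):
    """Alternate local updates and central averaging for a fixed number of rounds."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, rows, labels) for rows, labels in institutions]
        global_w = federated_average(updates)
    return global_w

# Toy data: each row is [bias term, one feature]; the label is 1 when the feature is positive.
site_a = ([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]], [0, 0, 1, 1])
site_b = ([[1.0, -3.0], [1.0, -0.5], [1.0, 0.5], [1.0, 3.0]], [0, 0, 1, 1])
global_weights = train([site_a, site_b], dim=2)
```

Weighting each institution by its sample size is one common choice; other aggregation rules are possible and may suit institutions of very different sizes better.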
Examples of information integration methods in COVID-19
To understand the latest trends in methods of integrating medical information, this paper summarizes the characteristics of each integration method and its approach to protecting personal information, drawing on the integration of EHRs for COVID-19, for which a variety of databases have been created worldwide. As examples of method (1), the aggregation of information on individual cases, we introduce databases created with the support of government agencies in the United States and the United Kingdom (England); as an example of method (2), the aggregation of summarized information, the largest international consortium for EHR data on COVID-19; and as an example of method (3), the sharing of analysis parameters, a study conducted by a medical institution at Harvard Medical School and NVIDIA Corporation.
(1)-1 National COVID Cohort Collaborative (N3C)4)
The National Center for Advancing Translational Sciences (NCATS)5), one of the research institutes of the National Institutes of Health in the United States, is leading a project to build an integrated database of EHRs of COVID-19 patients for research purposes. The database contains information on more than 6.4 million COVID-19 patients collected from more than 50 U.S. medical institutions (as of November 2021)6). The data can be used for research purposes after personally identifiable information is removed and the data are anonymized in accordance with HIPAA.7) HIPAA specifies 18 identifiers to be removed (names, addresses, dates, telephone numbers, fax numbers, e-mail addresses, social security numbers, medical record numbers, insurance numbers, account numbers, various license numbers, vehicle numbers, device identification numbers, URLs, IP addresses, biometric identifiers, personal photographs, and other unique identification numbers), but N3C retains address information (zip codes) and date information for pandemic tracking purposes. Therefore, the data sets provided are classified into security levels according to the purpose of analysis, and depending on the level, additional measures are taken, such as IRB approval and restrictions on access from institutions outside the country (Table 1). Technical measures for information protection have also been taken: access to the data is confined to an analysis platform (Palantir Foundry) provided on GovCloud (Amazon Web Services), which meets the requirements defined by the US government, so that all access to information and the entire history of result output are managed centrally. The analysis environment includes Python and R, as well as Apache Spark and BI tools.
(1)-2 OpenSAFELY8)
OpenSAFELY is an EHR analysis platform built in the wake of the COVID-19 epidemic, developed under the auspices of the National Health Service (NHS England) and led by the University of Oxford. Patient records are provided by electronic health record vendors, and the platform covers approximately 58 million patient records in the UK, including information on non-COVID-19 patients9) (as of October 2021)10). In the UK, a notice of exception11) has been issued for COVID-19 research using patient information, so consent for research use is not obtained from patients when data are aggregated. The information collected is protected in accordance with the General Data Protection Regulation (GDPR) and the Data Protection Act, and is pseudonymized using a hash function. On the technical side of information protection, data are managed centrally as in N3C, with all access to information and the history of result output controlled, but a further feature is a structure that makes it impossible to access the actual data, even for researchers who have been approved for data access. This is achieved through container technology, code management systems, and similar tools, so that researchers can obtain output without ever viewing the data (Figure 2). The source code is also publicly available on GitHub, and the system is designed in an advanced manner to protect personal information, including by ensuring the transparency of the analyses performed.
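Pseudonymization with a hash function can be sketched as below. The text above states only that a hash function is used; the keyed HMAC-SHA-256 construction, the function name `pseudonymize`, and the sample key and identifiers are assumptions made for illustration and should not be read as OpenSAFELY's actual scheme.

```python
import hashlib
import hmac

def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (assumed scheme: HMAC-SHA-256).

    The same identifier always maps to the same pseudonym, so records can still
    be linked, but the mapping cannot be recomputed without the secret key.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key and identifiers; a real key would be generated and stored securely.
key = b"example-key-held-by-the-data-controller"
pseudonym_1 = pseudonymize("patient-0001", key)
pseudonym_2 = pseudonymize("patient-0002", key)
```

A keyed hash is used here rather than a plain hash because identifiers drawn from a small, guessable space could otherwise be recovered by hashing every candidate value.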
(2) Consortium for Clinical Characterization of COVID-19 by EHR (4CE)12)
4CE is an international consortium led by organizations that aim to share, integrate, and standardize medical data, including EHRs, and consists of 315 medical institutions in six countries. The database contains approximately 80,000 patient records (as of October 2021)13). Participating medical institutions perform the analysis within their own organizations and manage the aggregate results in a central storage system. Consent for research use is not obtained from patients because the data provided outside the organization does not contain personal information. Participating medical institutions have implemented a standardized platform for medical data and manage information using a similar data structure, so that analysis can be completed at each institution simply by executing a common analysis script created centrally.
(3) EXAM (EMR CXR AI Model) consortium14)
The study15), whose results were published in September 2021, used approximately 16,000 patient records. As with method (2), consent for research use is not obtained from patients because the data provided outside the organization does not contain personal information. Integration methods (2) and (3), which do not involve the transfer of information on individual cases, are often employed in multinational projects that would otherwise involve cross-border transfer of information.
Table 2 provides an overview of the EHR integration cases presented in this paper.
Characteristics of the database in terms of results obtained
How to integrate information should be decided after the intended use of the information has been clarified. However, since accumulating medical records takes several years to several decades, discussions on how to integrate information often precede the determination of the purpose of use, both to take a long-term perspective and to make the same database applicable to a wide range of research. Here, we return to this starting point and discuss appropriate methods of integrating data according to the purpose of use.
The validity of analysis using EHRs has been debated from various perspectives, including the definition of outcomes, unmeasured confounding factors, and mechanisms of missing measurements, but this paper proceeds on the assumption that the desired analysis can be performed when information on individual cases is integrated. The analyses conducted using the information obtained from EHRs, which are used as examples in this report, can be broadly classified into the following three types.
- A. Calculation of summary statistics
- B. Construction of classification and prediction models
- C. Comparison between treatments and drugs
A. Calculation of summary statistics
In the COVID-19 case study, this applies to the aggregation of background information such as the number of positive PCR tests, the number of deaths, and the age and gender of the target patients. In this type of analysis, even when summarized information is aggregated, it is possible to obtain results with almost the same accuracy as when information on individual cases is aggregated. For example, if one wishes to create a Kaplan-Meier curve and estimate survival time for outcome information after hospitalization, it is possible to do so by processing the summarized data, and it is not necessary to aggregate information on individual cases (Figure 3).
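How a Kaplan-Meier curve can be obtained from summarized data alone can be sketched as follows. The sketch assumes that each institution reports, for every observed time point, only the number of deaths and the number of censored patients; the function names and toy records are hypothetical, and this is not necessarily the exact procedure behind Figure 3.

```python
from collections import defaultdict

def summarize_institution(records):
    """Runs inside one institution.

    records: list of (time, event), event 1 = death, 0 = censored.
    Returns {time: [deaths, censored]}, the only data that leaves the institution.
    """
    table = defaultdict(lambda: [0, 0])
    for time, event in records:
        table[time][0 if event else 1] += 1
    return dict(table)

def pooled_kaplan_meier(tables):
    """Runs centrally: merge the count tables and apply the product-limit formula."""
    pooled = defaultdict(lambda: [0, 0])
    for table in tables:
        for time, (deaths, censored) in table.items():
            pooled[time][0] += deaths
            pooled[time][1] += censored
    at_risk = sum(d + c for d, c in pooled.values())
    survival, curve = 1.0, []
    for time in sorted(pooled):
        deaths, censored = pooled[time]
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        at_risk -= deaths + censored  # patients censored at t count as at risk at t
    return curve

# Toy follow-up data (days after hospitalization) for two institutions.
site_a = [(1, 1), (2, 0), (3, 1)]
site_b = [(1, 1), (3, 0), (4, 1)]
curve = pooled_kaplan_meier([summarize_institution(site_a), summarize_institution(site_b)])
```

Because the estimator depends on the data only through the numbers of deaths and patients at risk at each time point, the pooled curve equals the one computed from the combined individual records. In practice, institutions may need to coarsen time points or suppress very small counts before sharing.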
From the standpoint of privacy protection, unless consent has been obtained from the patient, when anonymized information is consolidated, it is generally processed to remove specific descriptions (e.g., k-anonymization) in order to reduce the risk of information re-identification. In situations where this processing is necessary, it is carried out in the same way when information on individual cases is aggregated, so there is no difference in the impact on the results obtained. On the other hand, by providing the results of aggregation at medical institutions to outside organizations, it is possible to remove the association between variables on an individual basis, thus reducing the risk of re-identification of individuals by combining multiple pieces of information. Thus, except for the burden required at the time of aggregation at each medical institution, there is a significant advantage over the method of aggregating information on individual cases, and a structure that does not aggregate individual information is desirable when calculating summary statistics. However, in the pharmaceutical industry, there are few issues that can be solved only with summary statistics, so the following items also need to be considered.
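The idea behind k-anonymization can be illustrated with a minimal sketch that keeps a record only if at least k records share the same combination of quasi-identifiers. Real k-anonymization usually also generalizes values (for example, widening age bands) before suppressing records; the function name, column names, and sample rows here are hypothetical.

```python
from collections import Counter

def suppress_small_groups(rows, quasi_identifiers, k):
    """Keep only rows whose quasi-identifier combination occurs at least k times."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    return [row for row in rows if sizes[key(row)] >= k]

# Made-up records: the single 80-89-year-old man would be easy to re-identify.
rows = [
    {"age_group": "60-69", "sex": "F", "outcome": "discharged"},
    {"age_group": "60-69", "sex": "F", "outcome": "died"},
    {"age_group": "80-89", "sex": "M", "outcome": "discharged"},
]
released = suppress_small_groups(rows, ["age_group", "sex"], k=2)
```

The suppressed record illustrates the loss of information mentioned above: stronger protection (a larger k) removes or coarsens more of the data available for analysis.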
B. Construction of classification and prediction models
One of the most in-demand analyses using medical records is the construction of classification and prediction models. In the pharmaceutical industry, these models are widely used not only to predict the effects of drugs, but also to adjust for confounding during risk assessment of adverse events and to search for biomarkers. From the standpoint of analysis accuracy, it is most desirable to aggregate information on individual cases, but there are also many studies on analysis methods that approach the accuracy of aggregating information on individual cases while taking into consideration the protection of personal information. This section focuses on the construction of a model to predict the risk of severe disease and death in COVID-19, and introduces the approaches used in each data integration method.
In method (1), which aggregates information on individual cases, the goal is achieved by having researchers build models in the environment where the information is aggregated. As mentioned above, N3C in the U.S. and OpenSAFELY in the U.K. impose restrictions such as analysis only in a designated environment or analysis in which the actual data cannot be viewed. Within these restrictions, N3C has constructed a model to predict the severity of COVID-19 using patient background (age, gender, complications, etc.) and laboratory values16). Since calculations can be performed centrally, various models are applied in an exploratory manner, and studies are being conducted to select the optimal model. Similarly, OpenSAFELY is building models to predict mortality risk based on patient background and comorbidities17).
On the other hand, when a model is constructed by method (2), which aggregates only summarized information, it is difficult to use information from other medical institutions to adjust model parameters. 4CE constructed a model to predict the mortality risk of COVID-19 using patient background and laboratory values, taking an approach in which each medical institution builds a model using its own information and the resulting model is then applied to the data of other medical institutions18). By validating the created models at other medical institutions and integrating the results using meta-analysis methods, an attempt is being made to evaluate external validity while organizing the characteristics of the models on a country-by-country and institution-by-institution basis (Figure 4). Behind the implementation of this approach is the fact that the parent body of 4CE is a consortium that aims to standardize EHR data; the data had been managed beforehand in a common data format (CDM: Common Data Model), which contributed greatly to the success of the project. This method also detected that low albumin and low lymphocyte counts affect the prognosis of COVID-19.
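Integrating per-institution results by meta-analysis can be sketched with fixed-effect inverse-variance pooling. Which meta-analysis model 4CE actually used is not stated here, so the choice of a fixed-effect model, the function name, and the example numbers are assumptions for illustration only.

```python
import math

def fixed_effect_pool(estimates):
    """Pool (estimate, standard_error) pairs with inverse-variance weights.

    Each pair would come from one institution, for example a model coefficient
    or a validation metric together with its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Made-up values reported by three institutions.
site_results = [(0.80, 0.10), (0.60, 0.10), (0.70, 0.20)]
pooled_estimate, pooled_se = fixed_effect_pool(site_results)
```

Institutions with smaller standard errors receive larger weights. When results differ substantially between countries or institutions, a random-effects model may be a more appropriate choice than the fixed-effect model assumed here.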
The meta-analysis used with method (2), however, has limitations in maximizing the accuracy of the model, because the parameters of the model cannot be updated using information from multiple medical institutions. To compensate for this shortcoming, research is being conducted on method (3), the sharing of analysis parameters. In the EXAM consortium's study, prognosis (oxygen requirements) was predicted from chest X-ray images in addition to background information and laboratory values, and the model built with this method was reported to have superior accuracy at all medical institutions compared to models built using only the information held by each institution15). Because analysis parameters are shared multiple times, the risk of personal information being identified if the communications for updating parameters are intercepted is higher than in method (2), in which summarized information is shared only once. For this reason, the EXAM consortium is also studying the use of differential privacy3) to reduce the risk of identification, and trials are under way to achieve the desired results more safely in terms of personal information protection.
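One common way of applying the idea of differential privacy to shared parameters is to clip each update and add random noise before it leaves the institution. The sketch below illustrates that general pattern only; it is not the EXAM consortium's implementation, it does not compute a formal privacy budget, and the function names and parameter values are hypothetical.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale the update down so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def privatize_update(update, max_norm, noise_multiplier, rng):
    """Runs inside one institution: clip the update, then add Gaussian noise."""
    clipped = clip_update(update, max_norm)
    sigma = noise_multiplier * max_norm  # noise is scaled to the clipping bound
    return [u + rng.gauss(0.0, sigma) for u in clipped]

# Hypothetical parameter update and settings.
rng = random.Random(0)
noisy_update = privatize_update([3.0, 4.0], max_norm=1.0, noise_multiplier=0.5, rng=rng)
```

Clipping bounds how much any single update can reveal, and the noise masks what remains; a larger `noise_multiplier` gives stronger protection at the cost of model accuracy, which is the same trade-off discussed throughout this paper.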
C. Comparison between treatments and drugs
Normally, the efficacy and safety of a drug are tested in clinical trials that compare it with other treatments, because randomization and blinding eliminate the influence of factors other than the intervention on the groups being compared. In database-based treatment comparisons, however, these procedures cannot be applied, so the most important issue is ensuring comparability, mainly through adjustment for confounding. At present, there are few cases in which treatments have been compared without aggregating information on individual cases. This section therefore considers not cases from COVID-19 but whether such studies will be possible in the future without aggregating information on individual cases. Possible situations include cases in which information on both groups to be compared exists in the same database, and cases in which the database is used as a control group for a single-arm clinical study (use as an external control group). Assuming that propensity score adjustment is the most widely used method of adjusting for confounding factors, if data for both groups exist in the same database, it is theoretically possible to calculate propensity scores using logistic regression or other methods, as described in the section on constructing classification and prediction models. When matching is performed across medical institutions, however, the propensity score itself must be shared with a third party. If close attention is paid to the risk of re-identification of personal information, it is desirable to avoid sharing propensity scores with a third party who knows the parameters used to calculate them, or to adjust by other means, for example inverse probability weighting.
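One conceivable arrangement for inverse probability weighting without sharing individual propensity scores is for each institution to compute the weights locally and report only weighted sums. The sketch below is an assumption about how this could be organized, not an established procedure from the cases in this paper; the function names and toy records are hypothetical, and the propensity scores are taken as already estimated.

```python
def local_ipw_sums(records):
    """Runs inside one institution.

    records: list of (treated, outcome, propensity_score) per patient.
    Returns four weighted sums; individual scores and outcomes are not shared.
    """
    sums = {"t_wy": 0.0, "t_w": 0.0, "c_wy": 0.0, "c_w": 0.0}
    for treated, outcome, ps in records:
        if treated:
            w = 1.0 / ps            # weight for a treated patient
            sums["t_wy"] += w * outcome
            sums["t_w"] += w
        else:
            w = 1.0 / (1.0 - ps)    # weight for a control patient
            sums["c_wy"] += w * outcome
            sums["c_w"] += w
    return sums

def pooled_ipw_difference(all_sums):
    """Runs centrally: weighted mean outcome of treated minus that of controls."""
    total = {k: sum(s[k] for s in all_sums) for k in ("t_wy", "t_w", "c_wy", "c_w")}
    return total["t_wy"] / total["t_w"] - total["c_wy"] / total["c_w"]

# Toy data: (treated, outcome, propensity score).
site_a = [(1, 1, 0.5), (0, 0, 0.5)]
site_b = [(1, 0, 0.5), (0, 0, 0.5)]
difference = pooled_ipw_difference([local_ipw_sums(site_a), local_ipw_sums(site_b)])
```

Because the weighted sums are additive across institutions, the central estimate matches what would be computed from the pooled individual records, provided every institution uses the same propensity score model.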
On the other hand, when the database is used as an external control group, the information held by each medical institution serves as a control for the information obtained in the clinical research, so the treatment groups to be compared are completely separated between the medical institutions' data and the clinical research data. Under such circumstances, it appears difficult with current technology to improve the accuracy of propensity score estimation by updating parameters within each medical institution.
Outlook
From the viewpoint of convenience in analysis, it is desirable to manage information on individual cases in a unified way. From the viewpoint of personal information protection and the development of corresponding technologies, however, the integration of information is moving worldwide in the direction of conducting analysis while keeping information on individual cases confidential. N3C in the U.S., introduced in this paper, initially considered the option of not centrally managing information on individual cases, but decided against it for the time being due to concerns about the complexity of the project19). Even when information on individual cases is required, the risk of information leakage can be mitigated by a design that controls both the data itself and the processing performed in the analysis within a cloud environment. Centralized data management is also effective in terms of how quickly new information is reflected, and it is hoped that in Japan the transfer of anonymized processed information and analysis at on-site centers will gradually be replaced by analysis on virtualized servers such as those in the cloud. In addition, all of the cases introduced in this report, regardless of the method of information integration, were made possible by the similarity of data formats achieved through a CDM, and it is necessary to continue to focus on the standardization of information in Japan as well. There is a trade-off between the accuracy of the results obtained from a database and the strength of personal information protection. As the range of analyses that can be performed without collecting information on individual cases expands, it will become more important to determine the optimal data integration method according to the purpose of the analysis and the accuracy sought.
-
1) Pediatric
-
2)
-
3)The Office of Pharmaceutical Industry Research, "Trends in Privacy Protection Technologies and Implications for the Use of Medical Healthcare Data," OPIR Views and Actions No. 61 (November 2020).
-
4)
-
5)
-
6)Pfaff, Emily R., et al. Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative. Journal of the American Medical Informatics Association, 2022, 29.4: 609-618.
-
7)We received a consent waiver from the NIH IRB in compliance with the Federal Policy for the Protection of Human Subjects ('Common Rule').
-
8)
-
9)Although the information from the two electronic health record providers is actually handled in a decentralized, integrated manner, this paper, which is described from a health care provider's perspective, treats the data as an integrated database of individual cases.
-
10)Walker, Alex J., et al. Clinical coding of long COVID in English primary care: a federated analysis of 58 million patient records in situ using OpenSAFELY. British Journal of General Practice, 2021, 71.712: e806-e814.
-
11)
-
12)
-
13)Weber, Griffin M., et al. International changes in COVID-19 clinical trajectories across 315 hospitals and 6 countries: a retrospective cohort study. Journal of Medical Internet Research, 2021, 23.10: e31400.
-
14)
-
15)Dayan, Ittai, et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nature Medicine, 2021, 27.10: 1735-1743.
-
16)Bennett, Tellen D., et al. Clinical characterization and prediction of clinical severity of SARS-CoV-2 infection among US adults using data from the National COVID Cohort Collaborative. JAMA Network Open, 2021, 4.7: e2116901.
-
17)Williamson, Elizabeth J., et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature, 2020, 584.7821: 430-436.
-
18)Weber, Griffin M., et al. International comparisons of laboratory values from the 4CE collaborative to predict COVID-19 mortality. npj Digital Medicine, 2022, 5.1: 1-8.
-
19)Haendel, Melissa A., et al. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. Journal of the American Medical Informatics Association, 2021, 28.3: 427-443.
