Points of View Information flow in research use of genomic information
Hodai Okada, Senior Researcher, Pharmaceutical Industry Policy Institute
Importance of Technical Measures for Personal Information Protection
The revised Next Generation Medical Infrastructure Act was passed and enacted in the 211th Diet session, and the use of "pseudonymized medical information" will newly begin. Under the existing Act, medical information can be used for pharmaceutical research and development after anonymous processing. Pseudonymized medical information is defined as information that has been processed so that individuals cannot be identified unless it is matched with other information1). In addition, because review authorities will be able to make inquiries to confirm the authenticity of the information, it can also be used in applications for pharmaceutical approval and similar procedures. The pharmaceutical industry requested the research use of pseudonymized medical information during discussions on the revision of the law2), and it is hoped that the revision will encourage the industry to make use of medical information.
On the other hand, during discussions in the Working Group for the Next Generation Medical Infrastructure Act, where this amendment was examined, and during deliberations in the Diet, several opinions were raised regarding safety management measures for the authorized user business operators who will receive and manage pseudonymized medical information, restrictions on which departments within an organization may use the information, and stronger regulation of the re-identification of personal information. When pseudonymized medical information is used, the information management burden on users is therefore expected to increase. From the standpoint of personal information protection, in systems for the research use of genome information, which is the most sensitive biometric information, cloud and container-based virtualization technologies are being used to strengthen data governance against information leakage and unauthorized use3). These efforts not only reduce the burden of security control measures imposed on information users, but also reduce the risk of information being used for unintended purposes, benefiting both information providers and users. As cloud computing and containers become more widely used as a foundation for information management, ways of protecting information when combining and analyzing information managed independently by multiple institutions are also being considered. These technologies are also expected to be applied to biobanks in Japan and to collaboration among providers of anonymized medical information under the Next Generation Medical Infrastructure Act.
To foster an environment in which information providers can provide information with trust, it is important not only to strengthen regulations that deter leaks and unauthorized use, but also to take technical measures that prevent them in advance. The systems being considered for the research use of genome information offer many examples that can be applied to other medical information as well. As precedents that provide useful suggestions for developing a medical information coordination system in Japan, this paper first focuses on Genomics England as a case of data governance within a single institution. It then introduces, as a case of combining information managed by multiple institutions, the research framework developed by GA4GH (Global Alliance for Genomics and Health), an international cooperative organization that aims to advance medicine and medical science through the use of genome information across nations.
Trends in Medical Information Coordination
Not only genome information, but also medical information managed by medical institutions, biobanks, and similar bodies is being shared with other medical institutions and pharmaceutical companies to improve the efficiency of medical care and drug discovery research, and the Japan Pharmaceutical Manufacturers Association (JPMA) released an educational booklet for the public on the use of medical information in April this year4). The Headquarters for Promotion of Medical DX, established by the Cabinet in October 2022, also aims to use the sharing of healthcare information to improve both the quality of medical care provided to the person providing information and drug discovery research, as indicated in a document submitted by the Ministry of Health, Labor and Welfare5) (Figure 1). In other countries, partly as a result of the spread of the new coronavirus infection, medical information sharing is being promoted, mainly in Europe and the U.S. However, the World Economic Forum indicated in 2019 that 97% of the information held by medical institutions is not available6), and improving the quality of medical care through the use of medical information is an issue receiving attention not only in Japan but around the world. Improvement of the interoperability of medical information has been discussed mainly by type of information (FHIR for electronic medical records, DICOM for medical images, IHE for system integration, etc.). In these efforts, definitions of information structure and exchange rules have been developed mainly for the purpose of providing efficient medical care to the patients who provided the information, and in recent years consideration of secondary use, such as research and development, has also begun.
GA4GH, which promotes the linkage of genome information, has classified 10 problems that arise in the process of promoting secondary use, and is working to solve each problem7).
Main barriers to secondary use of medical information
- 1. Uniformity of information generation processes
- 2. Interoperability (data model and terminology)
- 3. Information management infrastructure
- 4. Access to information
- 5. Obtaining consent to use information
- 6. Privacy and security regulations
- 7. Transparency to information providers
- 8. Information sharing with other countries (experience and trust)
- 9. Incentives for information sharing and use
- 10. Obligation to share data
The technical measures for personal information protection described in this paper are effective mainly for addressing the issues of "information management infrastructure" and "privacy and security regulations." The debate over information management, along with the unification of operational standards, is a major barrier to the coordination of medical information. In the U.K., the Care.data project was launched in 2013 to consolidate information held by medical institutions into a centralized database for secondary use, but it failed to gain public support because privacy protection procedures were not properly implemented, and the project was discontinued8). Although understanding of the use of medical information has improved to some extent since then, a survey released last year by the American Medical Association showed that information providers still have significant concerns about the secondary use of medical information9). To facilitate the use of information, it is necessary to establish systems that reduce the psychological concerns of information providers and gain their trust, and privacy tech is attracting attention as a way to protect personal information by technological means.
Balancing Information Use and Protection
First, the analysis environment at Genomics England (a disease cohort for whole genome sequencing) in the U.K. is introduced as a case that aims to balance information use and personal information protection, starting from how the system for information use was established. The Five Safes framework is a framework for using sensitive information efficiently in research10). Proposed by the UK's Statistics Authority, it groups the factors that should be considered when using information into five categories. It is also taken into account in research that uses information in the health sector, particularly that of the National Health Service (NHS).
Five Safes framework
- 1. Safe People (appropriate use procedures)
- 2. Safe Projects (appropriate purposes of use)
- 3. Safe Settings (restrictions on unauthorized use)
- 4. Safe Outputs (management of analysis results)
- 5. Safe Data (management of information)
In the U.K., a Trusted Research Environment (TRE) is an environment in which sensitive information can be used for research safely and securely, allowing analysis to be performed while protecting the privacy and security of that information. TREs are designed around the Five Safes framework and are adopted by Genomics England and OpenSAFELY (an analysis infrastructure for NHS patient records), both of which involve the NHS and the Department of Health.
Genomics England provides TREs in the cloud to users who conduct analyses, and users basically handle information within this environment11). In addition to the Five Safes above, Genomics England manages information with seven elements in mind, adding two research-specific items. Figure 2 shows an overview of the TRE and how the Five Safes framework is applied to it. The TRE is accessible only by authorized users (Safe People) via virtual desktops (Safe Settings), and the only information accessible to users is de-identified information (Safe Data). The content of the research is reviewed by the custodian of the information or by a third party to ensure validity and transparency (Safe Projects). Only analytical results can be taken out of the TRE, and they can be transferred only after they have been verified (Safe Outputs). Genomics England differs from other projects in that it uses the public cloud to give access to large-scale information and returns a portion of the results as a report to the information provider via the attending physician, with consideration given to the protection of personal information in each process (Safe Computing, Safe Return). Best practices for each of the Five Safes have been published by the relevant organizations of the National Institute in the UK, and excerpts of the features considered important are shown in Table 1.
Using a TRE as described above can be very effective for analyses within a single institution, but the strict restrictions on taking information outside the TRE make it difficult to conduct analyses that combine information managed by multiple institutions in order to improve accuracy or verify results. To address this problem, Genomics England has conducted a demonstration study and established a system that analyzes medical information in combination with information held by the University of Cambridge's research center12). The method of conducting analysis without taking information out of each institution is called the Federated Approach: analysis queries are shared with each institution, and the results executed by each institution are aggregated to obtain integrated results (Figure 3). In this project, due to time constraints, the data center for integration was located in Genomics England's TRE, but the report notes that it would ideally be placed in an independent environment. As described above, the UK is beginning to adopt a new structure for information use, particularly in the field of genomics, where data is becoming increasingly standardized.
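The Federated Approach described above can be illustrated with a minimal sketch. It is not the Genomics England or University of Cambridge implementation; the names `Record`, `Institution`, `count_query`, and `federated_run` are hypothetical, and the sketch assumes that each institution returns only aggregate counts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    """Hypothetical per-participant record; field names are illustrative only."""
    has_variant: bool
    has_disease: bool


# A query maps an institution's local records to aggregate counts.
Query = Callable[[List[Record]], Dict[str, int]]


def count_query(records: List[Record]) -> Dict[str, int]:
    """Example query: cohort size and number of carriers who have the disease."""
    return {
        "n": len(records),
        "variant_and_disease": sum(
            1 for r in records if r.has_variant and r.has_disease
        ),
    }


class Institution:
    """Holds records locally and only ever returns query results."""

    def __init__(self, name: str, records: List[Record]) -> None:
        self.name = name
        self._records = records  # assumption: never exported from the institution

    def run(self, query: Query) -> Dict[str, int]:
        return query(self._records)


def federated_run(institutions: List[Institution], query: Query) -> Dict[str, int]:
    """Send the same query to every institution and sum the returned counts."""
    totals: Dict[str, int] = {}
    for inst in institutions:
        for key, value in inst.run(query).items():
            totals[key] = totals.get(key, 0) + value
    return totals


site_a = Institution("A", [Record(True, True), Record(True, False)])
site_b = Institution("B", [Record(False, True), Record(True, True)])
combined = federated_run([site_a, site_b], count_query)
```

In this sketch the integrating step sees only the counts returned by each site, which mirrors the point that the records themselves stay inside each institution.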
System of Information Use
The Genomics England case study introduced an information use system based on a cloud environment and the Federated Approach, but the method of sharing medical information most widely used at present is to copy and distribute the information itself via a file server or electronic storage media. The guideline for the Next Generation Medical Infrastructure Act13) states that "In addition to telecommunication and portable storage media, on-site viewing of anonymized medical information is also assumed as a method for providing anonymized medical information to business operators handling anonymized medical information." This means that when a pharmaceutical company conducts research, the main method assumed is for anonymized medical information to be stored in storage managed by the pharmaceutical company (the business operator handling anonymized medical information). On the other hand, as the Genomics England case study shows, systems for information use are being transformed, mainly in Europe and the United States, by advances in cloud technology and container-based virtualization technology. A paper published in 2021 by GA4GH, an organization established to promote international collaboration on genomic and health information, introduces three approaches to information use14) (Figure 4).
Method (1) centralizes information on a centrally located file server and then transfers the aggregated information to the user's site. This is also the method envisioned in the above-mentioned Next Generation Medical Infrastructure Act, in which an authorized medical information handling contractor is in charge of aggregating information. In method (2), users access aggregated information through an analysis environment prepared in the cloud. This method resembles the "on-site viewing method" in the Act's guidelines in that medical information prior to analysis is not stored in storage managed by the pharmaceutical company. In method (3), analysis queries provided by users are executed within each institution and only the analysis results are aggregated, without aggregating the medical information before analysis. Because users can view medical information before analysis under methods (1) and (2), method (3) has begun to attract attention in recent years as a way to strengthen the protection of personal information further by technological means. The aforementioned TRE of Genomics England was constructed using method (2), and the joint project with the University of Cambridge falls under method (3). Many of the concerns about the use of pseudonymized medical information introduced at the beginning of this article stem from method (1), which was adopted for the use of anonymized medical information. Methods (2) and (3) make it unnecessary to duplicate information, and information is not stored in storage managed by the entity that uses the pseudonymized medical information. These methods are superior to method (1) not only in terms of preventing information leakage, but also in terms of the immediacy and traceability of information.
As seen in the Genomics England case study and the COVID-19 case study introduced in Policy Research Institute News No. 68, the cloud-based method described in (2) is already being used in a variety of settings, mainly in Europe and the United States, and in Japan the Tohoku Medical Megabank Organization has also begun to provide access to information from remote locations15). By contrast, because the decentralized information use system of (3) does not allow simultaneous access to the information managed by all institutions, many issues remain to be resolved at this point, such as improving the accuracy of analysis results and establishing a system for conducting analysis.
Trends in Genome Information Linkage
The field in which the new information use system is closest to practical application is currently genome research. It is estimated that the genome information used for genome-wide association studies (GWAS) is, in terms of ancestry, dominated by people of European descent (78% or more), while people of African descent account for only about 2.4%18). As a result, it has been reported that the accuracy of the Polygenic Risk Score (PRS), which predicts the risk of disease onset, was approximately 4.5 times lower in people of African descent than in people of European descent19). Such an imbalance in information may bias estimates of efficacy in new drug development and estimates of the risk of adverse drug reactions. Solving this problem requires analysis using information derived from diverse ethnic groups, but because the distribution of ethnic groups in each country is naturally skewed, international information coordination is essential. However, many countries place strict restrictions on the cross-border transfer of genome information, and demand is extremely high for ways to obtain highly accurate analysis results without sending personal information out of the country. In the field of genome research, several projects are putting the information use systems introduced in the previous section into practical use, and some of these efforts are introduced below.
Consortiums for Efficient Information Linkage
Although this paper focuses on how information is used, various issues need to be resolved even before information is used, such as obtaining consent from information providers and standardizing data structures. Several organizations are working to solve these problems, including OHDSI20), CDISC21), and GO FAIR22). This paper introduces GA4GH, which counts many Japanese research institutes among its members and is making influential efforts in the coordination of genome information, as described in the section on the use of information. In Japan, its members include the Tohoku Medical Megabank Organization and the Japan Agency for Medical Research and Development (AMED), which aims to establish a large-scale platform for the integrated utilization of genome information (CANNDs)23). The aforementioned project between Genomics England and the University of Cambridge also uses the standard process developed by GA4GH. In Japan, it was announced that RIKEN, as a GA4GH project, conducted variant evaluation with the QIMR Berghofer Institute for Medical Research in Australia using genomic information stored by both organizations, without exchanging the genomic information itself24). This was achieved by exchanging analysis queries between the two organizations and was carried out using container-type virtualization technology (method (3) of the information use approaches).
Examples of Genome Information Use
The technologies and operational policies developed by GA4GH are beginning to be adopted by genome consortia in various countries, and in Japan public organizations such as AMED and RIKEN participate in GA4GH. This section introduces some advanced examples of information collaboration.
ELIXIR
ELIXIR, an intergovernmental organization for making biological information available across Europe, is a consortium of research institutes in 23 European countries, and in collaboration with GA4GH, is conducting a project on Beacon technology to enable cross-search of genome information managed by participating institutions. Beacon technology is an API that provides information on the presence or absence of specific mutations in the genome information managed by each organization, and is one of the Federated Approaches that enable searches across information from all participating organizations without sharing the information itself (Figure 5). As of 2021, 42 institutions are using Beacon, enabling searches from more than 1 million samples25). The project is still ongoing, and it is becoming possible to use Beacon to add filter conditions other than genomic information and to check the conditions of use when accessing the actual information. Currently, Beacon technology is also used in the database of the National Bioscience Database Center (NBDC) in Japan. This technology can be applied not only to genomic variations, but also to diseases and drugs used, and it is thought to be applicable to cross-search among biobanks and providers of anonymized medical information under the Next Generation Medical Infrastructure Act in Japan.
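The idea behind Beacon, answering only whether a specific variant is present at each organization, can be sketched as follows. This is an illustration of the concept only: it does not reproduce the actual GA4GH Beacon request or response schema, and `LocalBeacon`, `cross_search`, and the tuple layout of `Variant` are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical variant key: (chromosome, position, reference base, alternate base).
Variant = Tuple[str, int, str, str]


class LocalBeacon:
    """Answers presence/absence questions about variants held by one organization."""

    def __init__(self, name: str, variants: Iterable[Variant]) -> None:
        self.name = name
        self._variants = set(variants)  # assumption: stays inside the organization

    def exists(self, variant: Variant) -> bool:
        # Only a yes/no answer leaves the organization, never the genomes.
        return variant in self._variants


def cross_search(beacons: List[LocalBeacon], variant: Variant) -> Dict[str, bool]:
    """Ask every participating beacon the same question and collect the answers."""
    return {beacon.name: beacon.exists(variant) for beacon in beacons}


beacons = [
    LocalBeacon("Org1", [("17", 43045712, "G", "A")]),
    LocalBeacon("Org2", [("13", 32316461, "C", "T")]),
]
answers = cross_search(beacons, ("17", 43045712, "G", "A"))
```

A user who receives a positive answer would then go through that organization's own access procedure to request the underlying information, consistent with the text's point about checking conditions of use.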
CanDIG
CanDIG is a platform designed for the use of medical information in Canada, and at present five projects that manage genome information in Canada participate in it. Canada is a federal country consisting of several provinces, and because the privacy protection of medical information is subject to the laws of each province, it is difficult to aggregate individual-level information even within the country. To address this problem, CanDIG has adopted the decentralized information use approach developed by GA4GH. It avoids establishing a centralized system with a management organization so that each participating institution can fully control the use of the information it holds26). Japanese providers of anonymized medical information also consist of multiple independent entities, for which it may be difficult to establish a centralized organization to manage a project, and these efforts may serve as a reference for future operation. Such decentralized information management has been attracting attention in recent years as a way to avoid the monopolization of information by platforms and similar entities. However, the absence of a management organization increases the number of matters that must be considered. CanDIG identifies the following points to keep in mind when adopting such an information infrastructure.
Authentication (AuthN)
Authentication refers to confirming the identity of the individual accessing information. In centralized information management, identity verification is generally handled centrally by the organization that manages the aggregated information. In decentralized management, by contrast, each institution holding information would have to verify identities itself, so the work would be duplicated across multiple organizations. GA4GH has therefore introduced a mechanism called GA4GH Passport, in which a user's ID information can be used by all participating organizations through OpenID Connect (a method of confirming the identity of the user based on authentication by an authorization server). This authentication method is also used by CanDIG.
Authorization (AuthZ)
Authorization refers to determining which information authenticated individuals and organizations may view and which analyses they may perform. With decentralized information management, it is often difficult to delegate decisions about access to information to outside bodies, so the final decision on authorization is made by each institution that manages the information, taken together with the decision of the project's Data Access Committee.
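The two-level decision described above can be sketched as a simple check. This is only an illustration under a stated assumption: the sketch treats access as granted when both the project's Data Access Committee and the institution holding the data have approved the user, which is one possible reading of how the two decisions combine. `InstitutionPolicy` and `is_authorized` are hypothetical names, not part of CanDIG or GA4GH software.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class InstitutionPolicy:
    """Hypothetical local policy kept by one institution that holds data."""
    name: str
    approved_users: Set[str] = field(default_factory=set)


def is_authorized(user: str, dac_approved: Set[str], policy: InstitutionPolicy) -> bool:
    """Assumption: both the project-level DAC and the local institution must approve."""
    return user in dac_approved and user in policy.approved_users


dac_approved = {"alice", "bob"}
site = InstitutionPolicy("Site A", approved_users={"alice"})
alice_ok = is_authorized("alice", dac_approved, site)
bob_ok = is_authorized("bob", dac_approved, site)
```

Keeping the local check separate from the committee's list reflects the point that the final decision remains with the institution managing the information.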
Analysis Flow
If de-centralized information management is used, the analysis must be performed at each agency, which also requires control over how the analysis is performed. The simplest method would be for users to send queries to all institutions individually to obtain results, but when results are consolidated, it becomes possible to identify which institution each result originated from, increasing the risk of information leakage. In addition, if a specific institution is responsible for the integration of results, the benefits of adopting a decentralized information management system will be diminished. In light of these issues, CanDIG uses a peer-to-peer approach, which is a middle ground between these two approaches. The organization that first requests an analysis makes a request to all other organizations for analysis, and the results are integrated to enhance the efficiency of analysis and the protection of personal information.
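The peer-to-peer flow described above can be sketched as follows: whichever node receives the user's request forwards it to all other nodes and returns only the integrated result. The `Node` class, `connect` helper, and the example query are hypothetical and do not reflect CanDIG's actual code; the sketch also assumes the query returns a single number that can be summed.

```python
from typing import Callable, List, Sequence

# A query maps locally held values to one aggregate number.
Query = Callable[[Sequence[int]], int]


class Node:
    """One participating institution in a peer-to-peer network (hypothetical)."""

    def __init__(self, name: str, values: Sequence[int]) -> None:
        self.name = name
        self._values = list(values)  # assumption: never leaves this node
        self.peers: List["Node"] = []

    def run_local(self, query: Query) -> int:
        return query(self._values)

    def request_analysis(self, query: Query) -> int:
        """Run the query here and on every peer, then return only the total."""
        partials = [self.run_local(query)]
        partials += [peer.run_local(query) for peer in self.peers]
        # The caller sees the integrated result, not the per-institution values.
        return sum(partials)


def connect(nodes: List[Node]) -> None:
    """Make every node a peer of every other node; there is no central hub."""
    for node in nodes:
        node.peers = [other for other in nodes if other is not node]


def count_over_50(values: Sequence[int]) -> int:
    return sum(1 for v in values if v > 50)


nodes = [Node("A", [40, 60, 70]), Node("B", [55]), Node("C", [10, 20])]
connect(nodes)
total_via_a = nodes[0].request_analysis(count_over_50)
```

Because any node can play the requesting role, no single institution is permanently responsible for integrating results, which is the compromise the text describes.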
Sharing of results at each institution
As mentioned in the analysis flow section, when analysis is performed within each institution, the risk of information leakage increases if the name of the institution remains linked to its results when the results are integrated, so processes that separate results from institution names are needed. At present, CanDIG does not employ such processes because the number of participating institutions is small and trust among them is strong, but their introduction is expected to become necessary as the program evolves.
Access control to information
As information is distributed across agencies, access history is also managed by each agency, resulting in siloed access information. At CanDIG, at present, each organization manages its own access history, and the entire history is managed through collaboration among the organizations. However, as the volume of information and the number of participating institutions grows, this type of management becomes untenable, and CanDIG is considering moving to a method that would allow for the management of information access history on a global basis.
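The current arrangement, in which each organization keeps its own access history and the overall history is assembled through collaboration, can be sketched as a merge of per-institution logs. `AccessEvent` and `merged_history` are hypothetical names, and the sketch assumes each local log is already sorted by an ISO 8601 timestamp in the same time zone.

```python
import heapq
from typing import List, NamedTuple


class AccessEvent(NamedTuple):
    """Hypothetical access-log entry kept by one institution."""
    timestamp: str  # ISO 8601 in one time zone, so string order matches time order
    institution: str
    user: str
    dataset: str


def merged_history(*logs: List[AccessEvent]) -> List[AccessEvent]:
    """Combine per-institution logs (each already sorted) into one timeline."""
    return list(heapq.merge(*logs, key=lambda event: event.timestamp))


log_a = [
    AccessEvent("2023-05-01T09:00:00", "A", "alice", "cohort-1"),
    AccessEvent("2023-05-03T14:30:00", "A", "bob", "cohort-1"),
]
log_b = [
    AccessEvent("2023-05-02T11:15:00", "B", "alice", "cohort-2"),
]
timeline = merged_history(log_a, log_b)
```

Merging on demand like this requires every institution to supply its log each time, which hints at why the text expects such management to become difficult as the number of institutions grows.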
Other Initiatives
The standards developed by GA4GH are beginning to be used not only in Europe and Canada, but also in other regions and disease areas. The National Institutes of Health (NIH) in the United States and the Wellcome Trust in the United Kingdom, in collaboration with the African Academy of Sciences, have established a consortium called H3Africa to conduct large-scale genomic studies on African populations in an attempt to address the information imbalance. Other GA4GH standards are also being used in the CINECA project, which aims to exchange information across continents (Europe, Canada, and Africa) and in MATCHmaker, a genome platform for rare diseases, as previously described.
Summary
This paper has introduced technical measures being considered for adoption in information linkage, focusing on genomic information, which is treated as the most sensitive personal information. Because the methods introduced here were developed to protect personal information, their first priority is to protect the provider of the information, but they also make the management of personal information easier for the pharmaceutical companies that use it. We believe that the pharmaceutical industry needs to promote the introduction of such systems for the secondary use of medical information. In particular, the structure of Genomics England, which combines the Five Safes framework with a cloud environment, may serve as a reference for the stronger information management required of certified user businesses under the revised Next Generation Medical Infrastructure Act. Among biobanks, the Tohoku Medical Megabank Organization has already made it possible to access information from remote locations and to use supercomputers, and other biobanks are expected to continue developing their analysis environments as well. Furthermore, because sensitive information cannot be provided outside an institution, integrating information is difficult, and situations similar to those described in this paper are arising among biobanks in Japan and among providers of anonymized medical information under the Next Generation Medical Infrastructure Act, where it is difficult for users to know which entity holds the information they are seeking. As for biobanks, AMED has announced that it will build a cloud-based analysis environment as part of its plan to make large-scale genome information searchable in Japan27). How access to information is structured will need to be considered according to how far the detailed information held by each biobank can be analyzed within the integrated search environment.
With regard to the Next Generation Medical Infrastructure Act, as the number of providers creating pseudonymized medical information increases, information collaboration among those providers is expected to become necessary. The structure of information linkage in the genome field, which is being studied carefully because it handles sensitive information, offers many viewpoints that can be applied to the use of medical information in Japan. It is hoped that discussion will progress not only on strengthening legal measures, but also on using new technologies to strengthen information protection and promote information use.
- 1) Number of reports and countries from which data was obtained
- 2)
- 3)
- 4)
- 5)
- 6)
- 7) Rehm, Heidi L., et al. GA4GH: International policies and standards for data sharing across genomic research and healthcare.
- 8) Limb, Matthew. Controversial database of medical records is scrapped over security concerns. BMJ: British Medical Journal, 2016, 354.
- 9)
- 10) Ritchie, Felix. Secure access to confidential microdata: four years of the Virtual Microdata Laboratory. Economic & Labour Market Review, 2008, 2: 29-34.
- 11) UK Health Data Research Alliance, Building Trusted Research Environments - Principles and Best Practices; Towards TRE ecosystems, 2021
- 12) Data and Analytics Research Environments UK, Multi-party trusted research environment federation: Establishing infrastructure for secure analysis across different clinical-genomic datasets, 2022
- 13) Cabinet Office, Guidelines for the Act on Anonymously Processed Medical Information for Research and Development in the Medical Field, 2023
- 14) Thorogood, Adrian, et al. International federation of genomic medicine databases using GA4GH standards. Cell Genomics, 2021, 1.2: 100032.
- 15)
- 16) Beyan, Oya, et al. Distributed analytics on sensitive medical data: the personal health train. Data Intelligence, 2020, 2.1-2: 96-107.
- 17) The World Economic Forum, Federated Data Systems: Balancing Innovation and Trust in the Use of Sensitive Data, 2019
- 18) Atutornu, Jerome, et al. Towards equitable and trustworthy genomics research. EBioMedicine, 2022, 76: 103879.
- 19) Martin, Alicia R., et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature Genetics, 2019, 51.4: 584-591.
- 20)
- 21)
- 22)
- 23)
- 24) Casaletto, James, et al. Federated analysis of BRCA1 and BRCA2 variation in a Japanese cohort. Cell Genomics, 2022, 2.3: 100109.
- 25) Harrow, Jennifer, et al. ELIXIR-EXCELERATE: establishing Europe's data infrastructure for the life science research of the future. EMBO Journal, 2021, 40.6: e107409.
- 26) Dursi, L. Jonathan, et al. CanDIG: Federated network across Canada for multi-omic and health data discovery and analysis. 100033.
- 27)
