Topics: Trends in Privacy Protection Technology and Implications for the Utilization of Medical and Healthcare Data
Takayuki Sasaki, Senior Researcher, Pharmaceutical Industry Policy Institute, Japan
Introduction
With the spread of IoT, the evolution of AI (Artificial Intelligence), and the advancement of communication technology, the amount of data acquired and utilized in daily life is exploding, and we are entering an era in which value can be provided in line with each individual's preferences and circumstances based on personal data. However, the wide variety of data circulating in society includes a great deal of personal data, and ensuring privacy is a prerequisite for its use1). In particular, medical and health-related information is highly sensitive, and everyone has, to varying degrees, both concerns about information leakage and expectations for its protection.
During the COVID-19 pandemic that began in 2020, as countries declared states of emergency, a variety of data-based public health measures were developed: in addition to data such as the number of infected patients and hospital beds, data collected from cell phone location information, communication applications, and transportation IC cards were put to use, and "contact tracing applications" that collect and provide information on contacts with infected persons spread widely2). Recently, "search trends" data on COVID-19 symptoms in the U.S. have also been made available on the Google Cloud Platform3). However, concerns have reemerged over whether such information contains personal information, and whether it is acceptable to restrict an individual's right to control their own information when doing so serves the public interest.
In general, the value of data can be further enhanced through integrated analysis. For example, it is said that in the U.S., more than $300 billion in value could be created annually by sharing data among hospitals, caregivers, pharmaceutical companies, and others4), 5). It has also been pointed out that the number of attributes (p), rather than the number of records (n), is what matters for obtaining high-quality findings from big data6). Against this background, integrated analysis of medical information, PHRs, IoT data, and medical and health checkup data is being pursued in the utilization of big data in the medical and health fields7). However, the "mosaic effect," in which new privacy disclosures occur when such data sets are overlaid, even if the privacy of each individual data set has been taken into consideration, has been pointed out as an issue for the enhancement of open data8).
Providers of devices, apps, and infrastructure naturally give careful consideration to privacy protection and consent acquisition and apply appropriate technologies, but the characteristics of these technologies have not received much attention. A better understanding of privacy protection technologies is one of the necessary steps toward alleviating society's concerns about the use of personal data. This paper outlines the technologies that have been developed and implemented for the purpose of privacy protection9).
Examples of privacy-preserving technologies
Some readers may have heard of Privacy Preserving Data Mining (PPDM), a set of techniques for gaining useful insights from data while protecting privacy. PPDM is a generic term for technologies that analyze data while protecting the privacy of the individuals whose data are analyzed10). There are various such techniques, but from a data perspective they can be broadly classified into three categories (Figure 1): privacy protection for input data (e.g., anonymized data provision or anonymization of provided raw data), privacy protection at the computation stage (e.g., secret computation), and privacy protection for output data (e.g., differential privacy).
Most of the privacy protection techniques implemented to date rely on anonymization, which provides a high degree of privacy protection but reduces the quality of data from the viewpoint of integrated analysis. This paper therefore introduces privacy protection at the computation stage and for output data from the perspectives of cryptography and machine learning; these are expected to become "next-generation privacy protection technologies" that both utilize data and protect privacy.
The premise is that there is a trade-off between the confidentiality of the information contained in data (security) and the knowledge that can be obtained from the nature and characteristics of the data (usefulness)8). In other words, privacy protection technology must output data with "higher usefulness" while maintaining "appropriate security (not complete security)" within this trade-off.
Secret computation
Secret computation is a cryptographic technique that performs various kinds of processing on data while it remains encrypted, without ever decrypting it11). By utilizing this technology, it is said, data can be combined and analyzed without disclosing any of the original data. In the biotechnology field, it is expected to be applied to sequence analyses such as genome homology evaluation and alignment, and to disease risk prediction by logistic regression analysis; realistic performance has already been achieved in DNA edit distance calculation12).
Typical schemes for secret computation include the "secret sharing scheme" and "homomorphic encryption" (Figure 2). In the secret sharing scheme, each organization's confidential data is split into multiple shares that are meaningless on their own (secret sharing); the corresponding data managers combine and analyze the distributed shares without the confidential data ever being disclosed, and the analyst collects and reconstructs each manager's results to obtain the final analysis result. In homomorphic encryption, by contrast, confidential data is encrypted and then combined and processed in encrypted form; after the encrypted processing results are obtained, the analyst decrypts them using a key.
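To make the secret sharing flow concrete, the following is a minimal sketch in Python of additive secret sharing, the simplest variant of the scheme described above. The three-party setup, hospital names, and patient counts are hypothetical illustrations, not a production protocol; practical systems use threshold schemes (e.g., Shamir's secret sharing) and secure communication channels.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret: int, n_parties: int = 3) -> list:
    """Split a secret into n additive shares; any n-1 shares alone reveal nothing."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Two hospitals each secret-share a confidential patient count
# among three independent data managers.
hospital_a, hospital_b = 120, 85
shares_a = share(hospital_a)
shares_b = share(hospital_b)

# Each manager adds only the shares it holds, never seeing a raw value.
partial_sums = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]

# The analyst collects the partial results and reconstructs only the total.
total = sum(partial_sums) % PRIME
print(total)  # 205: the combined count, with neither input disclosed
```

Addition is computed here without any decryption step; more elaborate protocols extend the same idea to multiplication and thus to full statistical analyses.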
Differential privacy
Differential privacy is a method of protecting privacy by intentionally adding statistical noise to a data set or to results computed from it. The technique is already widely used: for example, Apple has utilized it since iOS 10 (released in 2016) for QuickType and emoji suggestions, lookup hints, and analysis of Health app usage types13). Differential privacy is also used in COVID-19 pandemic responses, for example in the "search trends" data on COVID-19 symptoms mentioned above and in the Community Mobility Reports, also provided by Google14).
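As a minimal sketch of how such noise addition works, the following Python snippet applies the Laplace mechanism, the most basic realization of differential privacy, to a counting query. The symptom-score data, threshold, and epsilon value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, threshold, epsilon=1.0):
    """Count entries above a threshold, adding Laplace noise for epsilon-DP.
    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1/epsilon."""
    true_count = int(np.sum(np.asarray(values) > threshold))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many users logged a symptom score above 7?
scores = [3, 8, 9, 2, 7, 10, 5]
print(dp_count(scores, threshold=7, epsilon=0.5))  # true count is 3, plus noise
```

A smaller epsilon means more noise and stronger privacy but lower usefulness, which is exactly the security-usefulness trade-off noted earlier.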
As shown in Figure 3, differential privacy is applied to three main targets in machine learning15): (i) anonymization of the raw data itself, (ii) anonymization of a model trained on raw data (generation of an anonymous model), and (iii) machine learning that satisfies differential privacy during training (e.g., adding noise to the outputs of a neural network's activation functions). The Apple example above corresponds to anonymization of raw data.
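For case (iii), the example in the text adds noise to activation outputs; a closely related and widely used formulation instead adds noise to clipped gradients during training (the core step of differentially private SGD). The sketch below illustrates that gradient-noise variant with NumPy; the clipping norm, noise multiplier, and synthetic gradients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1):
    """One gradient step with per-example clipping and Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Clipping bounds how much any single example can influence the model.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise calibrated to the clipping norm masks individual contributions.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return weights - lr * (mean_grad + noise)

w = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(8)]  # stand-in per-example gradients
w = dp_sgd_step(w, grads)
```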
Solutions utilizing differential privacy are prominently provided by the U.S. data platformers mentioned above, and there appear to be few examples from Japanese companies. Although the data is somewhat old, almost all patents for differential privacy filed between 2006 and 2015 came from the U.S.; Japan's focus on secret computation may also have had an influence (Figure 4)16).
Federated learning
Federated learning, on the other hand, has attracted attention as a technology applied when the goal is not privacy-preserving data analysis per se, but improving the accuracy of machine learning based on data spanning multiple organizations. It is a method based on deep learning, whose framework was announced by Google in 201717). A shared "mother model" is distributed to local nodes, each node trains the model on its own data, and the mother model is then updated by returning only the difference between the model before and after each local update (Figure 5).
This method is unique in that data need not be shared or transferred between devices and always remains in the local environment. TensorFlow and PyTorch, the leading machine learning frameworks, already support federated learning18), 19). A similar method is "distributed learning," but whereas distributed learning assumes that the nodes work on an identical data set, federated learning assumes that each node holds a different data set; the distinctive feature of federated learning is that it can make use of a variety of data.
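The following is a minimal sketch of the federated averaging idea described above, using a linear model, synthetic data, and plain NumPy. The number of nodes, learning rate, and round count are illustrative assumptions; production frameworks such as TensorFlow Federated add secure aggregation and handle heterogeneous devices.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Train on one node's private data; return only the weight difference."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w - weights  # the data itself never leaves the node

# Three nodes (e.g., hospitals) each hold data that stays local.
d = 3
true_w = np.array([1.0, -2.0, 0.5])
local_data = []
for _ in range(3):
    X = rng.normal(size=(50, d))
    local_data.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

mother = np.zeros(d)  # the shared "mother model" held by the server
for _ in range(20):
    deltas = [local_update(mother, X, y) for X, y in local_data]
    mother += np.mean(deltas, axis=0)  # aggregate only the differences

print(np.round(mother, 2))  # approaches [ 1. -2.  0.5]
```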
Federated learning is particularly suited to fields such as finance, medicine, and healthcare, where highly sensitive information is used to train AI. For example, Owkin, a U.S. startup that provides a digital research platform for the medical field, uses federated learning as a method of enhancing algorithms without moving data20). This year, a method for analyzing electronic medical records (EMRs) while protecting privacy through federated learning was also investigated: in the task of distinguishing healthy brain tissue from cancerous tissue in MRI images, federated learning was reported to perform as well as the traditional centralized data sharing model21).
In addition, "Decentralized Federated Learning," a completely decentralized form of federated learning that departs from the previous mechanism centered on updating a central common model, has begun to be considered22). This approach moves away from so-called "centralization," is compatible with blockchain technology, and is likely to attract further attention as a form of federated learning that improves tamper resistance and transparency.
In Japan, a similar technique, "privacy-preserving machine learning," exists and is being implemented in the financial industry. For example, since it is difficult for a single financial institution to prepare a sufficient amount of training data for an automatic fraud detection system using deep learning, there are efforts to improve detection accuracy by integrating the results of learning based on data from multiple banks. As an example, the National Institute of Information and Communications Technology (NICT), in collaboration with five financial institutions including MUFG Bank and Sumitomo Mitsui Trust Bank, began a demonstration test in 2020 for detecting fraudulent money transfers using privacy-preserving deep learning technology (DeepProtect) (Figure 6)23). In this demonstration test, the model can be updated using the parameters (weights) of the learning model encrypted with homomorphic encryption.
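DeepProtect's implementation is not described in detail here, but the following toy sketch illustrates the underlying idea: with an additively homomorphic scheme such as Paillier, a server can aggregate encrypted weight updates without decrypting any individual contribution. The parameters are deliberately tiny and insecure; real deployments use key sizes of 2048 bits or more and vetted cryptographic libraries, and the bank updates shown are hypothetical fixed-point integers.

```python
import math
import random

# --- Toy Paillier cryptosystem (illustration only, NOT secure) ---
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
# mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    return ((pow(c, lam, n_sq) - 1) // n) * mu % n

# Each bank encrypts its local weight update (a fixed-point integer).
updates = [12, 7, 5]
ciphertexts = [encrypt(u) for u in updates]

# Multiplying ciphertexts adds the plaintexts, so the server can
# aggregate the updates without seeing any single bank's value.
aggregate = 1
for c in ciphertexts:
    aggregate = (aggregate * c) % n_sq

assert decrypt(aggregate) == sum(updates)  # 24
```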
Issues Related to Privacy Protection Technology
Although the privacy protection technologies described above are promising because they can both protect privacy and utilize data, there are some issues to be addressed.
For example, encryption technology always carries the risk that confidential information will leak if an attacker deciphers the ciphertext. In the case of secret sharing, the original confidential data could be restored if data managers collude. In addition, since homomorphic encryption uses encryption keys, secure key management (e.g., keeping ciphertexts and keys separate) is necessary.
There are also issues of computational power and processing speed. For example, it has been pointed out that a common issue in secret computation is that processing is several tens to several thousand times slower than normal processing because of the large amount of communication involved24). In federated learning as well, each local node may need enough computing power to perform large-scale machine learning. Even if the computational power of local nodes improves as edge computing advances, it is not necessarily efficient to always push machine learning, especially deep learning with its heavy computational demands, to the edge.
In addition, in secret computation and federated learning, the party executing the algorithm cannot inspect the quality of the training data, so the quality of the data set depends on quality control at each edge node. Fairness, accountability, and transparency (FAT) are becoming increasingly important for AI and data, and it will be necessary to visualize which data sets have contributed to improving AI quality, and to what extent25).
Furthermore, as a legal issue, there is concern that the third-party provision restrictions in Article 23 of the Personal Information Protection Law may be an obstacle to the use of secret computation technology, since encrypted personal data would be provided to a third party. Under the current law, "personal information remains personal information even if encrypted"; it has been argued that, after sorting out in what sense this theory holds, it is necessary to examine how data exchange based on secret computation technology should be treated under the rules on personal data and to summarize the results in a proposal26). To utilize such advanced privacy protection technologies, a broad discussion, including possible revision of the related legal systems, will be needed.
Conclusion
As described above, privacy protection technologies have made rapid progress over the past several years, partly driven by growing momentum around the protection of personal information, and many have moved from the research stage to practical use. On the other hand, it is not necessarily the case that the concepts behind these technologies and their advantages and disadvantages are widely understood by the general public, or that citizens actually feel that privacy protection has been strengthened.
In addition, some privacy protection technologies face issues of computational resources and communication speed, and their lack of general-purpose applicability keeps development costs high. Such costs will ultimately be passed on to users (individuals, companies, governments, etc.), and it is important to remember that privacy protection also has a cost-benefit aspect.
Furthermore, even with these privacy protection technologies, issues of transparency remain, such as the fact that individuals have no way to check whether the update data sent from their devices contains personal data. As mentioned above, there are also legal issues, such as the extent to which transmitted calculation results, model differences, or noise-added or encrypted personal information should be covered by the Personal Information Protection Law. In addition, given the trade-off between the confidentiality of information and the acquisition of knowledge noted above, it will be necessary to consider how to set the uses of information according to its confidentiality or, conversely, how to ensure confidentiality according to the purpose of use.
The Commerce and Information Policy Bureau of the Ministry of Economy, Trade and Industry (METI) has also noted the need to improve the environment for implementing such technologies: its "New Approaches to Medium-Term Challenges Facing IoT Progress" includes the policy direction of "studying the development of a data coordination environment utilizing secret sharing and secret computation technologies." However, in meetings that include actual users of data, such as the Ministry of Health, Labour and Welfare's "Consortium for Accelerating AI Development in the Health and Medical Care Field," while the importance of privacy protection is raised, there are as yet no concrete plans for how to solve the issue from a technological perspective or how to deepen public understanding27). It seems necessary for study bodies at various levels to have those involved in the utilization of medical and health data (not only researchers but also policy makers, medical professionals, and data-using companies and organizations) learn about these technologies, take an interest in them, and pool their wisdom.
The Digital Agency, scheduled to be established in 2021, will need to discuss not only "defensive digitalization," as typified by so-called digitization, but also "offensive digitalization" aimed at data utilization, in order to promote industry. In particular, to utilize medical and health data for the creation of innovative new drugs, the pharmaceutical industry should understand the usefulness of privacy protection technologies, plan and participate in demonstration experiments, evaluate the methods, and propose uses for these technologies.
Privacy protection technologies are evolving rapidly, and they may advance well beyond what was envisioned when data utilization plans were first drawn up. To reflect this "speed of technology" in policy, it is important to involve many technology experts from an early stage, and to revise roadmaps and research plans flexibly by promptly incorporating technologies and social changes from other industries such as finance and transportation.
References
1) Japan Business Federation (Keidanren), "Society 5.0: Creating the Future Together" (November 13, 2018).
2) "Data Utilization in the COVID-19 Epidemic: A Case Study of Contact Tracing Apps," Policy Research Institute News No. 60.
3) "COVID-19 Search Trends symptoms dataset," Google Cloud Platform (viewed September 12, 2020).
4) McKinsey Global Institute, "Big data: The next frontier for innovation, competition, and productivity" (May 2011).
5) McKinsey Global Institute, "The 'big data' revolution in healthcare: Accelerating value and innovation" (January 2013).
6) Japan Omics Society, "Expectations for artificial intelligence in the era of big data healthcare" (November 29, 2019).
7) "Report of the Study Group on Big Data Utilization in Medical and Health Care Fields vol. 5," Pharmaceutical Industry Policy Institute.
8) Masayuki Terada, "What is Differential Privacy?" Systems, Control and Information, Vol. 63, No. 2, pp. 58-63 (2019).
9) Such technologies may be useful not only for protecting privacy in personal medical and health information, but also for using information that must be kept confidential across organizations, for example, the sharing of intra-company data related to the creation of new drugs.
10) Katsumi Takahashi, "Privacy Preserving Data Mining," Systems, Control and Information, Vol. 63, No. 2, pp. 43-50 (2019).
11) Takao Takenouchi, "Discussions and Recent Developments on Secret Computation at PWS2018" (presentation at the PWS Meetup, NEC Security Research Laboratories, March 7, 2019).
12) S. Laur et al., "From Oblivious AES to Efficient and Secure Database Join in a Multiparty Setting," Applied Cryptography and Network Security, pp. 84-101 (2013).
13) Apple Inc., "Differential Privacy" (viewed September 12, 2020).
14) "Google Community Mobility Reports" (viewed September 12, 2020).
15) Yuichi Kiyoshi et al., "Proposal of a Neural Network Model Construction Method Satisfying Differential Privacy," 32nd Annual Meeting of the Japan Society for Software Science and Technology (2015).
16) Japan Patent Office, "FY2015 Patent Application Technology Trend Survey Report (Summary): Anonymization Technology."
17) https://ai.googleblog.com/2017/04/federated-learning-collaborative.html (viewed September 12, 2020).
18) https://developers-jp.googleblog.com/2019/03/tensorflow-federated.html (viewed September 12, 2020).
19) https://pytorch.org/ (viewed September 12, 2020).
20) Owkin website (viewed September 21, 2020).
21) Sheller, M. J. et al., Scientific Reports 10, 12598 (2020).
22) Yuzheng Li et al., "A Blockchain-based Decentralized Federated Learning Framework with Committee Consensus" (viewed October 11, 2020).
23)
24) NEC, "Introducing NEC's Secret Computation Technology" (viewed September 12, 2020).
25) As of this writing (September 2020), the topic of fairness in federated learning has been raised at "BIAS 2020," a workshop held at ECML (the European Conference on Machine Learning). We hope such discussions will progress in Japan as well.
26)
27) Ministry of Health, Labour and Welfare, Consortium for Accelerating AI Development in the Health and Medical Care Field, materials from the 1st-11th meetings and "Process Chart Based on the Arrangement of Discussions and Future Direction."
