Points of View: Current Status of Biobank (UK Biobank) Utilization
Hodai Okada, Senior Researcher, Pharmaceutical Industry Policy Institute
Collaboration between the Pharmaceutical Industry and Biobanks
The UK Biobank is a large-scale data source that enables genomic and other omics data to be linked with phenotypes recorded in clinical information. In Japan, collaboration between the Tohoku Medical Megabank Organization and the Japan Pharmaceutical Manufacturers Association started in 2020, and in March 2021 the Tohoku Medical Megabank Organization and five pharmaceutical companies launched the "Consortium for Integrated Analysis of Whole Genome Information and Medical/Health Information"1) (Table 1). Biobank research in collaboration with pharmaceutical companies is led by the UK Biobank in the U.K., which is actively conducting genome-wide association studies (GWAS), a method that searches comprehensively across the genome for genetic polymorphisms, and by FinnGen in Finland. The characteristics of the biobanks involved and the nature of the collaborations suggest that pharmaceutical companies have high demand for large-scale genome-related data. Although the number of biobanks conducting research is increasing worldwide, only a few currently collaborate with pharmaceutical companies, which may be due to differences in each biobank's policies on commercial use of samples and information and on access from abroad. The fact that pharmaceutical companies collaborate not only with disease biobanks but also with general population biobanks indicates that they need information not only on specific diseases but also on the general population.
In the future, large-scale data from biobanks may become an important source of information for drug discovery research. The pharmaceutical industry needs to closely examine both the system for conducting research with biobank samples and information, which is run mainly by academic and medical institutions, and the usefulness of the accumulated samples and information, and then consider how to participate in that research. As a reference for these considerations, this paper surveys the current status of the UK Biobank based on published academic papers and other information.
Data Access Policy
Each biobank has its own policy on the scope of research that may be conducted and on who may use its samples and information. Many biobanks limit access to information to researchers in their own countries or to researchers affiliated with public research institutions2). The UK Biobank, which has contributed to the greatest number of research results, was established from the outset as an open access resource that makes accumulated samples and information as widely available as possible. Applications are widely accepted from outside the UK for any purpose3). In contrast, All of Us, a general population biobank in the U.S., does not currently allow commercial entities or researchers outside the country to access its data, so information sharing policies differ even among biobanks that collect similar samples and information4). These differences in usage policies stem from various factors, such as how the consent document describes the scope of provision of samples and information, the aim of securing the international competitiveness of domestic genome research, the possibility of establishing a large-scale genome information sharing platform, and concerns about the protection of personal information. Where the consent document clearly states that samples and information will be made available to researchers and commercial companies outside the country, omics analyses are being conducted on a large scale in collaboration with pharmaceutical companies.
These large-scale data can also be accessed through a dedicated cloud-based analysis environment, and both the volume of information and the ease of access are thought to have contributed to the increased frequency of use. As mentioned above, expanding the scope of access to samples and information raises concerns such as information protection. Nevertheless, the fact that the UK Biobank has contributed to more scientific papers than any other biobank shows that, from the standpoint of conducting research, it is making a significant contribution as the most useful source of biological information (Figure 1).
The information access policy has a significant impact on how research is organized, as can be seen from the author information of the academic papers in which the results are ultimately published. Using Web of Science, a database of academic literature, we surveyed original papers that include the name of the UK Biobank, the China Kadoorie Biobank, or the Tohoku Medical Megabank Organization, each of which has produced many academic papers, in the title, abstract, or keywords. Although the UK is the country with the largest number of papers related to the UK Biobank, about half of the papers did not include as an author any researcher from the UK, where the biobank is located. In contrast, for the China Kadoorie Biobank, a collaboration between Peking University and Oxford University, researchers from China or the UK are included as authors in the majority of papers, because Chinese institutions were given a period of preferential access before data access from abroad was accepted. Since the Tohoku Medical Megabank Organization in Japan likewise does not grant data access to researchers outside Japan, Japanese researchers are included as authors in all relevant papers. These patterns reflect the differences in biobank policies on the use of samples and information5). For biobanks other than the UK Biobank, the number of papers reported on each biobank's website differed significantly from the number found by searching Web of Science. The most significant factor was that many papers, mainly those with relatively early publication years that used information from a small number of cases, did not use the name of the biobank in the text and so were not retrieved from Web of Science.
Other factors, such as gaps in the Web of Science search conditions, differences in each biobank's criteria for listing papers, and differences in how quickly information is updated, also made it impossible to cover all papers. However, since most of the major papers using each biobank's large-scale data were captured under the current extraction conditions, we present the figures as reference values showing trends in the authors' countries for each biobank (Table 2).
UK Biobank deliverables
Under its open access principle, the UK Biobank requires that research results be published so that findings can be used in other research. To understand the current status of biobank research, we surveyed the literature registered in Web of Science, focusing on the outputs of the UK Biobank, where researchers from all over the world have access to samples and information and a wide variety of research is being conducted. The survey covered original papers that included "UK Biobank" or "United Kingdom Biobank" in the title, abstract, or keywords. The number of papers retrieved from Web of Science was larger than the 3,207 papers listed on the UK Biobank website as of January 12, 2023, the day of the survey. As mentioned above, it takes some time after publication for a paper to appear on the website, and the Web of Science results include review papers and the like; however, we confirmed that the majority were reports of research using UK Biobank information and therefore included them in the survey as papers related to the UK Biobank. When counting papers by author country, a paper was counted for a country if at least one of its authors was from that country (for example, a paper written by two authors from the UK and the US was counted as one paper for each of the UK and the US).
Looking at the number of papers related to the UK Biobank over time, the number has increased year by year, confirming that many papers have been published as data have accumulated in the biobank (Figure 2)6). The countries of the institutions to which the authors belong show that papers related to the UK Biobank have been published not only from the UK but also from various other countries, chiefly the U.S. and China. This clearly indicates that the open access strategy, which makes samples and information widely available to researchers outside the country, has contributed significantly to the development of research using biological information (Figure 3).
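The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_papers_by_country` and the sample records are hypothetical, and the survey's actual tabulation procedure is not described beyond the rule itself.

```python
from collections import Counter

def count_papers_by_country(papers):
    """Count each paper once for every distinct country among its authors."""
    counts = Counter()
    for author_countries in papers:
        # set() ensures several authors from one country add only one count
        counts.update(set(author_countries))
    return counts

# Hypothetical records: each paper is a list of its authors' countries
papers = [
    ["UK", "US"],   # counted once for the UK and once for the US
    ["UK", "UK"],   # two UK authors still count as one UK paper
    ["China"],
]
counts = count_papers_by_country(papers)
print(counts)
```

Because a multi-country paper is counted once per country, the per-country totals can add up to more than the number of papers.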
International Collaboration
In the UK Biobank, information is shared not through collaboration between the biobank and the researchers who use the samples and information but by granting permission for access, which means that papers can be submitted without including the UK Biobank as an author. To investigate trends in the countries where biobank research is conducted, we compiled data on cross-border collaborations based on the countries of the institutional affiliations of the authors of papers related to the UK Biobank (Figure 4). For each country, we examined the distribution of the number of countries represented among each paper's author affiliations, using as the denominator the number of papers that included at least one researcher affiliated with a research institution in that country. In the U.K. and the U.S., where the number of papers was highest, papers written by authors from a single country, from two countries, and from three or more countries each accounted for about 30%, confirming a similar trend of international collaboration. On the other hand, even among the top countries by number of papers, international collaboration shows distinct patterns: in China, more than 40% of papers were written by authors from a single country, whereas in Australia and Sweden only about 10% were. Such differences are also evident within Asia: in South Korea, as in China, more than 40% of papers were written within a single country, while in Singapore the majority of papers came from research conducted across multiple countries. No geographical characteristics or clusters were identified when the breakdown of coauthor countries was checked for each country (Table 3).
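The per-country distribution described above (single country, two countries, three or more countries, with the country's own paper count as the denominator) can be sketched as follows. The function name `collaboration_profile`, the bucket labels, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

BUCKETS = ("single", "two", "three_or_more")

def collaboration_profile(papers):
    """For each country, return the share of its papers by number of author countries."""
    tallies = defaultdict(Counter)
    for author_countries in papers:
        countries = set(author_countries)
        n = len(countries)
        bucket = "single" if n == 1 else "two" if n == 2 else "three_or_more"
        # The paper contributes to the tally of every country it involves
        for country in countries:
            tallies[country][bucket] += 1
    shares = {}
    for country, tally in tallies.items():
        total = sum(tally.values())  # denominator: papers involving this country
        shares[country] = {b: tally[b] / total for b in BUCKETS}
    return shares

# Hypothetical records: each paper is a list of its authors' countries
papers = [
    ["UK"],
    ["UK", "US"],
    ["UK", "US", "China"],
    ["China"],
    ["China"],
]
shares = collaboration_profile(papers)
print(shares["China"])
```

Using each country's own paper count as the denominator keeps the shares comparable between countries with very different publication volumes.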
Collaboration with Pharmaceutical Industry
Regarding the institutional affiliations of the authors of papers related to the UK Biobank, academic institutions in the United Kingdom and the United States account for most papers, and the proportion of papers with authors from private companies, including the pharmaceutical industry, is small. However, the number of papers with pharmaceutical company authors rose to 39 in 2022, about three times the number in 2019 three years earlier, showing a gradual upward trend. Among pharmaceutical companies, Regeneron Pharmaceuticals (USA) is involved as an author in the largest number of papers; the company collaborates with the UK Biobank to perform exome sequencing of samples accumulated at the UK Biobank in its own laboratory. The number of papers involving Regeneron exceeds that of RIKEN, which has reported the largest number of research results in Japan. Papers in which Regeneron and 26 other companies (25 major pharmaceutical companies7)) were involved as authors numbered 126 out of 4,188, or about 3% of the total.
Research Contents
An analysis of paper titles was conducted to investigate trends in the research content of papers associated with the UK Biobank. The frequency of expressions used in the titles was tabulated using word-level bi-grams (strings consisting of two adjacent words). The most frequent expression, "Mendelian Randomization", was found in 555 paper titles (about 13%), followed by "Randomization Study" and "Cohort Study" (Table 4). "Randomization Study" ranks high even though biobank research is observational and cannot be randomized by intervention; the reason is that the phrase "Mendelian Randomization Study" is often used in paper titles, so "Randomization Study" is counted along with "Mendelian Randomization". Mendelian randomization is a research method that treats genomic information as an instrumental variable to reduce confounding in observational studies, and its use has been increasing in recent years (see below for details). Among other research methods, "Polygenic Risk Score (PRS)" and "Genetic Risk Score (GRS)" also appear near the top of the list, indicating that many papers on genome analysis have been published. In terms of disease areas, COVID-19, cardiovascular disease, type 2 diabetes, and cancer also rank high, indicating that these areas are being researched extensively using information from the biobank.
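A word-level bi-gram tabulation of this kind can be sketched as below. The tokenization rule, the choice to count each bi-gram at most once per title, the function names, and the sample titles are all assumptions made for illustration; the survey does not state these details.

```python
import re
from collections import Counter

def title_bigrams(title):
    """Split a title into lowercase words and return adjacent word pairs."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
    return list(zip(words, words[1:]))

def bigram_title_counts(titles):
    """Count the number of titles in which each bi-gram appears."""
    counts = Counter()
    for title in titles:
        # set() counts a bi-gram once per title even if it is repeated
        counts.update(set(title_bigrams(title)))
    return counts

# Hypothetical titles, not taken from the surveyed papers
titles = [
    "Body mass index and colorectal cancer: a Mendelian randomization study",
    "Sleep duration and type 2 diabetes: a prospective cohort study",
    "Mendelian randomization of blood pressure and stroke",
]
counts = bigram_title_counts(titles)
for bigram, n in counts.most_common(3):
    print(" ".join(bigram), n)
```

The first sample title contributes to both "mendelian randomization" and "randomization study", which is the same overlap that places "Randomization Study" near the top of the ranking.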
UK Biobank samples
In addition to health status, physical measurements, and samples and test results for blood, urine, saliva, and so on, the UK Biobank provides bone density results from dual-energy X-ray absorptiometry (DXA), imaging information such as MRI, and accelerometer information, making information from a variety of domains available (Table 5)8). For the 859 papers registered in Web of Science out of the 931 papers reported in 2021 that are listed on the UK Biobank website, we surveyed the disease areas studied and the use of genomic information, imaging information, and accelerometer information, which are characteristic of the UK Biobank. Results of physical measurements and biochemical tests were also frequently used but were not included in this survey because of the wide variety of test items used in each study.
First, to ascertain the distribution of disease areas under study, each paper was assigned one major category from the ICD-10 disease classification. The most frequently studied disease area among the papers reported in 2021 was cardiovascular disease (Figure 5). This is consistent with the analysis of paper titles in the previous section, where "Cardiovascular Disease" also ranked near the top. Other major categories included mental and behavioral disorders, including depression and dementia; neoplasms, including various cancer-related diseases; endocrine, nutritional, and metabolic diseases, including diabetes mellitus; and COVID-19. These results are also generally consistent with the trend in the analysis of paper titles. There were also many studies to which no single major category could be assigned, including many papers proposing methods for analyzing genomic information or head MRI.
The genome-related information currently available in the UK Biobank can be broadly classified into three categories: microarray-based SNP typing, whole exome sequencing, and whole genome sequencing9). Of the 859 papers reported in 2021, 64% (547 papers) used genome information; as in the analysis of paper titles, genome-related studies accounted for the majority. Whole exome and whole genome sequence data were released in the middle of 2021 and were used in only 9% (47 papers) of the papers using genome information, but the UK Biobank has announced that the number of samples will increase, so the number of papers using this information is expected to grow. Papers using Mendelian randomization, which ranked highest in the analysis of paper titles, accounted for 34% (187 papers) of the papers using genome information, and this survey also confirmed that many studies apply information such as variants detected in GWAS.
As for imaging information, MRI images of the head, heart, and abdomen, whole body scans using the DXA method, carotid ultrasound images, and fundus images are provided10). Among the papers reported in 2021, 16% (140 papers) used imaging information, and the most frequently used was information derived from MRI images of the head, at 9% (78 papers) (Figure 6). Mental and behavioral disorders were the most common disease area in which head MRI images were used, at 28% (22 papers). For some imaging information, numerical values derived from the images are published together with the images themselves, and this survey included papers using those derived values as well as papers using the images. Creating an environment that allows secondary use of information obtained in past studies is thus considered one of the factors that increase the frequency of use, because it makes the information easier to process.
Accelerometer information was used in 2% (20 papers) of the papers. Accelerometers were mainly used in studies that estimated either daytime activity or sleep duration. The UK Biobank also collects similar information through questionnaires, such as time spent on each activity (strenuous exercise, walking, watching TV, using a computer, etc.) and sleep duration, from which the MET (Metabolic Equivalent of Task) score, a measure of weekly activity, is calculated. More studies used this questionnaire-based information than used accelerometer information. Processing of the accelerometer information is still under way, and at this point it tends to be difficult to use as a covariate in research.
Countries conducting genomic research using large-scale data
The research results of the UK Biobank, which has accumulated a wide variety of samples and information, indicate that research using genome information is the most significant research area in biobank research. This survey found three main lines of work: GWAS that use biobank samples and information to investigate associations between genomic information and clinical information such as disease; studies that use PRS or GRS to estimate genetic disease risk; and studies that apply Mendelian randomization to variants detected by GWAS in order to estimate causal relationships in observational studies. A brief description of each analysis method is given below with reference to the Tohoku Medical Megabank Organization website.
PRS (Polygenic Risk Score), GRS (Genetic Risk Score)11)
For variants such as single nucleotide polymorphisms that GWAS and other methods have suggested are associated with a disease, a score is obtained by multiplying the estimated effect size of each high-risk polymorphism by the number of copies of that polymorphism the individual carries and summing the products. The score is calculated for each individual, and on that basis a high or low genetic risk of developing various diseases can be evaluated quantitatively. PRS and GRS are similar analytical methods, but the term PRS is sometimes used when variants less strongly associated with the disease are also included in the score calculation.
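The score described above is a weighted sum: for each variant, the estimated effect size is multiplied by the number of high-risk alleles the individual carries, and the products are added up. A minimal sketch follows; the variant identifiers, effect sizes, and function name are hypothetical, and real pipelines involve further steps (variant selection, quality control, standardization) that are not shown.

```python
def polygenic_risk_score(effect_sizes, risk_allele_counts):
    """Sum of (effect size x number of risk alleles) over the scored variants."""
    return sum(
        effect_sizes[variant] * count
        for variant, count in risk_allele_counts.items()
        if variant in effect_sizes  # ignore variants without an effect estimate
    )

# Hypothetical effect sizes, e.g. per-allele log odds ratios from a GWAS
effect_sizes = {"variant_a": 0.20, "variant_b": 0.10, "variant_c": 0.05}

# Hypothetical individual: number of risk alleles (0, 1, or 2) per variant
individual = {"variant_a": 2, "variant_b": 0, "variant_c": 1}

score = polygenic_risk_score(effect_sizes, individual)
print(round(score, 2))
```

Scores computed this way are usually interpreted relative to the distribution of scores in a reference population rather than as absolute risks.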
Mendelian randomization12)
Mendelian randomization is a method that reduces confounding in observational studies by taking advantage of the random distribution of variants, such as single nucleotide polymorphisms, and using the variants as instrumental variables. In the Tohoku Medical Megabank Organization study, the risk of colorectal cancer was compared between high and low BMI groups using BMI values predicted from genomic information, rather than actual BMI values influenced by lifestyle and other confounding factors. The use of predicted BMI values in the analysis equalizes background factors between populations and reduces the effects of confounding compared to conventional observational studies.
The tabulations up to the previous section covered only studies associated with the UK Biobank, and given the racial distribution of the biobank's participants, the countries using its samples and information may be skewed toward Europe. Therefore, we surveyed countries conducting genomic studies using large-scale data, focusing on Mendelian randomization and PRS/GRS, which are currently the most heavily used methods in studies using biobank genomic information. Using Web of Science, we identified original papers containing the relevant terms in the title, abstract, or keywords for Mendelian randomization (search term: "Mendelian Randomization") and for PRS/GRS (search terms: "Polygenic Risk Score" OR "Genetic Risk Score"). The number of papers for both has been increasing, confirming that these research methods have attracted attention in recent years (Figure 7). The countries of the authors of papers using each method were broadly the same as for the UK Biobank-related papers, but the number of papers using PRS/GRS was higher in the U.S. than in other countries, indicating that each country has its own characteristics in terms of research content. This trend also held among the studies related to the UK Biobank, where a large number of studies using PRS/GRS were conducted in the United States.
Finally, we review the current status of international collaboration based on the papers that used either of these research methods. For papers that include authors from institutions in multiple countries, we checked the combinations of countries appearing as authors of the same paper and used the top 20 combinations by number of coauthored papers to illustrate the state of collaboration (Figure 8). Although the United States, the United Kingdom, and China lead genome research using large-scale data, the United States and the United Kingdom are at the center of international collaboration, while China tends to conduct research domestically with relatively little collaboration with other countries. The Netherlands, Germany, and Sweden followed, confirming that a large number of joint research projects are also being conducted among these countries. In the publication counts by country/region published by the National Institute of Science and Technology Policy13), the top three countries in basic life sciences and clinical medicine are the U.S., China, and the U.K., which is similar to the result of this survey. However, when limited to the areas surveyed in this study, there is a large gap between the top three and the countries ranked fourth and below. The fact that Nordic countries such as Sweden, Denmark, and Finland rank near the top is also a feature that differs slightly from the ranking for basic life sciences and clinical medicine as a whole.
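The idea can be illustrated with a small simulation. The sketch below uses the ratio (Wald) estimator, a common textbook form of Mendelian randomization chosen here for illustration; it is not necessarily the method used in the study cited above. All names, parameter values, and the simulated data are hypothetical.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x

def simulate(n, true_effect, seed=0):
    """Simulate a variant, an exposure, and an outcome sharing a hidden confounder."""
    rng = random.Random(seed)
    genotype, exposure, outcome = [], [], []
    for _ in range(n):
        g = (rng.random() < 0.3) + (rng.random() < 0.3)  # 0, 1, or 2 alleles
        u = rng.gauss(0, 1)                               # unobserved confounder
        x = 0.5 * g + u + rng.gauss(0, 1)                 # exposure (e.g. BMI)
        y = true_effect * x + u + rng.gauss(0, 1)         # outcome
        genotype.append(g)
        exposure.append(x)
        outcome.append(y)
    return genotype, exposure, outcome

g, x, y = simulate(50_000, true_effect=0.3)

# Naive observational estimate: biased because u affects both x and y
naive_estimate = slope(x, y)

# Ratio estimate: variant-outcome association / variant-exposure association
mr_estimate = slope(g, y) / slope(g, x)

print(f"naive: {naive_estimate:.2f}, Mendelian randomization: {mr_estimate:.2f}")
```

Because the simulated variant is assigned independently of the confounder, the ratio estimate should land close to the true effect of 0.3, while the naive regression is pulled upward by the confounder. The sketch assumes the variant influences the outcome only through the exposure, which must be justified in any real analysis.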
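The tabulation of country combinations described above can be sketched as follows: for each paper, every unordered pair of distinct author countries is counted once, and the most frequent pairs are kept. The function name and sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def top_country_pairs(papers, top_n=20):
    """Return the most frequent country pairs appearing on the same paper."""
    pair_counts = Counter()
    for author_countries in papers:
        # Sort so that (UK, US) and (US, UK) are treated as the same pair
        countries = sorted(set(author_countries))
        pair_counts.update(combinations(countries, 2))
    return pair_counts.most_common(top_n)

# Hypothetical records: each paper is a list of its authors' countries
papers = [
    ["US", "UK"],
    ["UK", "US", "China"],
    ["China"],            # single-country paper: contributes no pair
    ["Sweden", "UK"],
]
top_pairs = top_country_pairs(papers)
for pair, n in top_pairs:
    print(pair, n)
```

A paper with authors from three countries contributes three pairs, so countries that often appear on large multi-country papers accumulate many links.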
Summary
Biobank research, which began in the late 1990s and has been promoted in many countries, has accumulated data in recent years, and many results that use the accumulated information as large-scale biological data have been reported. Biobanks differ in their policies on use by pharmaceutical companies and on access to information from overseas. To make even more effective use of the information, it is desirable for biobanks and the pharmaceutical industry to cooperate in discussing data management issues such as the protection of personal information and methods for sharing large-scale data.
Several pharmaceutical companies in Europe and the U.S. have built a foundation for using large amounts of genome information in drug discovery through joint research with biobanks, and Japanese pharmaceutical companies have begun collaborating with biobanks to keep up with the U.S. and Europe. On the other hand, Japan lags behind other countries in the number of publications on research using the UK Biobank, the largest biobank currently available, and on research using large-scale genome information. Domestic biobanks are being developed in Japan, and domestic information may often be used in consideration of racial differences, but other countries are in similar situations. Considering that biobank information has been available for only a short time, this trend is consistent with the National Institute of Science and Technology Policy report that Japan's position in paper production has been declining in recent years, and there is concern that genome research using large-scale data is no exception. In Japan, biobank research led by universities and hospitals and based on samples and information collected by their own organizations is increasing14). Although Japan is following other countries in data accumulation, the low number of publications is likely because Japan lags behind other countries in research that uses such information to produce results.
In the United States, the United Kingdom, and China, the top three organizations by number of publications related to the UK Biobank were all universities, whereas in Japan RIKEN had the largest number, followed by universities, a slightly different pattern from the top countries. Judging from the trends in other countries, strengthening research capabilities that involve universities is essential to increasing the number of papers. To maintain Japan's international competitiveness in drug discovery, pharmaceutical companies will need to deepen their collaboration with academia and focus on training personnel who can handle large-scale data in the life sciences.
1) Number of reports and countries from which data was obtained
2)
3) Conroy, Megan, et al. The advantages of UK Biobank's open-access strategy for health research. Journal of Internal Medicine, 2019, 286.4: 389-397.
4)
5)
6)
7) Based on the 25 largest pharmaceutical companies in the world listed in DATA BOOK 2022 (Japan Pharmaceutical Manufacturers Association).
8) Littlejohns, Thomas J., et al. UK Biobank: opportunities for cardiovascular research. European Heart Journal, 2019, 40.14: 1158-1166.
9)
10)
11)
12)
13)
14)
