Points of View: Efforts to Ensure Data Quality in Europe (Fit-for-purpose data to promote secondary use)

Junya Tsujii, Senior Researcher, National Institute of Biomedical Innovation Policy

SUMMARY

In Europe, data quality based on data characteristics is defined as fitness for the purposes of data users, on the precondition that the data reflect reality. To ensure the quality of healthcare data, TEHDAS (Towards the European Health Data Space) and the EMA (European Medicines Agency) and HMA (Heads of Medicines Agencies) have each published a data quality framework based on the concept of fit for purpose. The establishment of a standardized metadata catalog is also being considered so that data users can appropriately judge the quality of data. As one specific initiative, the EMA-HMA has published the world's first guide providing metadata catalog recommendations for real-world data used in pharmaceutical regulatory decision-making. In addition, to make these efforts effective, financial incentives that take into account the costs of data collection and secondary use are being considered for data owners and managers.

1. Introduction

In the future of medical health care, people will use all data acquired from birth to manage and improve their health status even before they become ill, and will receive early detection of disease, appropriate personalized treatment intervention, and prognostic care. In addition, health and medical data acquired in various ways will provide support for people to stay healthier, leading to optimal policy decisions, innovations in medical technology including pharmaceuticals, and optimal allocation of medical resources at any given time.

The key to realizing such a future is "health and medical data," and in recent years Japan has been developing and studying institutional policies to promote its utilization. The "Roadmap for the Promotion of Medical DX," released on June 2, 2023, aims to establish a framework for information sharing and for the utilization of personal health records (PHRs) in order to further improve public health and to provide high-quality medical care efficiently and without gaps1). In addition, to support better secondary use by industry and others, the Next Generation Medical Infrastructure Act was amended2) (enacted in May 2023) to include the use of pseudonymized medical information in applications for pharmaceutical approval, and the Plan for Implementation of Regulatory Reform3) (approved by the Cabinet in June 2023) indicates the need to consider legislation for the utilization of medical and other data that does not necessarily rely on explicit consent. Meanwhile, although what is meant by "quality" may differ from case to case, the trade press has made several references to further securing the quality and quantity of health and medical data4), 5).

Thus, although institutional policies are being developed to accelerate data utilization, especially secondary use, we believe that the quality and quantity of the health and medical data actually utilized have not yet been studied sufficiently. In light of the above, this report provides an overview of European initiatives, which several organizations, including the Japan Pharmaceutical Manufacturers Association, have cited as a reference for the utilization of healthcare data in Japan, and examines measures to promote the secondary use of data in Japan from the perspective of ensuring data quality.

2. European Trends for Ensuring Data Quality

2-1. What is the European Health Data Space (EHDS)?

The European Health Data Space (EHDS) is a common European data space concept that promotes, through the cross-border use of information, the provision of high-quality healthcare to citizens in Europe (primary use) as well as healthcare policy, medical research, drug discovery, etc. (secondary use). The economic benefits are expected to be approximately EUR 11 billion over 10 years6), including EUR 5.4 billion in cost reductions and efficiency gains in the healthcare sector through increased telemedicine penetration and EUR 3.4 billion in cost reductions in the reuse of healthcare data through data access mechanisms established by the Health Data Access Bodies (HDABs; for details, see the Supplement) in each country8).

The EHDS builds on the European General Data Protection Regulation (GDPR) for data protection. However, Article 9 of the GDPR gave member states discretion in handling genetic data, biometric data, and health-related data, which resulted in differences in implementation and interpretation among member states and created barriers to actual data use. Furthermore, differences in data quality have been pointed out as a barrier to cross-border data sharing9). The EHDS is being considered as a common European data ecosystem to address these issues, and efforts are underway with the aim of its becoming law in 2024 and coming into effect in 2025.

Supplement: Cross-border Data Utilization Flow

Figure 1 shows the flow of cross-border data utilization based on the EHDS.

 Figure 1: Flow of Data Utilization Across Borders

Health and medical data held by domestic medical institutions, etc. are exchanged across borders via a central platform, with the National Contact Points (NCPs) established in each member country serving as contact points. The infrastructure connecting each country's NCP to the central platform depends on the purpose of use: MyHealth@EU is used for primary use and HealthData@EU for secondary use. MyHealth@EU is currently in operation in only 11 countries, but by 2025 the majority of EU and European Economic Area (EEA) member countries are expected to join7). For secondary use, data may be used only when the Health Data Access Body (HDAB) established in each country determines that the request meets the purposes of use described in Article 34 of the EHDS, and only after the data have been processed so that individuals cannot be identified. Article 41 of the EHDS also imposes obligations on data holders, including the obligation to make data available and to provide data upon request within a set timeframe. In addition, the HDAB is obligated to make the legal basis on which access is granted and the results of projects in which the data were used publicly available and easily retrievable.

2-2. What is "data quality"?

What is data "quality"? The author believes that data quality can be broadly divided into two categories (Figure 2). One is data standardization and structuring, which refers to storing data collected by different institutions and for different purposes in the same interoperable format. The other is the characteristics of the data itself, such as reliability and consistency. In Japan, discussions on data standardization and structuring, such as the standardization of electronic medical records, are already underway10), so this paper considers data quality from the perspective of the "characteristics of the data itself." Therefore, "data quality" in this paper refers to quality based on data characteristics unless otherwise specified.

 Figure 2 The author's classification of data quality

For a definition of data quality focusing on data characteristics11), we refer to the reports of TEHDAS (Towards the European Health Data Space), a collaboration of various organizations (public authorities, academia, medical societies, government agencies, etc.) in 21 EU member states and four other European countries. Of the eight work packages in TEHDAS, Work Package 6 (WP6: Excellence in data quality) deals with data quality. "Recommendations on a Data Quality Framework for the European Health Data Space for secondary use"13), published by WP6 in September 2023, defines data quality as "meeting the needs of data users (health research, policymaking, and regulation)." For reference, the "European Health Data Space Data Quality Framework"14), released by WP6 in May 2022, states that the quality of data to be used in the EHDS should be "fit for purpose to user needs for health research, policymaking, and regulation, and reflect the reality that the data seek to represent." In light of the above, we believe that data quality based on data characteristics is defined by the degree to which the data are fit for purpose for data users (private companies, government agencies, academia, etc.), with the precondition that the data also reflect reality.
The "Data Quality Framework for EU medicines regulation," jointly released in September 2022 by the EMA (European Medicines Agency) and the HMA (Heads of Medicines Agencies) and discussed later, adopts a definition15) based on the TEHDAS proposal released in May 2022 (Figure 3).

 Figure 3 Relationship between EHDS and TEHDAS/EMA-HMA Data Quality Framework

2-3. Determinants of Data Quality

The determinants of healthcare data quality are considered here on the basis of the Data Quality Frameworks published by TEHDAS and EMA-HMA. Although the TEHDAS framework does not directly state which data it covers, it is likely to cover medical data, genomic and omics data, electronic health data from clinical trials, etc. (Article 33 of the EHDS16)), given that it is written with the realization of the EHDS in mind. The EMA-HMA framework, in addition to medical data and real-world data (hereafter RWD), mentions bioanalytical omics data, preclinical data, spontaneous adverse event reporting data, and chemical and manufacturing control data as areas of particular importance. The focus is on the lowest possible level of data granularity, i.e., the value level (specific data points).

2-3-1. TEHDAS: Recommendations on a Data Quality Framework for the European Health Data Space for secondary use

Data Quality Assurance at the Data Set Level

In measuring data quality, it is necessary to specify the determinants of quality. These determinants are called dimensions: a dimension is a measurable data characteristic that represents one or more relevant aspects or features of reality. In addition to data quality at the dataset level, the framework also focuses on data utility (usefulness) and specifies dimensions for both in order to facilitate a fit-for-purpose approach by data users. Utility is an aspect of data quality that depends on ex-ante and ex-post conditions centered on the data user. Ex-ante, it can be evaluated before data use according to some of the dimensions listed below. Ex-post, it can be measured by the use (actual use), interest (actual queries), and value (evaluation of the provided data by the data user) of the dataset in question, and is evaluated by how well the user's expectations for a particular purpose are fulfilled.

The following six specific dimensions of data quality and usefulness are listed (Table 1).

  • (i) Relevance
  • (ii) Accuracy and Reliability
  • (iii) Coherence
  • (iv) Coverage
  • (v) Completeness
  • (vi) Timeliness
 Table 1 TEHDAS framework: dimensions related to data quality (quality and usefulness)

The report recommends that Relevance, Accuracy and Reliability, and Coherence be included in the framework as dimensions related to quality, and Coverage, Completeness, and Timeliness as dimensions related to usefulness. It also recommends that data users provide feedback on the quality and usefulness of the data set so that data errors can be corrected. It should be noted, however, that under the definition of data quality the key point is "fitness for purpose," and the criteria to be reached in each dimension are assumed to differ for each intended use.
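
The point that the criteria differ by intended use can be illustrated with a minimal sketch, in which the same data set passes for one purpose and fails for another. The 0 to 1 scoring scale, the two purposes, and every number below are hypothetical; the TEHDAS report does not prescribe numeric scores or thresholds.

```python
# Hypothetical illustration: dimension names follow Table 1; all values are invented.
DIMENSIONS = (
    "relevance", "accuracy_and_reliability", "coherence",
    "coverage", "completeness", "timeliness",
)

def unmet_dimensions(scores, thresholds):
    """Return the dimensions whose score falls below the purpose-specific threshold.

    Dimensions without a threshold for the given purpose are not constrained.
    """
    return [d for d in DIMENSIONS if scores[d] < thresholds.get(d, 0.0)]

def is_fit_for_purpose(scores, thresholds):
    """A data set is fit for a purpose when no dimension misses its threshold."""
    return not unmet_dimensions(scores, thresholds)

# Invented scores for one data set.
dataset_scores = {
    "relevance": 0.9, "accuracy_and_reliability": 0.8, "coherence": 0.85,
    "coverage": 0.6, "completeness": 0.7, "timeliness": 0.5,
}

# Invented thresholds for two intended uses.
thresholds_by_purpose = {
    "policy_trend_analysis": {"coverage": 0.5, "completeness": 0.6},
    "safety_signal_monitoring": {"accuracy_and_reliability": 0.75, "timeliness": 0.8},
}

for purpose, thresholds in thresholds_by_purpose.items():
    print(purpose, is_fit_for_purpose(dataset_scores, thresholds),
          unmet_dimensions(dataset_scores, thresholds))
```

In this sketch the data set is acceptable for the trend analysis but not for the monitoring use, solely because the latter sets a stricter timeliness threshold.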

Data quality assurance at the data holder level

Under Article 33 of the EHDS, data holders (medical institutions, research institutions, public authorities, EU institutions, etc.) are required to make electronic data available for secondary use. From the perspective of data quality, the TEHDAS framework13) states that when data are collected for purely medical purposes, as with electronic health records (EHRs), and especially when data sets are regularly updated and consolidated, data holders must establish procedures for data quality management and data quality assurance. It recommends that data quality management be applied across the entire data life cycle (Figure 4) and that data quality assurance be applied across data management processes (monitoring, incident detection and resolution, data enhancement, etc.).

 Figure 4 Data life cycle related to data quality management

Data quality management procedures (data governance) should be automated according to the maturity level of the data holder. Maturity is a measure of the extent to which a data holder's (organization's) actions, practices, and processes can reliably and sustainably produce the required results, and is primarily assessed by the Capability Maturity Model (CMM). The CMM divides maturity into the following five levels, depending on the actual ability of data holders to continuously improve data quality.

  • (i) Initial: Poorly documented, ad hoc (non-systematic) data quality checks are performed
  • (ii) Repeatable: Data quality control procedures are well documented and the same procedures can be repeated
  • (iii) Defined: Data quality control procedures are well defined and implemented as a standard process
  • (iv) Managed: The data quality management process includes quantitative measures of quality
  • (v) Optimised: Data quality management involves a deliberate process of optimization and continuous improvement

The above maturity model is intended to serve as a benchmark against which data holders (organizations) can be compared. The TEHDAS report recommends self-assessment by data holders at the "Initial" maturity level and external audit and certification at subsequent levels. Furthermore, it is important to design incentives for data holders in order to ensure continuous improvement and upgrading.
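
As a minimal sketch, the five maturity levels and the recommended form of assessment can be written as an ordered enumeration. The class and function names are hypothetical and chosen only for illustration; the mapping itself follows the recommendation described above.

```python
from enum import IntEnum

class Maturity(IntEnum):
    """The five CMM levels listed above, ordered from lowest to highest."""
    INITIAL = 1
    REPEATABLE = 2
    DEFINED = 3
    MANAGED = 4
    OPTIMISED = 5

def recommended_assessment(level: Maturity) -> str:
    """Self-assessment at the Initial level, external audit and certification above it."""
    if level == Maturity.INITIAL:
        return "self-assessment"
    return "external audit and certification"

for level in Maturity:
    print(level.name, "->", recommended_assessment(level))
```

Using an ordered enumeration makes comparisons between data holders straightforward, for example `Maturity.DEFINED < Maturity.MANAGED`.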

2-3-2. EMA-HMA: Data Quality Framework for EU medicines regulation17)

The Data Quality Framework published by the Big Data Task Force, co-founded by the EMA and HMA, characterizes the quality of data used for pharmaceutical regulatory decision-making and provides definitions, principles, evaluation procedures, etc. that stakeholders can apply to a wide range of data sources18). The Framework was developed with reference to the TEHDAS framework (European Health Data Space Data Quality Framework) published in May 2022 and takes into account feedback from a wide range of stakeholders involved in the EMA, HMA, and TEHDAS. The EMA-HMA framework has the following five dimensions (Table 2). Compared with the TEHDAS dimensions, Accuracy and Reliability is expressed as Reliability, and Coverage and Completeness are combined into a single dimension, Extensiveness.

  • (i) Reliability
  • (ii) Extensiveness
  • (iii) Coherence
  • (iv) Timeliness
  • (v) Relevance
 Table 2 EMA-HMA framework: dimensions related to data quality
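
The correspondence between the six TEHDAS dimensions (Table 1) and the five EMA-HMA dimensions (Table 2), as described in the text, can be written down as a simple lookup. This is only a reading aid; the variable and function names are hypothetical, and dimensions that share a name in both frameworks are assumed to correspond to each other.

```python
# TEHDAS dimension -> EMA-HMA dimension, following the description in the text.
TEHDAS_TO_EMA_HMA = {
    "Relevance": "Relevance",
    "Accuracy and Reliability": "Reliability",
    "Coherence": "Coherence",
    "Coverage": "Extensiveness",      # merged with Completeness
    "Completeness": "Extensiveness",  # merged with Coverage
    "Timeliness": "Timeliness",
}

def ema_hma_dimensions():
    """Distinct EMA-HMA dimensions, in first-seen order."""
    return list(dict.fromkeys(TEHDAS_TO_EMA_HMA.values()))

print(ema_hma_dimensions())
```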

Looking at the details, each of the four dimensions other than Relevance has related sub-dimensions. For example, Reliability, the dimension that evaluates "how accurately the data reflect the measurement intent," uses the sub-dimensions Precision, Accuracy, and Plausibility to answer the question of how well the data match reality. Precision is the degree of approximation with which the data represent the facts, and it varies with the unit of measurement (e.g., age recorded in years or in months). Accuracy is the size of the discrepancy between the data and the facts, and it varies, for example, depending on whether the weight of clothing has been subtracted from a recorded weight. Plausibility, defined as the likelihood that some information is true, concerns the presence or absence of values that are unlikely (or impossible) in the real world, such as weights exceeding 300 kg for the majority of cases in a data set or pregnancy records for men19).
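
The two examples of unlikely values given above (a weight over 300 kg, a pregnancy record for a man) can be turned into simple rule-based checks. The sketch below is illustrative only: the record layout, field names, and sample values are invented, and the framework itself does not prescribe how such checks should be implemented.

```python
# Invented sample records; field names are hypothetical.
records = [
    {"id": 1, "sex": "F", "weight_kg": 62.5, "pregnant": True},
    {"id": 2, "sex": "M", "weight_kg": 81.0, "pregnant": True},    # unlikely: male pregnancy
    {"id": 3, "sex": "F", "weight_kg": 640.0, "pregnant": False},  # unlikely: weight
    {"id": 4, "sex": "M", "weight_kg": 77.3, "pregnant": False},
]

MAX_PLAUSIBLE_WEIGHT_KG = 300  # limit taken from the example in the text

def plausibility_issues(record):
    """Return a list of reasons why a record looks implausible (empty if none)."""
    issues = []
    if record["weight_kg"] > MAX_PLAUSIBLE_WEIGHT_KG:
        issues.append("weight above 300 kg")
    if record["sex"] == "M" and record["pregnant"]:
        issues.append("pregnancy recorded for a male")
    return issues

# Map record id -> issues, keeping only records with at least one issue.
flagged = {}
for r in records:
    issues = plausibility_issues(r)
    if issues:
        flagged[r["id"]] = issues

implausible_share = len(flagged) / len(records)
print(flagged, implausible_share)
```

A share of implausible records computed this way is one possible input when deciding whether a data set meets the threshold set for a given purpose.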

However, data generation and collection methods, etc. are not uniform, and dimensional evaluation must take into account the influence of factors that contribute to data quality. In this framework, these factors (determinants) are classified into the following three categories.

  • (i) Underlying determinants
    Factors related to the data generation process (requirements definition - collection and generation - management and processing - publication - acquisition and aggregation - inspection and acceptance - provision) and systems, independent of the content of the data set.
  • (ii) Endogenous determinants
    Inherent characteristics of a particular data set that are independent of the conditions under which the data is generated or used (e.g., number of decimal places).
  • (iii) Issue-specific determinants
    Factors that depend on a specific issue (objective).

In general, the underlying determinants are considered to have a direct impact on data quality. For example, reliability depends on the processes and systems used for the primary collection and processing of data. Extensiveness is influenced by the specifications of the data collection process. Coherence depends, for data from a single organization, on the synchronization of processes and systems across that organization and, when multiple data sources are integrated, on the commitment of the data-generating organizations to use data standards. Timeliness is determined by the processes and systems used to collect the data and make them available.

Thus, in order to properly measure data quality, the dimensions to be evaluated (including sub-dimensions) and acceptable thresholds should be set according to the objectives of data users, and the evaluation should take into account the determinants that affect the dimensions.

2-3-3. Data Interoperability

Although outside the scope of this paper, the TEHDAS "European Health Data Space Data Quality Framework" mentions that interoperability is a prerequisite for high-quality secondary use of data but does not treat it as a critical function for quality. Similarly, the EMA-HMA Framework places interoperability out of scope, along with standardization and aspects that do not directly influence regulatory decision-making (simplicity, accessibility, etc.). However, the TEHDAS "Recommendations on a Data Quality Framework for the European Health Data Space for secondary use" states that data governance for effective secondary use of data should require data holders to implement interoperability, namely semantic interoperability (the meaning of exchanged data and information is maintained and understood through exchange between parties) and structural interoperability (structural maintenance of literal names, standard abbreviations, encodings, etc.).

2-4. Judgment of Quality by Data Users

2-4-1. EHDS: Use of Metadata Catalog

In order for secondary users of data, such as the pharmaceutical industry, to use data based on the concept of fit for purpose, a mechanism is required that allows them to judge the quality of data in advance. One such mechanism is a "metadata catalog." Metadata are descriptive data that characterize other data and include information that allows data users to assess data quality, such as the source and scope of the data set, its key characteristics, the nature of the health data, and the conditions for making electronic health data available. Cataloguing and publishing this information makes data sets programmatically searchable and helps users find data that are appropriate for their purposes.

In order to develop an interoperable metadata catalog at the European level, the EHDS stipulates that the HDAB, which serves as the secondary-use contact point in each country, must inform data users about available data sets and their characteristics through a metadata catalog (Article 55)16), 20). This is expected to enable data users to find the data they seek appropriately and easily. In addition, Article 56 of the EHDS (Data quality and utility label) states that labels may be required to clarify the characteristics and potential utility of data sets so that data users can appropriately select data sets for their intended use. Data quality and utility labels that conform to the following elements are being considered.

  • (i) Data documentation: metadata, supporting documentation, data model, data dictionary, standards used, provenance
  • (ii) Technical quality: completeness, uniqueness, accuracy, validity, timeliness, consistency of data
  • (iii) Quality control processes: maturity of data quality control processes, including review, audit processes, and bias validation
  • (iv) Scope: representativeness of electronic health data across multiple domains, representativeness of sampled populations, and average timeframe in which a single natural person appears in the data set
  • (v) Information on access and provision: time from collection of electronic health data to its addition to the data set; time to provision of data after approval of an electronic health data access application
  • (vi) Information on data enrichment: integration and addition of data to existing data sets, including links to other data sets
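
As a minimal sketch, the six label elements listed above can be held in a small record type so that a catalog can report which elements a data holder has not yet documented. The class name, field names, free-text representation, and sample values are all hypothetical; the EHDS does not define such a data structure.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class QualityUtilityLabel:
    """One free-text entry per label element; None means not yet documented."""
    data_documentation: Optional[str] = None
    technical_quality: Optional[str] = None
    quality_control_processes: Optional[str] = None
    scope: Optional[str] = None
    access_and_provision: Optional[str] = None
    data_enrichment: Optional[str] = None

def missing_elements(label: QualityUtilityLabel) -> list:
    """Names of the label elements that have not been filled in."""
    return [f.name for f in fields(label) if getattr(label, f.name) is None]

# Invented example: a label with only two of the six elements documented.
label = QualityUtilityLabel(
    data_documentation="data dictionary and data model published",
    scope="sampled population described by age band and region",
)
print(missing_elements(label))
```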

Note that language differences are an issue in the EHDS philosophy of cross-border data sharing, and the metadata catalog is no exception. In cases where information on data held is only available in the local language, measures are being considered to ensure that data users in each country can understand the information by creating it in English or translating the metadata catalog using automated tools (e.g., eTranslation, the European Commission's machine translation system)21).

Thus, for data users to appropriately select data suited to their own purposes, it is essential that data holders organize and disclose information on data quality in a standardized manner, providing the information needed to publish data-related information through metadata catalogues and similar means. On the other hand, responding to these requirements will be no small burden for data holders. The EHDS also provides for a "fee" that allows data holders and HDABs to charge data users an amount that takes into account the costs of data collection and secondary use (Figure 5). However, the current provisions are only conceptual, and a detailed fee structure (e.g., fixed or free fees) needs to be designed at the European level so that the responses of individual countries do not diverge. As a foothold for this, TEHDAS (WP5: Sharing data for health) is discussing the clarification of pricing rules. Specifically, price setting based on the full-cost principle and the provision by the European Commission of a common European model for data use contracts between HDABs and data users are being considered.

 Figure 5 Fee design for data holders and HDABs in EHDS

2-4-2. TEHDAS: Approach to Metadata Catalogue Development

TEHDAS's "Recommendations on a Data Quality Framework for the European Health Data Space for secondary use" presents a three-stage approach to developing a metadata catalog. The first stage focuses on collecting high-level information on available data sets, regardless of field or data type (e.g., using tools such as DCAT, an international metadata cataloging standard recommended by the World Wide Web Consortium (W3C)22)). The second stage provides more detailed information on the quality and usefulness of each data set in light of various secondary use purposes. The third stage focuses on information based on the actual content (variable level) of the data source (e.g., using tools such as Beacon from the Global Alliance for Genomics & Health). Specific practices such as these are being developed for the publication of data characteristics through a metadata catalog as stipulated in Article 55 of the EHDS.
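
To give a sense of what high-level, first-stage information might look like, the sketch below describes two data sets with a handful of terms from the W3C DCAT vocabulary (`dcat:Dataset`, `dcat:keyword`) and Dublin Core (`dct:title`, `dct:description`, `dct:publisher`) and then searches them by keyword. The data sets, the publisher, and the search helper are invented for illustration, the JSON-LD context is omitted, and nothing here is taken from the TEHDAS report or an actual EHDS catalog.

```python
import json

# Invented catalog entries using a few DCAT / Dublin Core property names.
catalog = [
    {
        "@type": "dcat:Dataset",
        "dct:title": "Hypothetical regional cancer registry",
        "dct:description": "Invented example entry; not a real data source.",
        "dct:publisher": "Example Health Authority",
        "dcat:keyword": ["oncology", "registry"],
    },
    {
        "@type": "dcat:Dataset",
        "dct:title": "Hypothetical primary care prescriptions",
        "dct:description": "Invented example entry; not a real data source.",
        "dct:publisher": "Example Health Authority",
        "dcat:keyword": ["prescriptions", "primary care"],
    },
]

def find_datasets(entries, keyword):
    """Return the titles of entries tagged with the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [
        entry["dct:title"]
        for entry in entries
        if wanted in [k.lower() for k in entry.get("dcat:keyword", [])]
    ]

print(find_datasets(catalog, "Oncology"))
print(json.dumps(catalog[0], indent=2))
```

Because every entry uses the same property names, a data user can filter many holders' data sets with one query, which is the practical benefit of a standardized catalog format.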

Data users, for their part, are required to disclose the results of secondary use of data through the HDAB (Article 46 of the EHDS). TEHDAS has also recommended that data users consider the need to enrich data sets and return digital objects (data models, annotations, algorithms, etc.) in a form that allows their reuse13).

2-4-3. EMA-HMA: Utilization of Metadata Catalogue in Pharmaceutical Regulations

In September 2022, the EMA and the HMA jointly published the first edition of the Good Practice Guide for the use of the Metadata Catalogue of Real-World Data Sources24), 25). This is the first guide in the world to provide recommendations to regulators, researchers, and other stakeholders on the use of RWD metadata for drug regulatory decision-making. It is intended to "facilitate the discovery of data sources for generating appropriate evidence for regulatory purposes" and to "provide rapid access to information on the suitability of the data sources used to support the evaluation of research protocols, etc." In other words, while the EMA-HMA Data Quality Framework described in section 2-3-2 defines a framework for assessing data quality, this Guide provides a way of making quality visible for assessment (a standardized data catalog format26)). There is no legal obligation to register data sources in a metadata catalog, except in certain circumstances. However, if data sources are expected to be used for public health or drug regulatory purposes, data holders are encouraged to register and update their data source information, as the absence of public information about data sources may affect scientific credibility and public trust in research results.

The Guide states that in assessing the adequacy of a data source, it is necessary to distinguish between the aspects of "reliability" and "pertinence" in terms of data quality.

  • (i) Quality regarding the reliability of primary data
    Characteristics of the data source that are independent of the specific study objectives, e.g., detection and correction of errors, missing or unrealistic values, formatting
  • (ii) Quality related to the pertinence of data sources that provide adequate and valid evidence to inform a specific research question through epidemiological and statistical methods
    Characteristics of the data source that depend on the research purpose, e.g., availability of the data needed for the study, number of subjects, population characteristics, data duration

Furthermore, specific items of metadata needed to assess data quality are also described in this Guide (Table 3). In addition, the Guide provides examples of metadata catalog use cases from several user perspectives (e.g., identification of appropriate data sources for research, evaluation of research protocols, and review of research reports) to help users envision specific uses. The metadata catalog is expected to be used to provide information for initial assessment of the reliability of data sources and for initial assessment of the applicability of data sources for generating valid evidence for a specific research question.

The metadata catalog based on this guide is scheduled to be released in late 2023 and will replace the existing catalog of the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance. In the future, in order to use data for pharmaceutical regulatory purposes, the pharmaceutical industry will need to evaluate data sources published by other data holders and to consider how to publish metadata catalogs for the data sources that pharmaceutical companies themselves hold.

 Table 3 EMA-HMA: Metadata characterizing data sources

2-5. Efforts in the Pharmaceutical Industry

The European pharmaceutical industry is also taking steps to ensure data quality. At the "Multi-stakeholder workshop on Real World Data (RWD) quality and Real World Evidence (RWE) use"27) held by the EMA in June 2023, the European Federation of Pharmaceutical Industries and Associations (EFPIA) presented the results of its own assessment of the dimensions proposed in the EMA-HMA Data Quality Framework28). Specifically, the Registry Evaluation and Quality Standard Tool (REQueST)30) developed by EUnetHTA29) was used to test the feasibility of evaluating the dimensions proposed by EMA-HMA (Relevance was excluded because it is a research-specific dimension). The assessment found that evaluating the dimensions of the Data Quality Framework "is complex in a decentralized registry with indirect access to patient-level data elements." Based on these findings, EFPIA made three recommendations to the EMA on improving data quality assessment and documentation in order to implement the Data Quality Framework: (1) lead a dialogue among all stakeholders (regulators, registry organizations, industry) aimed at more efficient data quality determination and at coordinating the sustainability of their work, (2) consider guidance on the minimum information that should be available, and (3) lead the joint development and piloting of tools in line with the Data Quality Framework (Figure 6).

 Figure 6 EFPIA's recommendations for the EMA-HMA Data Quality Framework

3. Trends in Japan to Ensure Data Quality

From here, we would like to look at trends in Japan related to data quality assurance.

In March 2022, the Digital Agency released the Data Quality Management Guidebook31). This guidebook aims to build consensus on data quality indicators in Japan and to create an environment in which data holders in both the public and private sectors can provide high-quality data. It provides a framework and evaluation model for data quality management with an eye to data collaboration with other countries. The "data quality assessment model" proposed in the guidebook covers both the data itself and the management process. For the data itself, 15 dimensions (accuracy, completeness, etc.) are set based on the international standard ISO/IEC 25012. For the management process, the evaluation covers data quality planning, quality control, quality assurance, and quality improvement, as well as operational systems (data-related support and resource regulations). In addition, the Metadata Implementation Practical Guidebook, also published by the Digital Agency in March 2022, provides examples of metadata catalog items that should be assigned to data sets, mainly those from government agencies22). However, these guidebooks are not specific to the medical healthcare domain.

In the medical healthcare domain, we believe that biobank initiatives are the furthest along. A cross-search system for samples and information held by the Japanese biobank network (9 sites as of June 2023) has been established, and it stores the information (relating to biobank collaborators, samples, and analysis) that users need to determine whether the material fits their research objectives32). In addition, CANNDs (a platform to promote the utilization of data obtained from AMED-supported research and development) is currently studying a search system that can narrow down the data to be analyzed based on attribute information (metadata: age, sex, birthplace/place of residence, disease name) linked to genome data in the visiting analysis environment33).

On the institutional side, the WG for the Next Generation Medical Infrastructure Act, whose amendment was enacted in May 2023, has raised the question of "how to make it easier for users to search for and use information, such as by releasing data catalogs"34). It has been proposed that information on the data held by authorized business operators that perform anonymous processing, etc., be published as a catalog, but the specific items are still under discussion. Separately, under the Cabinet Office's Biotechnology Strategy, a "Guidebook for the Linkage and Utilization of Biodata" has been published with the aim of fostering a common understanding among related parties to promote the linkage and utilization of biodata35), 36). This guidebook provides examples of metadata format items that contribute to data organization and visualization (e.g., research project information, information on data/databases, etc.). However, it should be noted that the "bio-field" includes not only the medical healthcare domain but also a wide range of other domains, such as agriculture.

While the above efforts are underway in Japan, we believe that further study is needed to develop a data quality framework and metadata catalog specifically for the medical healthcare domain.

Conclusion and Discussion

The TEHDAS and EMA-HMA data quality frameworks share four dimensions for determining data quality: Relevance, Accuracy/Reliability, Coherence, and Timeliness. In addition to these, TEHDAS stipulates Coverage and Completeness, while the EMA-HMA stipulates Extensiveness. At the same time, for data users to appropriately select data suited to their own purposes of use, it is important that standardized metadata describing the characteristics of data be made publicly available. In Europe, the establishment of a common metadata catalog for member states and of a metadata catalog focusing on data used in pharmaceutical regulation is being considered. To promote efforts to ensure data quality, TEHDAS's "Recommendations on a Data Quality Framework for the European Health Data Space for secondary use" sets out what data holders, HDABs, data users, and other stakeholders should comply with (Table 4).

 Table 4: Practices of each stakeholder to ensure data quality (example)

Based on these European efforts, we believe the following two points are important for promoting secondary use of data in Japan from the standpoint of data quality:

  • (1)
    Fostering a common understanding of data quality among stakeholders (e.g., agreement on decision factors, etc.)
  • (2)
    Standardization of metadata catalogs and development of a public disclosure mechanism to enable data users to select data that meet their purposes of use

The following are the author's thoughts on measures to promote these two points.

(1) Fostering a common understanding of data quality among stakeholders

In order to promote the secondary use of health and medical data based on the concept of "fit for purpose," it is necessary to foster, in advance, a common understanding among data owners, managers, and users regarding the definition and determinants of data quality and the maturity level of data owners. To this end, it is desirable to establish a study project involving a wide range of stakeholders, including the public, medical institutions holding data, academia, the government bodies involved in data management and monitoring, regulatory authorities, data vendors, and pharmaceutical companies as users, and to develop guidelines on data quality specific to the medical healthcare domain. For example, TEHDAS, discussed in this paper, involves regulators, national data management organizations, academia, academic societies, private companies, and citizen groups from 21 EU member countries and 4 other European countries. In Japan, it is important to establish a cooperative framework in which a wide range of stakeholders can treat the issue as their own.

Financial support must also be considered. Horizon Europe, a European funding program for research and innovation in various fields, has set up a consortium composed of representatives of data users, data holders, HDABs, and other stakeholders relevant to the secondary use of health data. Within this consortium, studies are underway to develop a framework for the data quality and utility labels proposed in the EHDS, taking into account the wide range of data types and the burden on data holders, and to optimize the proposed framework through trial runs37). Public funding is also an important factor in sustaining effective studies.

(2) Standardization and publication of metadata catalogs

In order for data users to appropriately select a dataset suitable for their purpose of use, it is desirable to have a standardized metadata catalog in which the items needed for judging data quality are organized and published. To develop such a catalog, stakeholders should first agree on the data characteristic items (the information that should be included in the metadata catalog), taking into account the quality required for each secondary use purpose. To ensure easy access to information by data users and to reduce the burden on data holders, we believe that a public organization (like the HDAB in Europe) should be established to collect, organize, and disclose the necessary information from data holders in a centralized manner. (Furthermore, while language differences are not an issue if data is used only within Japan, a mechanism for translating metadata catalogs should be considered if data is to be used across borders, and there may be merit in having an HDAB-like body manage metadata catalogs centrally, as is the case in Europe.)
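As a purely illustrative sketch of how a standardized catalog could let a data user screen sources against their own purpose of use, the Python example below models a catalog entry and a user's requirements. All field names, thresholds, and example sources here are hypothetical assumptions made for illustration; they are not the items defined by TEHDAS, the EMA-HMA guide, or any Japanese guidebook.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    """One data source as described in a hypothetical metadata catalog."""
    name: str
    population_covered: int          # number of individuals in the source
    completeness: Dict[str, float]   # share of non-missing values per variable (0.0 to 1.0)
    update_lag_days: int             # delay between data capture and availability


@dataclass(frozen=True)
class Requirement:
    """What one data user needs for one specific purpose of use (hypothetical)."""
    min_population: int
    required_variables: Dict[str, float]  # variable name -> minimum completeness
    max_update_lag_days: int


def fits_purpose(entry: CatalogEntry, req: Requirement) -> bool:
    """Return True if the catalog metadata satisfies every stated requirement."""
    if entry.population_covered < req.min_population:
        return False
    if entry.update_lag_days > req.max_update_lag_days:
        return False
    # A variable missing from the catalog is treated as 0.0 completeness.
    return all(
        entry.completeness.get(variable, 0.0) >= minimum
        for variable, minimum in req.required_variables.items()
    )


def select_sources(catalog: List[CatalogEntry], req: Requirement) -> List[CatalogEntry]:
    """Keep only the sources whose published metadata meets the requirement."""
    return [entry for entry in catalog if fits_purpose(entry, req)]


# Made-up example data: two sources and one user's requirement.
catalog = [
    CatalogEntry("Disease registry A", 12_000,
                 {"diagnosis": 0.98, "lab_results": 0.60}, 30),
    CatalogEntry("Claims database B", 500_000,
                 {"diagnosis": 0.95, "prescription": 0.99}, 180),
]
requirement = Requirement(
    min_population=10_000,
    required_variables={"diagnosis": 0.90, "prescription": 0.90},
    max_update_lag_days=365,
)

for source in select_sources(catalog, requirement):
    print(source.name)
```

The same source can pass for one user and fail for another because the requirement is defined per purpose of use, which is the practical meaning of "fit for purpose" in this sketch. It also shows why the catalog items must be standardized: the comparison only works if every data holder reports the same items in the same way.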

Regarding the specific structure of the metadata catalog for judging quality, it is important for data users, including the pharmaceutical industry, to share with data holders and administrators the specific quality they seek for each purpose of use. In other words, it is desirable to present in an easy-to-understand manner what quality data users seek for their specific needs and from what perspective they judge quality, and then to set appropriate metadata catalog items once a common understanding has been fostered among data holders, administrators, and users. The pharmaceutical industry is expected to communicate its data quality needs on the basis of specific use cases, referring to the HMA's efforts.

However, the creation of metadata catalogs imposes a considerable burden on data owners and managers. We believe that Japan also needs to consider incentives (e.g., financial incentives) for the promotion of secondary use of data, including the development of metadata catalogs.

Conclusion

In this paper, we reviewed the efforts in Europe regarding "data quality" for secondary use of health and medical data. In Japan, the Council for Regulatory Reform and various other bodies are discussing the establishment of a Japanese counterpart modeled on the EHDS. At present, however, these discussions focus on the concept, mechanism, and infrastructure of data use, including the purposes of secondary use, the types of data to be used, and data governance such as whether consent is required, and we believe that data quality has not been discussed sufficiently. In order to further promote the secondary use of data in Japan, it is essential to ensure the quality of the health and medical data actually used, in addition to developing institutional policies. Referring to the European approach to data quality and the initiatives discussed in this report, Japan will need to take a coordinated national approach to fostering a common understanding of data quality among stakeholders and to developing a metadata catalog that lets users judge the quality of data.

In addition, while this paper focused on data quality, it is worth noting from the perspective of cross-border data use that the EHDS being considered in Europe is not only a data space for free data distribution within Europe; it also allows access from institutions outside the region that are recognized as conforming to its standards. Developments related to the EHDS will therefore have no small impact on Japan. From the pharmaceutical industry's perspective, if Japanese data distribution infrastructure, companies, and other entities are granted access, cross-border use of data will advance, and international drug development spanning Japan and Europe will likely be promoted39). Conversely, if Japan is not granted access, there is concern that it will fall behind other countries in data distribution and utilization. In developing an environment for the utilization of health and medical data in Japan, including data quality, it is essential to take international trends such as the EHDS into account in the data infrastructure, rather than building a data distribution system that is closed within Japan.

We hope that this report will help to promote data utilization in Japan.
