Chapter 8: Approaches to data collection

8.1. Primary data collection

8.1.1. General considerations

The methodological aspects of studies using primary data collection (also sometimes referred to as field studies or prospective studies) are well covered in the textbooks and guidelines referred to in the Introduction. Annex 1 of Module VIII of the Good pharmacovigilance practice provides examples of study designs based on prospective/primary data collection, such as cross-sectional study, prospective cohort study, and active surveillance. For completeness, surveys and randomised controlled trials are also presented below as examples of primary data collection.

Studies using primary data collection in clinical care or community-based settings have allowed the evaluation of drug-disease associations for rare complex conditions that require very large source populations and in-depth case assessment by clinical experts. Classic historical examples are Appetite-Suppressant Drugs and the Risk of Primary Pulmonary Hypertension (N Engl J Med. 1996;335:609-16), The design of a study of the drug etiology of agranulocytosis and aplastic anemia (Eur J Clin Pharmacol. 1983;24:833-6) and Medication Use and the Risk of Stevens–Johnson Syndrome or Toxic Epidermal Necrolysis (N Engl J Med. 1995;333:1600-8). For some conditions, case-control surveillance networks have been developed and used for selected studies and for signal generation and evaluation, e.g., Signal generation and clarification: use of case-control data (Pharmacoepidemiol Drug Saf 2001;10:197-203).

Data can be collected using paper or electronic case report forms or, increasingly, study-specific smartphone or web applications provided to patients. This approach has been used during the COVID-19 pandemic, as illustrated, for example, in COVID-19 vaccine waning and effectiveness and side-effects of boosters: a prospective community study from the ZOE COVID Study (Lancet Infect Dis. 2022:S1473-3099(22)00146-3): in this longitudinal, prospective, community-based study, data on demographic characteristics, comorbidities, symptoms, SARS-CoV-2 tests and results, and vaccinations were self-reported through an app, with participants prompted to report daily through app notifications. Possibilities, Problems, and Perspectives of Data Collection by Mobile Apps in Longitudinal Epidemiological Studies: Scoping Review (J Med Internet Res. 2021;23(1):e17691) concludes that using mobile technologies can help to overcome challenges linked to data collection in epidemiological research, but the applicability and acceptance of these mobile apps in various subpopulations vary and need to be further studied. In addition, self-reported data may introduce information bias or selection bias, and since participants are self-selected, they might not be fully representative of the general population.

8.1.2. Surveys

The book Research Methods in Education (J. Check, RK. Schutt, Sage Publications, 2011) defines survey research as "the collection of information from a sample of individuals through their responses to questions" (p. 160). This type of research allows for a variety of methods to recruit participants, collect data and utilise various instruments.

A survey is the collection of data on specific health and quality of life aspects, knowledge, attitudes, behaviour, practices, opinions, beliefs, or feelings of selected groups of individuals from a specific sampling frame, by asking them questions in person or by post, phone or online. Surveys generally have a cross-sectional design, but repeated measures over time may be performed for the assessment of trends.

Surveys have long been used in fields such as market research, social sciences and epidemiology. General guidance on constructing and testing the survey questionnaire, modes of data collection, sampling frames and ways to achieve representativeness can be found in general texts such as Survey Sampling (L. Kish, Wiley, 1995) and Survey Methodology (R.M. Groves, F.J. Fowler, M.P. Couper et al., 2nd Edition, Wiley 2009). The book Quality of Life: the assessment, analysis and interpretation of patient-related outcomes (P.M. Fayers, D. Machin, 3rd Edition, Wiley, 2016) offers a comprehensive review of the theory and practice of developing, testing and analysing health-related quality of life questionnaires in different settings.

Surveys have an important role in the evaluation of the effectiveness of risk minimisation measures (RMM) or of a risk evaluation and mitigation strategy (REMS) (see Chapter 16.4). The application of methods described in the aforementioned textbooks needs adaptation for surveys to evaluate the effectiveness of RMM or REMS. For example, the extensive methods for questionnaire development of quality of life scales (construct, criterion and content validity, inter-rater and test-retest reliability, sensitivity and responsiveness) are not appropriate for RMM questionnaires, which are often used only once. The EMA and FDA issued guidance documents on the conduct of surveys for risk minimisation (RM) which, together, encompass the selection of risk minimisation measures, study design, instrument development, data collection, processing and data analysis and presentation of results. This guidance includes the draft EMA Guideline on good pharmacovigilance practices (GVP) Module XVI – Risk minimisation measures: selection of tools and effectiveness indicators (Rev 3) (2021), the FDA draft guidance for industry REMS Assessment: Planning and Reporting on REMS (2019) and the FDA Guidance on Survey Methodologies to Assess REMS Goals That Relate to Knowledge (2019). A checklist to assess the quality of studies evaluating risk minimisation programmes is provided in The RIMES Statement: A Checklist to Assess the Quality of Studies Evaluating Risk Minimization Programs for Medicinal Products (Drug Saf. 2018;41(4): 389-401). The article Are Risk Minimization Measures for Approved Drugs in Europe Effective? A Systematic Review (Expert Opin Drug Saf. 2019;18(5):443-54) highlights the need for improvement in the methods and presentation of results and for more hybrid designs that link survey data with health and safety outcomes as requested by regulators. This article also reports on low response rates found in many studies, allowing for the possibility of important bias.
The response rate should therefore be reported in a standardised way in surveys to allow comparisons. Standard Definitions. Final Dispositions of Case Codes and Outcome Rates for Surveys (2016) of the American Association for Public Opinion Research provides standard definitions which can be adapted to RM surveys and the FDA Guidance on Survey Methodologies to Assess REMS Goals That Relate to Knowledge (2019) provides guidance for RM surveys.
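As an illustration of standardised reporting, the following minimal Python sketch computes two outcome rates in the spirit of the AAPOR Response Rate 1 (complete questionnaires only) and Response Rate 2 (complete plus partial questionnaires). The class and function names and the disposition counts are hypothetical, and the grouping of dispositions is simplified; the AAPOR document should be consulted for the full definitions and for rates that estimate eligibility among cases of unknown eligibility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispositions:
    """Final case dispositions of a survey sample (counts of sampled units)."""
    complete: int             # I: complete questionnaires
    partial: int              # P: partial questionnaires
    refusal: int              # R: refusals and break-offs
    non_contact: int          # NC: eligible but never contacted
    other: int                # O: other eligible non-respondents
    unknown_eligibility: int  # cases whose eligibility could not be determined


def _denominator(d: Dispositions) -> int:
    # All sampled cases that are eligible or of unknown eligibility.
    total = (d.complete + d.partial + d.refusal + d.non_contact
             + d.other + d.unknown_eligibility)
    if total == 0:
        raise ValueError("no eligible or potentially eligible cases")
    return total


def response_rate_1(d: Dispositions) -> float:
    """Minimum response rate: only complete questionnaires count as responses."""
    return d.complete / _denominator(d)


def response_rate_2(d: Dispositions) -> float:
    """As response_rate_1, but partial questionnaires also count as responses."""
    return (d.complete + d.partial) / _denominator(d)


# Hypothetical dispositions for a survey of 1,000 sampled prescribers.
example = Dispositions(complete=180, partial=20, refusal=250,
                       non_contact=400, other=50, unknown_eligibility=100)
print(f"RR1 = {response_rate_1(example):.1%}")  # RR1 = 18.0%
print(f"RR2 = {response_rate_2(example):.1%}")  # RR2 = 20.0%
```

Reporting the full set of dispositions alongside the rate allows readers to recompute it under a different definition when comparing surveys.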

An important aspect of surveys is sampling, often using a clustered random sample. However, attention should be paid to the selection of the original list of subjects in the target population. For example, if the evaluation of the awareness about an educational material is part of the objectives, the same lists which were used to distribute the educational material cannot be used as the sampling frame for the survey; otherwise, selection bias cannot be excluded.
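A clustered random sample can be sketched as a two-stage draw: clusters (for example, practices) are selected at random, then individuals are selected at random within each selected cluster. The Python example below is only an illustration under stated assumptions: the function name, the frame and its identifiers are hypothetical, and the frame is assumed to come from a source independent of any list used to distribute educational material.

```python
import random


def two_stage_cluster_sample(frame, n_clusters, n_per_cluster, seed):
    """Select clusters at random, then individuals at random within each.

    frame: dict mapping a cluster id (e.g. a practice) to the list of
    individuals (e.g. prescribers) belonging to that cluster.
    """
    rng = random.Random(seed)  # fixed seed so that the draw can be documented
    chosen = rng.sample(sorted(frame), n_clusters)  # stage 1: clusters
    sample = {}
    for cluster_id in chosen:
        members = frame[cluster_id]
        k = min(n_per_cluster, len(members))  # small clusters are taken whole
        sample[cluster_id] = rng.sample(members, k)  # stage 2: individuals
    return sample


# Hypothetical sampling frame: 10 practices with 8 prescribers each.
frame = {
    f"practice_{p:02d}": [f"prescriber_{p:02d}_{i}" for i in range(8)]
    for p in range(10)
}
sample = two_stage_cluster_sample(frame, n_clusters=3, n_per_cluster=4, seed=2024)
for cluster_id, members in sample.items():
    print(cluster_id, members)
```

With clusters of unequal size, a fixed number of individuals per cluster gives unequal selection probabilities, so the analysis would normally need to account for both the clustering and the sampling weights.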

The increasing use of online RMM requires survey methods to adapt, but these should not sacrifice representativeness by accessing only populations that visit the relevant websites. Studies should provide evidence that the results obtained using these sampling methods are not biased. Similarly, with the increasing use of healthcare professional and patient panels, survey methods should not sacrifice representativeness by accessing only self-selected participants in these panels, and studies should provide evidence that the results are not biased by the use of these convenient sampling frames. Information given to survey subjects about the survey prior to its completion should be designed to minimise its influence on their responses, in order to reduce bias.

The issue of thresholds to assess the effectiveness of RMM remains a topic of debate. This topic is discussed in the aforementioned EMA and FDA documents and the article Are Risk Minimization Measures for Approved Drugs in Europe Effective? A Systematic Review (Expert Opin Drug Saf. 2019;18(5):443-54). The thresholds need to be viewed in the context of their potential impact on the benefit-risk balance. Composite thresholds for all three aspects (awareness, knowledge, and behaviour) of RM effectiveness are rarely achieved.

The draft EMA Guideline on good pharmacovigilance practices (GVP) Module XVI – Risk minimisation measures: selection of tools and effectiveness indicators (Rev 3) (2021) encourages linking the evaluation of process indicators to health outcomes. A holistic evaluation of non-targeted effects as well as product-specific targeted effects has so far been performed in only a minority of studies, as shown in Risk Minimisation Evaluation with Process Indicators and Behavioural or Health Outcomes in Europe: Systematic Review (Expert Opin Drug Saf. 2019;18(5):443-54).

8.1.3. Randomised controlled trials

Randomised controlled trials are an experimental design that involves primary data collection. There are numerous textbooks and publications on methodological and operational aspects of clinical trials which are not covered here. An essential guideline on clinical trials is the European Medicines Agency (EMA) Guideline for good clinical practice E6(R2), which specifies obligations for the conduct of clinical trials to ensure that the data generated in the trial are valid. From a legal perspective, Volume 10 of the Rules Governing Medicinal Products in the European Union contains all guidance and legislation relevant to the conduct of clinical trials. A number of documents are under revision.

The way clinical trials are conducted in the European Union (EU) underwent a major change when the Clinical Trials Regulation (Regulation (EU) No 536/2014) came into effect and replaced Directive 2001/20/EC.

Hybrid data collection, as used in pragmatic trials, large simple trials and randomised database studies, is described in Chapter 4.2.7.

8.2. Secondary use of data

Secondary use of data refers to the utilisation of data already collected for other purposes. These data can be further linked to prospectively collected medical and non-medical data. Electronic healthcare databases (e.g., claims databases, electronic health records) and patient registries are examples of data sources that can be leveraged as secondary data for pharmacoepidemiological studies.

The last decades have witnessed the development of key data resources, expertise and methodology that have allowed use of such data for pharmacoepidemiology. The ENCePP Inventory of Data Sources contains information on existing European and worldwide databases that may be used for pharmacoepidemiological research. However, this field is continuously evolving.

A description of the main features, applications and limitations of frequently used electronic healthcare databases for pharmacoepidemiology research in the United States and in Europe is presented in the textbook Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 6th Edition, Wiley, 2019, Chapters 11-14).

In order to assist in the selection and appropriate use, including the assessment of strengths and limitations, of data sources for pharmacoepidemiological research, the ISPE-endorsed Guidelines for Good Database Selection and use in Pharmacoepidemiology Research (Pharmacoepidemiol Drug Saf. 2012;21(1):1-10) highlights potential limitations of data sources for secondary use containing routinely collected healthcare information, such as electronic health records (from either primary or secondary care) and claims databases, and recommends procedures for data analysis and interpretation. A section of the guideline is dedicated to multi-database studies which may be defined as “studies using at least two healthcare databases, which are not linked with each other at an individual person level, either because they insist on different populations, or because, even if populations overlap, local regulations forbid record linkage” (see Chapter 9). References to data quality and validation procedures, data processing/transformation, and data privacy and security (see Chapter 12.2) are also provided. In Different Strategies to Execute Multi-Database Studies for Medicines Surveillance in Real-World Setting: A Reflection on the European Model (Clin Pharmacol Ther. 2020;108(2):228-235), four strategies to conduct multi-database studies are discussed (see also Chapter 9). Specific processes have also been proposed to identify fit-for-purpose data sources to address research questions. For example, The Structured Process to Identify Fit-For-Purpose Data: A Data Feasibility Assessment Framework (Clin Pharmacol Ther. 2022;111(1):122-34) provides a structured and detailed stepwise approach for the identification and feasibility assessment of candidate data sources for a specific study. 
In order to help signpost regulators, researchers, industry and evidence reviewers to the relevant data sources to address a research question, the joint EMA-EU Heads of Medicines Agencies Big Data Steering Group has also published a list of metadata (2022) describing data sources and studies, defined following extensive consultation of interested parties. This list will be used in the rebuilding and enhancement of the ENCePP Inventory of Data sources. Experience will show how such initiatives can support the validity and transparency of study results and ultimately the level of confidence in the evidence provided. It should also be acknowledged that many investigators naturally use the data source(s) they can directly access and are familiar with.

The FDA Best Practices for Conducting and Reporting Pharmacoepidemiologic Safety Studies Using Electronic Health Care Data Sets (2013) provides criteria for best practice that apply to the study design, analysis, conduct and documentation. It emphasises that investigators should understand the potential limitations of electronic healthcare data systems, make provisions for their appropriate use and refer to validation studies of the outcomes of interest in the proposed study as captured in the database. This is also covered in the UK MHRA guidance on the use of real-world data in clinical studies to support regulatory decisions (2021). Guidance for conducting studies within electronic healthcare databases can also be found in the International Society for Pharmacoepidemiology Guidelines for Good Pharmacoepidemiology Practices (ISPE GPP, 2015), in particular section IV-B (Study conduct, Data collection). This guidance emphasises the importance of patient data protection.

The use of real-world data (RWD) for the generation of real-world evidence (RWE) for regulatory decision-making has been addressed by guidelines issued by regulatory agencies. The article Real-World Data for Regulatory Decision Making: Challenges and Possible Solutions for Europe (Clin Pharmacol Ther. 2019;106(1):36-9) describes the operational, technical and methodological challenges for the acceptability of RWD for regulatory purposes and presents possible solutions to address these challenges. The draft FDA guidance Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products (2021) provides recommendations focused on regulatory studies using electronic health records and claims databases, and a more general draft guidance provides Considerations for the Use of Real-World Data and Real-World Evidence To Support Regulatory Decision-Making for Drug and Biological Products (December 2021). More information on RWD and RWE is available in Chapter 16.7, Real-world evidence and pharmacoepidemiology.

The Joint ISPE-ISPOR Special Task Force Report on Good Practices for Real‐World Data Studies of Treatment and/or Comparative Effectiveness (2017) recommends good research practices for designing and analysing retrospective databases for comparative effectiveness research (CER) and reviews methodological issues and possible solutions for CER studies based on secondary data analysis (see also Chapter 16.1). Many of the principles are applicable to studies with other objectives than CER, but some aspects of pharmacoepidemiological studies based on secondary use of data, such as data quality, ethical issues, data ownership and privacy, are not covered.

Most of the examples and methods covered in Chapter 4 are based on studies and methodologic developments concerning secondary use of healthcare databases, since this is one of the most frequent approaches used in pharmacoepidemiology.

8.3. Patient registries

8.3.1. Definitions

A patient registry is defined in the EMA’s Guideline on registry-based studies (2021) as an organised system that collects uniform data (clinical and other) to identify specified outcomes for a population defined by a particular disease, condition or exposure. It should be considered as an infrastructure for the standardised recording of data from routine clinical practice on individuals identified by a characteristic or an event.

A registry-based study is the investigation of a research question using the data collection infrastructure or patient population of one or several patient registries. A registry-based study may be a non-interventional study or a clinical trial.

8.3.2. General guidance on patient registries

The EMA’s Guideline on registry-based studies (2021) includes an Annex discussing several aspects of good practice considered relevant for the use of registries for registry-based studies and other possible regulatory purposes. It addresses the registry population, data elements, quality management, governance and data sharing.

The EMA’s Scientific Advice Working Party issued Qualification Opinions for several registry platforms, including the ECFSPR for cystic fibrosis, the EBMT for blood & marrow transplantation and Enroll-HD for Huntington's disease, with an evaluation of their potential use as data sources for registry-based studies in specific regulatory contexts. These opinions provide an indication of the methodological components expected by regulators for using a disease registry for such studies.

The US Agency for Healthcare Research and Quality (AHRQ) published ‘good registry practices’ under the title Registries for Evaluating Patient Outcomes: A User's Guide, 4th Edition (2020), which provides comprehensive methodological guidance on planning, design, implementation, analysis, interpretation and evaluation of the quality of a registry.

The FDA issued the draft guidance Real-World Data: Assessing Registries to Support Regulatory Decision-Making for Drug and Biological Products Guidance for Industry (2021). This guidance provides sponsors (marketing authorisation applicants and holders) and other relevant stakeholders with considerations when proposing to design a registry, or when using an existing registry to support regulatory decision-making about a drug's effectiveness or safety.

8.3.3. Types of patient registries

The characteristic or event defining entry into a patient registry may be the diagnosis of a disease (disease registry), the occurrence of a condition or event (e.g., pregnancy registry), a birth defect (e.g., birth defect registry), a molecular or a genomic feature, or any other patient characteristics.

The term product registry has been used for a data collection system where data are collected on patients exposed to a particular medicinal product, single substance or therapeutic class in order to evaluate their use or their effects. Such a system should rather be considered a clinical trial or non-interventional study, as data are collected for the purpose of a specific pre-planned analysis in line with performing a trial/study. Moreover, it does not include specific aspects related to the use of patient registries as source population or existing data collection system.

The terms population registry or register have been used to describe the type of registries that exist in European Nordic countries. In these countries, a comprehensive registration of data covering the entire population allows linkage between different patient registries that may include hospital encounters, diagnoses and procedures, such as the Norwegian Patient Registry, the Danish National Patient Registry or the Swedish National Patient Register. Review of 103 Swedish Healthcare Quality Registries (J Intern Med. 2015; 277(1): 94–136) describes healthcare ‘quality’ registries initiated mostly by Swedish physicians that focus on specific disorders. Data recorded may include aspects of disease management, self-reported quality of life, lifestyle and general health status and provide an important data source for research.

8.3.4. Registry-based studies

As outlined in Imposed registries within the European postmarketing surveillance system (Pharmacoepidemiol Drug Saf. 2018;27(7):823-26) and the EMA’s Guideline on registry-based studies (2021), there are important methodological differences between the registries and the conduct of registry-based studies. Patient registries are often integrated into routine clinical practice with systematic and sometimes automated data capture in electronic healthcare records. A registry-based study may only use the data relevant for the specific study objectives, is often limited in time and may need to be enriched with additional information on outcomes, lifestyle data, immunisation or mortality information. Such information may be obtained from linkage to existing databases such as national cancer registries, prescription databases or mortality records.

Results obtained from analyses of registry data may be affected by the same biases as those of studies described in Chapter 5 of this Guide. Factors that may influence the enrolment of patients in a registry may be numerous (including clinical, demographic and socio-economic factors) and difficult to predict and identify. This will potentially result in a biased sample of the patient population if recruitment has not been exhaustive. Bias may also be introduced by differential completeness of follow-up and data collection.

As illustrated in The randomized registry trial--the next disruptive technology in clinical research? (N Engl J Med. 2013; 369(17): 1579-81) and Registry-based randomized controlled trials: what are the advantages, challenges and areas for future research? (J Clin Epidemiol. 2016;80:16-24), and more recently in Registry randomised trials: a methodological perspective (BMJ Open 2023; 13(3)), randomised registry-based trials may support enhanced generalisability of findings, rapid consecutive enrolment, and the potential completeness of follow-up for the reference population, when compared with conventional randomised effectiveness trials. Defining key design elements of registry-based randomised controlled trials: a scoping review (Trials 2020;21(1):552) concludes that the low cost, reduced administrative burden and enhanced external validity make registry-based randomised controlled trials an attractive research methodology to be used to address questions of public health importance. However, the issues of data integrity, completeness, timeliness, validation and adjudication of endpoints need to be carefully addressed.

8.3.5. Interoperability between registries

A complexity of using registry data for regulatory purposes and analyses is the need for interoperability between different registries covering the same disease or condition. In most cases, there is no global alignment on how to collect data (data format, expression of a variable) in registries and often no mandatory standards to be applied for the data collected (content/variables). Interoperability of disease registries has been addressed in several workshops on disease-specific registries organised by EMA. The reports of these workshops are available on the EMA Patient registries initiative website. They describe the expectations from different stakeholders on common data elements to be collected and the best practices on topics such as governance, data quality control, data sharing or reporting of safety data.

One way to approach the challenge of heterogeneity between registries is the adoption of globally common data structures in preparing registry data for joint analyses. One example is the Observational Medical Outcomes Partnership common data model (OMOP CDM) of the Observational Health Data Sciences and Informatics (OHDSI) group. The OMOP CDM was originally designed for electronic healthcare records and claims data, which represent the majority of the 331 data sources from 34 countries. As stated by OHDSI in Our Journey (p. 36), the data mapped to the OMOP CDM as of January 2022 amounted to 810 million unique patient records. Registry data are only slowly being introduced into the OMOP CDM.
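To make the idea of mapping to a common data structure concrete, the following Python sketch converts records from a hypothetical registry export into rows shaped like the OMOP CDM PERSON table. The input field names (patient_code, sex, birth_date) and the function name are assumptions made for illustration; the output column names and the standard gender concept identifiers (8507 for male, 8532 for female, 0 for no matching concept) follow the OMOP CDM conventions. A real mapping would also cover vocabularies for diagnoses, exposures and outcomes, which are not shown here.

```python
from datetime import date

# Standard OMOP gender concepts; 0 is the convention for "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}
NO_MATCHING_CONCEPT = 0


def to_omop_person(record, person_id):
    """Map one (hypothetical) registry record to an OMOP PERSON-like row."""
    birth = date.fromisoformat(record["birth_date"])
    sex = record["sex"].strip().upper()
    return {
        "person_id": person_id,
        "gender_concept_id": GENDER_CONCEPTS.get(sex, NO_MATCHING_CONCEPT),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        # Not collected by this hypothetical registry.
        "race_concept_id": NO_MATCHING_CONCEPT,
        "ethnicity_concept_id": NO_MATCHING_CONCEPT,
        # Source values are kept so that the mapping remains traceable.
        "person_source_value": record["patient_code"],
        "gender_source_value": record["sex"],
    }


# Hypothetical registry export.
registry_records = [
    {"patient_code": "REG-0001", "sex": "M", "birth_date": "2010-04-17"},
    {"patient_code": "REG-0002", "sex": "f", "birth_date": "2012-11-02"},
    {"patient_code": "REG-0003", "sex": "unknown", "birth_date": "2009-01-30"},
]
person_rows = [
    to_omop_person(rec, person_id=i)
    for i, rec in enumerate(registry_records, start=1)
]
for row in person_rows:
    print(row["person_id"], row["gender_concept_id"], row["year_of_birth"])
```

Keeping the source values next to the mapped concepts lets each contributing registry document how local codes were translated, which matters when mapped data from several registries are analysed jointly.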

8.3.6. Registries which capture special populations

Special populations can be identified based on age (e.g., birth, paediatric or elderly), pregnancy status, renal or hepatic function, race, or genetic differences. Some registries are focused on these particular populations.

For paediatric populations, specific and detailed information such as neonatal age (e.g., in days), pharmacokinetic parameters and organ maturation needs to be considered but is usually missing from traditional data sources; paediatric-specific registries are therefore important. The Guideline on good pharmacovigilance practices (GVP) Product- or Population-Specific Considerations IV: Paediatric population (2018) provides further relevant information. An example of a registry which focuses on paediatric patients is Pharmachild, which captures children with juvenile idiopathic arthritis undergoing treatment with methotrexate or biologic agents.

Pregnancy registries include pregnant women followed until the end of pregnancy and provide information on pregnancy outcomes. Use of pregnancy registries for observational studies on adverse effects of medicinal products administered during pregnancy is often faced with multiple challenges, which may vary from registry to registry. They include not only the recruitment and retention of pregnant women, but also the identification of relevant control groups for comparisons and the complete recording of information on pregnancy outcomes. Embryonic and early foetal loss are often not recognised or recorded and data on the gestational age at which these events occur are often missing. Non-interventional studies may therefore require linkage with data captured in birth defects registries, teratology information services or electronic health care records where mother-child linkage is possible. The EMA Draft Guideline on good pharmacovigilance practices. Product- or Population-Specific Considerations III – Pregnancy prevention programme and other pregnancy-specific risk minimisation measures (2022) provides methodological recommendations for use of a pregnancy registry for data collection in additional pharmacovigilance activities. The FDA Draft Postapproval Pregnancy Safety Studies Guidance for Industry (2019) includes recommendations for designing a pregnancy registry with a description of research methods and elements to be addressed. The Systematic overview of data sources for drug safety in pregnancy research (2016) provides an inventory of pregnancy exposure registries and alternative data sources on safety of prenatal drug exposure and discusses their strengths and limitations.
Examples of population-based registries that allow the assessment of outcomes of drug exposure during pregnancy are the European network of registries for the epidemiologic surveillance of congenital anomalies EUROCAT, the EUROmediSAFE inventory of data sources, and the pan-Nordic registries which record drug use during pregnancy as illustrated in Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design (BMJ 2015;350:h1798).

In the context of rare diseases, the European Reference Networks (ERNs), consisting of 24 virtual networks composed of healthcare providers across Europe, aim to facilitate discussion on such complex diseases and conditions that require highly specialised treatment, and concentrated knowledge and resources. One of the purposes of the European Rare Disease Research Coordination and Support Action consortium (ERICA), in which all 24 networks take part, is to build on the strengths of the individual ERNs and create a platform that integrates all ERNs' research and innovation capacity. Various activities intend to advance the development and integration of ERN-wide rare disease registries and their utilisation for joint research initiatives. The Network supports the creation of biorepositories within and across ERNs, and promotes the use of the European Platform on Rare Diseases Registration (EU RD Platform) for research.

Other registries that focus on special populations can be found in the ENCePP Inventory of data sources, in the European Platform on Rare Diseases Registration (EU RD Platform), and in Orphanet.

8.3.7. Disease registries in regulatory practice and health technology assessment

Use of real-world data (RWD), including registry data, to support regulatory decision-making is a topic of high interest. Several studies have evaluated the frequency and usefulness of information based on RWD in marketing authorisation applications, but did not present results stratified by data source. The article Marketing Authorization Applications Made to the European Medicines Agency in 2018-2019: What was the Contribution of Real-World Evidence? (Clin Pharmacol Ther. 2022;111(1):90-7) shows that registries were the most common type of RWD sources referred to in marketing authorisation applications and extensions of indications submitted to the EMA in 2018 and 2019 (60.3% and 46.4% respectively of the medicinal products presented with RWE). The follow-up study described in Contribution of Real-World Evidence in European Medicines Agency’s Regulatory Decision Making (Clin Pharmacol Ther. 2023;113(1):135-151) provides an in-depth review of real-world evidence (RWE) submitted in recent centralised applications in the EU, illustrated by examples of RWE contribution to regulatory decision-making.

The article Patient Registries: An Underused Resource for Medicines Evaluation: Operational proposals for increasing the use of patient registries in regulatory assessments (Drug Saf. 2019;42(11):1343-51) proposes sets of measures to improve use of registries in relation to: (1) nature of the data collected and registry quality assurance processes; (2) registry governance, informed consent, data protection and sharing; and (3) stakeholder communication and planning of benefit-risk assessments. The EMA’s Guideline on registry-based studies (2021) discusses methodological aspects for the use of registries for conducting registry-based studies and recommends performing a feasibility assessment of the suitability of a registry for a specific research question to facilitate early discussions with regulators. The use of registries to support the post-authorisation collection of data on safety and effectiveness of medicinal products in the routine treatment of diseases is also discussed in the EMA Guideline on good pharmacovigilance practices (GVP) – Module VIII – Post-authorisation safety studies (2017) and the EMA Scientific guidance on post-authorisation efficacy studies (2016).

As outlined in Real World Data in Health Technology Assessment of Complex Health Technologies (Front Pharmacol. 2022; 13: 837302), incorporating data from clinical practice into the drug development process is of growing interest for Health Technology Assessment (HTA) bodies and payers since reimbursement decisions can benefit from better estimation and prediction of effectiveness of treatments at the time of product launch. An example where registries can provide clinical practice data is the building of predictive models that incorporate data from both randomised clinical trials (RCTs) and registries to generalise results observed in RCTs to a real-world setting. In this context, the EUnetHTA Joint Action 3 project has issued the Registry Evaluation and Quality Standards Tool (REQueST) aiming to guide the evaluation of registries for effective use in HTA.

Patient experience data collected through patient registries can inform medicine development, enhance regulatory decision-making, and result in more patient-relevant outcomes to study. However, the generation of these data remains challenging, as highlighted for example in A review of patient-reported outcomes used for regulatory approval of oncology medicinal products in the European Union between 2017 and 2020 (Front Med (Lausanne). 2022 Aug 12;9:968272).

8.3.8. Registry catalogues

Several data source catalogues provide different levels of access to different amounts of information on disease registries, such as the ENCePP Resource database of data sources, the EHDEN data partners listing or the EMIF Catalogue. The European Platform on Rare Diseases Registration (EU RD Platform) serves as a platform for information on registries for rare diseases and has developed a harmonised set of common data elements for rare disease registration.

In the context of the EMA/HMA Big Data Initiative, the ENCePP Resource database of data sources and the EU PAS Register will be enhanced and replaced in 2024 by two new catalogues of RWD sources and non-interventional studies in view of facilitating the identification by regulators, researchers and pharmaceutical companies of data sources and studies suitable to address research questions, based on the FAIR (findable, accessible, interoperable and reusable) data principles.

8.4. Spontaneous reports

Note: Chapter 8.4. (formerly 7.4.) has not been updated for Revision 11 of the Guide, as contents remain up-to-date.

Spontaneous reports of suspected adverse drug reactions remain a cornerstone of pharmacovigilance and are collected from a variety of sources, including healthcare providers, national authorities, pharmaceutical companies, medical literature, and directly from patients.

EudraVigilance is the European Union data processing network and management system for reporting and evaluating suspected adverse drug reactions (ADRs). Other major systems for the collection of spontaneous reports are the FDA’s Adverse Event Reporting System (FAERS), the FDA’s Vaccine Adverse Event Reporting System (VAERS) and the WHO global database of individual case safety reports, VigiBase, which pools reports from the members of the WHO programme for international drug monitoring. These systems deal with the electronic exchange of Individual Case Safety Reports (ICSRs), the early detection of possible safety signals and the continuous monitoring and evaluation of potential safety issues in relation to reported ADRs. Spontaneous case reports represent the first line of evidence and the majority of safety signals are based on them, as described in A description of signals during the first 18 months of the EMA pharmacovigilance risk assessment committee (Drug Saf. 2014;37(12):1059-66).

The main strengths of spontaneous reporting systems are:

i) they cover all types of authorised medicines used in any setting (primary, secondary and specialised healthcare) and all reasons for use including authorised indications, off-label, misuse and abuse;

ii) they are built to obtain information specifically to evaluate the likelihood that a particular treatment is the cause of an observed adverse event. The data collection concentrates on variables relevant to this objective directing reporters towards careful coding and communication of the main aspects of an ADR (e.g., event dates, medical history and co-morbidities, concomitant treatments, etc.);

iii) they are designed to collect and make the information on suspected ADRs rapidly available for analysis.

The application of knowledge discovery in databases to post-marketing drug safety: example of the WHO database (Fundam Clin Pharmacol. 2008;22(2):127-40) describes known limitations of spontaneous ADR reporting systems, which can be grouped into four main categories:

i) factors influencing reporting dynamics, whereby known or unknown factors, such as workload of healthcare professionals or increased media coverage and public awareness, may influence the reporting rate, leading respectively to under-reporting or to a comparative increase in the reporting rate affecting the reliability of estimates of signals of disproportionate reporting;

ii) insufficient clinical information reported, not allowing a satisfactory case evaluation and/or the identification of possible risk factors, which is crucial to establish the likely causal relationship between exposure to the product and occurrence of the adverse drug reaction;

iii) misclassification of diagnosis, which is closely related to the factors influencing reporting dynamics: extensive media coverage and public awareness not only stimulate reporting but may also influence the interpretation of symptoms, so that symptoms resembling those of the disorder covered in the media are likely to be reported as suspected cases of that disorder, to the detriment of other disorders with similar symptoms;

iv) lack of collection of control information, as these databases are case-only databases and thus cannot provide information on actual medicinal product exposure or on disease incidence.

Another challenge of spontaneous reporting databases is the quality of the information provided and adherence to reporting rules; for this reason, comprehensive and multi-faceted quality activities are often an integral part of these systems (see Detailed guide regarding the EudraVigilance data management activities by the European Medicines Agency Rev 1 for an example). One aspect of the data quality activities concerns report duplication. Duplicates are separate and unlinked records that refer to one and the same case of a suspected ADR and may mislead signal assessment or distort statistical screening. They are generally detected by individual case review of all reports or by computerised duplicate detection algorithms. In Performance of probabilistic method to detect duplicate individual case safety reports (Drug Saf. 2014;37(4):249-58) a probabilistic method applied to VigiBase highlighted duplicates that had been missed by a rule-based method and also improved the accuracy of manual review. The study did not, however, demonstrate whether de-duplication methods improve signal detection. The EMA and FDA have also implemented probabilistic duplicate detection in their databases.
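The general idea behind probabilistic duplicate detection can be illustrated with a minimal sketch. The field names, the agreement probabilities and the threshold below are illustrative assumptions and do not reproduce the parameters or the method used in VigiBase, EudraVigilance or FAERS. Each field shared by two reports contributes a log-likelihood-ratio weight: positive when the values agree, negative when they disagree, and zero when either value is missing.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two reports describe the same case)
#   u = P(field agrees | the two reports are unrelated)
FIELD_PROBS = {
    "drug": (0.95, 0.05),
    "reaction": (0.90, 0.02),
    "onset_date": (0.85, 0.01),
    "age": (0.90, 0.03),
    "sex": (0.98, 0.50),
    "country": (0.97, 0.10),
}


def match_score(report_a, report_b):
    """Sum log2 likelihood ratios over the fields present in both reports."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = report_a.get(field), report_b.get(field)
        if a is None or b is None:
            continue  # missing information is treated as uninformative
        if a == b:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def likely_duplicates(reports, threshold=10.0):
    """Return index pairs whose score reaches the (assumed) threshold."""
    pairs = []
    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            if match_score(reports[i], reports[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Invented example reports: r2 repeats r1 with the age missing.
r1 = {"drug": "drug A", "reaction": "hepatitis", "onset_date": "2020-03-01",
      "age": 54, "sex": "F", "country": "FR"}
r2 = {"drug": "drug A", "reaction": "hepatitis", "onset_date": "2020-03-01",
      "age": None, "sex": "F", "country": "FR"}
r3 = {"drug": "drug A", "reaction": "rash", "onset_date": "2021-07-15",
      "age": 30, "sex": "M", "country": "DE"}

suspected = likely_duplicates([r1, r2, r3])
```

In practice, pairs flagged in this way would still be confirmed by individual case review, and the pairwise comparison shown here would need a blocking step to scale to large databases.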

More recently, there have been attempts to boost the computerised detection of duplicates using Natural Language Processing (NLP) techniques to identify similarities in the narratives of reports, as demonstrated in Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems (Drug Saf. 2017;40(7):571–82).
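As a rough illustration of what comparing narratives can mean, the sketch below computes a token-set (Jaccard) similarity between two free-text narratives. This is a generic stand-in chosen for brevity, not the NLP method of the cited study; the function names are hypothetical.

```python
import re


def tokens(narrative):
    """Return the set of lower-case word tokens in a free-text narrative."""
    return set(re.findall(r"[a-z0-9]+", narrative.lower()))


def jaccard(narrative_a, narrative_b):
    """Share of distinct tokens common to both narratives (0.0 to 1.0)."""
    ta, tb = tokens(narrative_a), tokens(narrative_b)
    if not ta and not tb:
        return 0.0  # two empty narratives carry no evidence of similarity
    return len(ta & tb) / len(ta | tb)


# Invented narratives: same wording apart from case and punctuation.
similarity = jaccard("Patient developed rash after first dose",
                     "patient developed RASH after first dose.")
```

Such a score could be used alongside agreement on structured fields, with the weighting between the two left as a design choice to be validated against manually reviewed cases.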

For the above reasons, it is advised that the cases underlying a potential safety signal from spontaneous reports should be verified from a clinical perspective and preferably supported by pharmacological information before further investigation. Anecdotes that provide definitive evidence (BMJ. 2006;333(7581):1267-9) describes uncommon examples where this is not necessary, where strong and well documented spontaneous reports can be convincing to support the existence of a signal.

Patient reporting is an important source of suspected adverse drug reactions. Factors affecting patient reporting of adverse drug reactions: a systematic review (Br J Clin Pharmacol. 2017;83(4):875-83) describes the practical difficulties with patient reporting and highlights the patients’ motivation to make their ADRs known to prevent similar suffering in other patients. The value of patient reporting to the pharmacovigilance system: a systematic review (Br J Clin Pharmacol. 2017;83(2):227-46) concludes that patient reporting adds new information and perspective about ADRs in a way otherwise unavailable, and this can contribute to better regulatory decision-making. Patient Reporting in the EU: Analysis of EudraVigilance Data (Drug Saf. 2017;40(7):629-45) also concludes that patient reporting complements reporting by health care professionals and that patients are motivated to report especially those ADRs that affect their quality of life.

The information collected in spontaneous reports is a reflection of a clinical event that has been attributed to the use of one or more suspected medicinal products. Although the majority of information provided in the ICSRs is coded, the description of the clinical event, as well as the interpretation of the reporter, contains valuable information for signal detection purposes. Examples are the description of timing and course of the reactions, of the presence or absence of additional risk factors and of the medical history of the patient. Knowledge of the local healthcare system, its corresponding guidelines and the possibilities to follow-up for more detailed information are considered important during this review.

Since only part of this information is coded and can be used in statistical analyses, it remains important to review the underlying cases for signal detection purposes.

The increase in systematic collection of ICSRs in large electronic databases has allowed the application of data mining and statistical techniques for the detection of safety signals (see Chapter 11). Validation of statistical signal detection procedures in EudraVigilance post-authorisation data: a retrospective evaluation of the potential for earlier signalling (Drug Saf. 2010;33(6):475-87) shows that the statistical methods applied in EudraVigilance can provide significantly earlier warning in a large proportion of drug safety concerns. Nonetheless, this approach should supplement, rather than replace, other pharmacovigilance methods.
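One widely used family of such techniques is disproportionality analysis. As a minimal sketch, the proportional reporting ratio (PRR) compares the proportion of reports mentioning an event among reports for the drug of interest with the same proportion among reports for all other drugs. The counts below are invented, and the screening thresholds (at least three cases and a lower 95% confidence bound above 1) are assumptions for illustration rather than the rule applied by any particular database.

```python
import math


def prr_with_ci(a, b, c, d):
    """PRR and approximate 95% CI from a 2x2 table of report counts.

    a: drug of interest, event of interest
    b: drug of interest, all other events
    c: all other drugs, event of interest
    d: all other drugs, all other events
    """
    if a <= 0 or c <= 0:
        raise ValueError("a and c must be positive to compute the PRR")
    estimate = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(PRR), as for a ratio of two proportions
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper


def is_sdr(a, b, c, d, min_cases=3):
    """Flag a signal of disproportionate reporting (assumed thresholds)."""
    if a < min_cases:
        return False
    _, lower, _ = prr_with_ci(a, b, c, d)
    return lower > 1.0


# Invented counts for one drug-event combination
estimate, lower, upper = prr_with_ci(20, 980, 100, 98900)
```

A flagged combination is only a statistical signal of disproportionate reporting; because spontaneous databases lack exposure data, it does not estimate risk and still requires review of the underlying cases.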

The report ‘Characterisation of Databases (DBs) Used for Signal Detection (SD): Results of a Survey of IMI PROTECT Work Package (WP) 3 Participants’ (Pharmacoepidemiol Drug Saf. 2012;21(Suppl.3): abstract no. 496 pp 233) shows the heterogeneity of spontaneous databases and the lack of comparability of signal detection methods employed.

Chapters IV and V of the Report of the CIOMS Working Group VIII ‘Practical aspects of Signal detection in Pharmacovigilance’ present sources and limitations of spontaneously-reported drug-safety information and databases that support signal detection. Appendix 3 of the report provides a list of international and national spontaneous reporting system databases.

Finally, in EudraVigilance Medicines Safety Database: Publicly Accessible Data for Research and Public Health Protection (Drug Saf. 2018;41(7):665-75), the authors describe how these databases, focusing on EudraVigilance, have been made more easily accessible for external stakeholders. This has provided healthcare professionals and patients with better access to information on suspected adverse reactions, and academic institutions with opportunities for health research.

8.5. Social media

Note: Chapter 8.5. (formerly 7.5.) has not been updated for Revisions 10 and 11 of the Guide.

8.5.1. Definition

Technological advances have dramatically increased the range of data sources that can be used to complement traditional ones and may provide compelling insights into or relevant to effectiveness and safety of health interventions such as medicines and their risk minimisation measures, benefit-risk communications and related stakeholder engagement. Such data include those from digital media that exist in a computer-readable format and can be extracted from websites, web pages, blogs, vlogs, social networking sites, internet forums, chat rooms and health portals. A recent addition to the digital media data is biomedical data collected through wearable technology (e.g., heart rate, physical activity and sleep pattern, dietary patterns). These data are unsolicited and generated in real time.

Social media data are a subset of digital media data. The European Commission’s Digital Single Market Glossary defines social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content. It employs mobile and web-based technologies to create highly interactive platforms via which individuals and communities share, co-create, discuss, and modify user-generated content.”

8.5.2. Use in pharmacovigilance

Social media content analyses have been used to provide insights into patients’ perceptions of the effectiveness and safety of medicines and for the collection of patient reported outcomes, as discussed in Web-based patient-reported outcomes in drug safety and risk management: challenges and opportunities? (Drug Saf. 2012;35(6):437-46).

The IMI WEB-RADR European collaborative project explored different aspects related to the use of social media data for pharmacovigilance and summarised its recommendations in Recommendations for the Use of Social Media in Pharmacovigilance: Lessons From IMI WEB-RADR (Drug Saf 2019;42(12):1393-407). The French Vigi4Med project, which evaluated the use of social media, mainly web forums, for pharmacovigilance activities, published a set of recommendations in Use of Social Media for Pharmacovigilance Activities: Key Findings and Recommendations from the Vigi4Med Project (Drug Saf. 2020;43(9):835-51).

A further possible use of social media data would be as a source of information for signal detection or assessment. Studies including Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med. 2017;31(3): 167-74) and Assessment of the Utility of Social Media for Broad-Ranging Statistical Signal Detection in Pharmacovigilance: Results from the WEB-RADR Project (Drug Saf. 2018;41(12):1355–69) evaluated whether analysis of social media data (specifically Facebook and Twitter posts) could identify pharmacovigilance signals early, but in their respective settings, found that this was not the case.

The study Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med. 2017;31(3): 167-74) also tried to determine the quantity of posts with resemblance to adverse events and the types and characteristics of products that would benefit from social media content analysis. It concludes that, although analysis of data from social media did not identify new safety signals, it can provide unique insight into the patient perspective.

From a regulatory perspective, social media is a source of potential reports of suspected adverse reactions and marketing authorisation holders are legally obliged to screen websites under their management and assess whether reports of adverse reactions qualify for spontaneous reporting (see Good Pharmacovigilance Practice Module VI, section VI.B.1.1.4.). Principles for continuous monitoring of the safety of medicines without overburdening established pharmacovigilance systems and a regulatory framework on the use of social media in pharmacovigilance have been proposed in Establishing a Framework for the Use of Social Media in Pharmacovigilance in Europe (Drug Saf. 2019;42(8):921-30).

Sentiment analyses of social media content may in future offer regulators insight into public perceptions about the safety of medicines and the trustworthiness of regulatory bodies. This insight can inform, and help evaluate, specific safety communication strategies aimed at the effective and safe use of medicines. For example, a recent study provided insight into public sentiments about vaccination of pregnant women by stance, discourse and topic analysis of social media posts in “Vaccines for pregnant women?! Absurd” – Mapping maternal vaccination discourse and stance on social media over six months (Vaccine 2020;38(42):6627-38).

8.5.3. Challenges

While offering the promise of new research models and approaches, the rapidly evolving social media environment presents many challenges including the need for strong and systematic processes for data selection and validation, and study implementation. Articles which detail associated challenges are: Evaluating Social Media Networks in Medicines Safety Surveillance: Two Case Studies (Drug Saf. 2015;38(10):921-30) and Social media and pharmacovigilance: A review of the opportunities and challenges (Br J Clin Pharmacol. 2015;80(4):910-20).

There is currently no defined strategy or framework for meeting standards on data selection, data validity and methods of data analysis, and the regulatory acceptance of social media data may therefore be lower than for traditional sources. However, more tools and methods for analysing unstructured data are becoming available, especially for pharmacoepidemiology and pharmacovigilance research, as in Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts (J Am Med Inform Assoc. 2017 Feb 22), Social Media Listening for Routine Post-Marketing Safety Surveillance (Drug Saf. 2016;39(5):443-54) and Social Media Research (Chapter 11 in Communicating about Risks and Safe Use of Medicines, Adis Singapore, 2020, pp 307-332). Nevertheless, the recognition and disambiguation of references to medicines and adverse events in free text remain a challenge, and performance evaluations need to be critically assessed, as discussed in Prospective Evaluation of Adverse Event Recognition Systems in Twitter: Results from the Web-RADR Project (Drug Saf. 2020;43(8):797-808).
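The difficulty of recognition and disambiguation can be made concrete with a deliberately naive sketch. The mini-lexicons and the function name below are hypothetical, and simple phrase matching of this kind is not the approach of the systems evaluated in the cited studies; it is shown only because its failure modes are easy to see.

```python
import re

# Hypothetical mini-lexicons; real systems rely on far larger vocabularies.
DRUG_TERMS = {"ibuprofen", "metformin"}
EVENT_TERMS = {  # colloquial phrase -> assumed normalised label
    "headache": "Headache",
    "threw up": "Vomiting",
    "rash": "Rash",
}


def contains_phrase(phrase, text):
    """True if the phrase occurs in the text on word boundaries."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def find_mentions(post):
    """Return (drugs, events) matched in a post by simple phrase lookup."""
    text = post.lower()
    drugs = sorted(t for t in DRUG_TERMS if contains_phrase(t, text))
    events = sorted(label for phrase, label in EVENT_TERMS.items()
                    if contains_phrase(phrase, text))
    return drugs, events


# A plausible mention, and a figurative use that is matched in error.
plausible = find_mentions("Took ibuprofen and threw up all night")
figurative = find_mentions("This traffic is a headache")
```

The second post contains no medicine and uses "headache" figuratively, yet the event is still matched. Misspellings, brand names and slang would in turn be missed, which is why reported performance figures for such systems need careful scrutiny.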

8.5.4. Data protection

The EU General Data Protection Regulation (GDPR) introduces EU-wide legislation on personal data and security. It specifies that the data protection impact should be assessed at the time of study design and reviewed periodically. Other technical documents may also be applicable such as Smartphone Secure Development Guidelines (2017) published by the European Network and Information Security Agency (ENISA), which advises on design and technical solutions. The principles of these security measures are found in the European Data Protection Supervisor (EDPS) opinion on mobile health (Opinion 1/2015 Mobile Health-Reconciling technological innovation with data protection).