Chapter 5: Definition and validation of drug exposure, outcomes and covariates

Note: except for minor text edits, Chapter 5 (formerly 4.3) has not been updated for Revision 11 of the Guide, as contents remain up-to-date.

Historically, pharmacoepidemiological studies relied on patient-reported information or paper-based health records. The rapid increase in access to electronic healthcare records and large administrative databases has changed the way exposures and outcomes are defined, measured and validated. All variables in secondary data sources should be defined with care, taking into account the fact that information is often recorded for purposes other than pharmacoepidemiology. Secondary data originate mainly from five types of data sources: prescription data (e.g., UK CPRD primary care data), data on dispensing (e.g., PHARMO outpatient pharmacy database), data on payment for medication (namely claims data, e.g., IMS LifeLink PharMetrics Plus), data collected in surveys, and data from specific means of data collection (e.g., pregnancy registries, vaccine registries). Misclassification of exposure, outcome or any covariate, or incorrect categorisation of these variables, may lead to information bias (see Chapter 6).

5.1. Assessment of exposure

Exposure definitions can include simple dichotomous variables (e.g., ever vs. never exposed) or be more granular, including estimates of duration, exposure windows (e.g., current vs. past exposure) also referred to as risk periods, or dosage (e.g., current dosage, cumulative dosage over time). Consideration should be given to both the requirements of the study design and the availability of variables. Assumptions made when preparing drug exposure data for analysis have an impact on results: an unreported step in pharmacoepidemiological studies (Pharmacoepidemiol Drug Saf. 2018;27(7):781-8) demonstrates the effect of certain exposure assumptions on findings and provides a framework to report preparation of exposure data. The Methodology chapter of the book Drug Utilization Research. Methods and Applications (M. Elseviers, B. Wettermark, A.B. Almarsdottir et al. Ed. Wiley Blackwell, 2016) discusses different methods for data collection on drug utilisation.
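As an illustration of how such exposure definitions translate into data preparation, the following minimal Python sketch classifies a patient as currently, past or never exposed at an index date. The record layout (`Dispensing`), the availability of a days-supply field and the 30-day grace period are assumptions made for this example, not recommendations; each of them is exactly the kind of assumption that should be reported.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Dispensing:
    dispensed_on: date
    days_supply: int  # assumed to be recorded, or derived from quantity and dose


def exposure_status(dispensings, index_date, grace_days=30):
    """Classify exposure at index_date as 'current', 'past' or 'never'.

    A dispensing is assumed to cover days_supply days from the dispensing
    date, extended by a grace period (an analyst's assumption).
    """
    prior = [d for d in dispensings if d.dispensed_on <= index_date]
    if not prior:
        return "never"
    # First day that is no longer covered by any dispensing plus grace period
    coverage_end = max(
        d.dispensed_on + timedelta(days=d.days_supply + grace_days) for d in prior
    )
    return "current" if index_date < coverage_end else "past"


history = [Dispensing(date(2020, 1, 1), days_supply=30)]
print(exposure_status(history, date(2020, 2, 15)))  # within supply plus grace
print(exposure_status(history, date(2020, 6, 1)))   # supply and grace elapsed
```

Varying `grace_days` in a sensitivity analysis is one simple way to show how strongly the findings depend on this assumption.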

The population included in these data sources follows a process of attrition: drugs that are prescribed are not necessarily dispensed, and drugs that are dispensed are not necessarily ingested. In Primary non-adherence in general practice: a Danish register study (Eur J Clin Pharmacol 2014;70(6):757-63), 9.3% of all prescriptions for new therapies were never redeemed at the pharmacy, with different percentages per therapeutic and patient groups. The attrition from dispensing to ingestion is even more difficult to measure, as it is compounded by uncertainties about which dispensed drugs are actually taken by the patients and the patients’ ability to provide an accurate account of their intake.

5.2. Assessment of outcomes

A case definition compatible with the data source should be developed for each outcome of a study at the design stage. This description should include how events will be identified and classified as cases, whether cases will include prevalent as well as incident cases, exacerbations and second episodes (as differentiated from repeat codes) and all other inclusion or exclusion criteria. If feasible, prevalent cases should not be included. The reason for the data collection and the nature of the healthcare system that generated the data should also be described as they can impact on the quality of the available information and the presence of potential biases. Published case definitions of outcomes, such as those developed by the Brighton Collaboration in the context of vaccine studies, are useful but not necessarily compatible with the information available in observational data sources. For example, information on the onset or duration of symptoms, or clinical diagnostic procedures, may not be available.

Search criteria to identify outcomes should be defined, and the list of codes and any case-finding algorithm used should be provided. Generation of code lists requires expertise in both the coding system and the disease area. Researchers should consult clinicians who are familiar with the coding practice within the studied field. Suggested methodologies are available for some coding systems, as described in Creating medical and drug code lists to identify cases in primary care databases (Pharmacoepidemiol Drug Saf. 2009;18(8):704-7). Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models (Annu Rev Biomed Data Sci. 2018;1:53-68) reports on methods for phenotyping (finding subjects with specific conditions or outcomes), which are becoming more commonly used, particularly in multi-database studies (see Chapters 9 and 16.6). Care should be taken when re-using a code list from another study, as code lists depend on the study objective and methods. Public repositories of codes such as Clinicalcodes.org are available, and researchers are also encouraged to make their own code lists available.

In some circumstances, chart review or free text entries in electronic format linked to coded entries can be useful for outcome identification or confirmation. Such identification may involve an algorithm using multiple code lists (for example disease plus therapy codes) or an endpoint committee to adjudicate available information against a case definition. In some cases, initial plausibility checks or subsequent medical chart review will be necessary. When databases contain prescription data only, drug exposure may be used as a proxy for an outcome, or linkage to different databases is required. An accurate date of onset is particularly important for studies relying upon the timing of exposure and outcome, such as self-controlled designs (see Chapter 4.2.3).
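A case-finding algorithm that combines a disease code list with a therapy code list could be sketched as follows. The codes (`DX001`, `RX100`), the record layout and the 90-day window are hypothetical placeholders chosen for this example; real code lists require the clinical and coding expertise described above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical code lists, for illustration only
DIAGNOSIS_CODES = {"DX001", "DX002"}
THERAPY_CODES = {"RX100"}


def find_cases(records, max_gap_days=90):
    """Return {patient_id: onset_date} for patients with a diagnosis code
    and a therapy code recorded within max_gap_days of each other.

    records: iterable of (patient_id, event_date, code) tuples.
    The onset date is taken as the earliest qualifying diagnosis date.
    """
    diagnoses = defaultdict(list)
    therapies = defaultdict(list)
    for patient_id, event_date, code in records:
        if code in DIAGNOSIS_CODES:
            diagnoses[patient_id].append(event_date)
        elif code in THERAPY_CODES:
            therapies[patient_id].append(event_date)

    cases = {}
    for patient_id, dx_dates in diagnoses.items():
        rx_dates = therapies.get(patient_id, [])
        for dx_date in sorted(dx_dates):
            if any(abs((rx - dx_date).days) <= max_gap_days for rx in rx_dates):
                cases[patient_id] = dx_date
                break
    return cases


records = [
    (1, date(2020, 1, 10), "DX001"),
    (1, date(2020, 2, 1), "RX100"),   # therapy 22 days after diagnosis
    (2, date(2020, 3, 1), "DX002"),   # diagnosis without therapy
]
print(find_cases(records))
```

Requiring both codes typically trades sensitivity for specificity; cases identified this way may still need confirmation by chart review or adjudication.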

5.3. Assessment of covariates

In pharmacoepidemiological studies, covariates are used for selecting and matching study subjects, comparing characteristics of the cohorts, developing propensity scores, creating stratification variables, evaluating effect modifiers and adjusting for confounders. Reliable assessment of covariates is therefore essential for the validity of results. A given database may or may not be suitable for studying a research question depending on the availability of information on these covariates.

Some patient characteristics and covariates vary with time, and their accurate assessment is therefore time dependent. The timing of assessment of the covariates is an important factor for the correct classification of the subjects and should be clearly reported. Capturing covariates can be done at one or multiple points during the study period. In the latter scenario, the variable will be modelled as a time-dependent variable (see Chapter 4.3.3).

Assessment of covariates can be performed using different periods of time (look-back periods or run-in periods). Fixed look-back periods (for example 6 months or 1 year) can be appropriate when there are changes in coding methods or in practices or when using the entire medical history of a patient is not feasible. Estimation using all available covariates information versus a fixed look-back window for dichotomous covariates (Pharmacoepidemiol Drug Saf. 2013; 22(5):542-50) establishes that defining covariates based on all available historical data, rather than on data observed over a commonly shared fixed historical window, will result in estimates with less bias. However, this approach may not always be applicable, for example when data from paediatric and adult periods are combined, because covariates may differ significantly between paediatric and adult populations (e.g., height and weight).
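The difference between a fixed look-back window and the use of all available history can be made concrete with a short Python sketch for a dichotomous covariate. The function name and the convention that events on or after the index date are ignored are assumptions made for this example.

```python
from datetime import date


def covariate_present(event_dates, index_date, lookback_days=None):
    """Return True if the covariate was recorded before index_date.

    lookback_days=None uses all available history; otherwise only events
    recorded within lookback_days before the index date are counted.
    """
    for event_date in event_dates:
        if event_date >= index_date:
            continue  # covariates are assessed strictly before the index date
        if lookback_days is None or (index_date - event_date).days <= lookback_days:
            return True
    return False


diabetes_codes = [date(2015, 3, 1)]  # a single diagnosis recorded years earlier
index = date(2020, 1, 1)
print(covariate_present(diabetes_codes, index, lookback_days=365))  # False
print(covariate_present(diabetes_codes, index))                     # True
```

In this example a chronic condition recorded only once, several years before the index date, is missed by the one-year window but captured when all available history is used.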

5.4. Misclassification and validation

5.4.1. Misclassification

The validity of pharmacoepidemiological studies depends on the correct assessment of exposure, outcomes and confounders. Measurement errors, i.e., misclassification of binary or categorical variables or mismeasurement of continuous variables, result in information bias. The effect of misclassification in the presence of covariates (Am J Epidemiol. 1980;112(4):564–9) shows that misclassification of a confounder results in incomplete control for confounding.

Misclassification of exposure is non-differential if the assessment of exposure does not depend on the true outcome status and misclassification of outcome is non-differential if the assessment of the outcome does not depend on exposure status. Misclassification of exposure and outcome is considered dependent if the factors that predict misclassification of exposure are expected to also predict misclassification of outcome.

Misconceptions About Misclassification: Non-Differential Misclassification Does Not Always Bias Results Toward the Null (Am J Epidemiol. 2022; kwac03) emphasises that bias towards the null is not always “conservative” but might mask important safety signals, and discusses seven exceptions to the epidemiologic ‘mantra’ that non-differential misclassification biases estimates towards the null. One important exception is outcome measurement with perfect specificity, which results in unbiased estimates of the risk ratio when sensitivity is non-differential.
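The perfect-specificity exception can be checked with a short derivation. The notation is introduced here for illustration: R_1 and R_0 are the true risks in the exposed and unexposed, Se and Sp are the sensitivity and specificity of outcome ascertainment (assumed equal in both exposure groups, i.e., non-differential), and starred quantities are the observed values.

```latex
\begin{align*}
R^{*}_{e} &= Se \cdot R_{e} + (1 - Sp)(1 - R_{e}), \qquad e \in \{0, 1\} \\[4pt]
RR^{*} &= \frac{R^{*}_{1}}{R^{*}_{0}}
        = \frac{Se \cdot R_{1} + (1 - Sp)(1 - R_{1})}{Se \cdot R_{0} + (1 - Sp)(1 - R_{0})} \\[4pt]
Sp = 1 \;\Rightarrow\; RR^{*} &= \frac{Se \cdot R_{1}}{Se \cdot R_{0}} = \frac{R_{1}}{R_{0}} = RR
\end{align*}
```

With Sp = 1 the false-positive term vanishes and the sensitivity cancels, so the observed risk ratio equals the true one even though both observed risks are underestimated. The risk difference, by contrast, is scaled by Se and remains biased.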

The influence of misclassification on the point estimate should be quantified or, if this is not possible, its impact on the interpretation of the results should be discussed. FDA’s Quantitative Bias Analysis Methodology Development: Sequential Bias Adjustment for Outcome Misclassification (2017) proposes a method of adjustment when validation of the variable is complete. Use of the Positive Predictive Value to Correct for Disease Misclassification in Epidemiologic Studies (Am J Epidemiol. 1993;138(11):1007–15) proposes a method based on estimates of the positive predictive value, which requires validation of a sample of patients with the outcome only, while assuming that sensitivity is non-differential. This methodology has been implemented in a web application (Outcome misclassification: Impact, usual practice in pharmacoepidemiological database studies and an online aid to correct biased estimates of risk ratio or cumulative incidence; Pharmacoepidemiol Drug Saf. 2020;29(11):1450-5), which allows correction of risk ratio or cumulative incidence point estimates and confidence intervals for bias due to outcome misclassification. The article Basic methods for sensitivity analysis of biases (Int J Epidemiol. 1996;25(6):1107-16) provides different examples of methods for examining the sensitivity of study results to biases, with a focus on methods that can be implemented without computer programming. Good practices for quantitative bias analysis (Int J Epidemiol. 2014;43(6):1969-85) advocates explicit and quantitative assessment of misclassification bias, including guidance on which biases to assess in each situation, what level of sophistication to use, and how to present the results.
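The idea behind a correction based on the positive predictive value (PPV) can be sketched in a few lines of Python. Under the assumption of non-differential sensitivity, the unknown sensitivity cancels from the ratio, so the observed risk ratio only needs to be multiplied by the ratio of the PPVs in the exposed and unexposed. This is a simplified point-estimate illustration with hypothetical function and parameter names; it is not the cited web application and does not propagate the uncertainty of the PPV estimates.

```python
def ppv_corrected_risk_ratio(
    cases_exposed, n_exposed, cases_unexposed, n_unexposed,
    ppv_exposed, ppv_unexposed,
):
    """Correct an observed risk ratio for outcome misclassification.

    True cases in each group = observed cases * PPV / sensitivity.
    With the same sensitivity in both groups (an assumption), sensitivity
    cancels and the corrected RR is RR_observed * PPV_exposed / PPV_unexposed.
    """
    observed_rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    return observed_rr * ppv_exposed / ppv_unexposed


# Hypothetical numbers: observed RR of 1.5, higher PPV among the exposed
corrected = ppv_corrected_risk_ratio(60, 1000, 40, 1000, 0.90, 0.75)
print(round(corrected, 2))  # 1.8
```

If the PPV is the same in both exposure groups, the correction factor equals 1 and the observed risk ratio is unchanged, which is why stratifying the validation sample by exposure matters.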

5.4.2. Validation

Common misconceptions about validation studies (Int J Epidemiol. 2020;49(4): 1392-6) discusses important aspects of the design of validation studies. It stresses the importance of stratification on key variables (e.g., exposure in outcome validation) and shows that by sampling conditionally on the imperfectly classified measure (e.g., case as identified by the study algorithm), only the positive and negative predictive values can be validly estimated.
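When a validation sample is drawn conditionally on the algorithm's classification, the predictive values and their uncertainty can be estimated as in the following sketch. The counts are hypothetical, and the Wilson score interval is one common choice for a binomial proportion rather than a method prescribed by the cited article.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def predictive_values(tp, fp, tn, fn):
    """PPV and NPV with Wilson intervals from a validation sample.

    tp/fp: algorithm-positive subjects confirmed / not confirmed on review.
    tn/fn: algorithm-negative subjects confirmed non-cases / found to be cases.
    """
    return {
        "ppv": (tp / (tp + fp), wilson_interval(tp, tp + fp)),
        "npv": (tn / (tn + fn), wilson_interval(tn, tn + fn)),
    }


# Hypothetical validation sample: 100 algorithm-positives, 100 algorithm-negatives
result = predictive_values(tp=90, fp=10, tn=95, fn=5)
print(result["ppv"][0], result["npv"][0])  # 0.9 0.95
```

Because the numbers of algorithm-positive and algorithm-negative subjects are fixed by the sampling design, sensitivity and specificity cannot be read directly from these counts; in outcome validation the computation would be repeated within each exposure stratum.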

Most database studies will be subject to outcome misclassification to some degree, although case adjudication against an established case definition or a reference standard can remove false positives, while false negatives can be mitigated if a broad search algorithm is used. Validity of diagnostic coding within the General Practice Research Database: a systematic review (Br J Gen Pract. 2010;60:e128-36), the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 6th Edition, Wiley, 2012) and Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned (Pharmacoepidemiol Drug Saf. 2012;supp1:82-9) provide examples of validation. External validation against chart review or physician/patient questionnaire is possible in some instances, but the questionnaires cannot always be considered a ‘gold standard’. Misclassification of exposure should also be measured based on validation, as feasible.

Linkage validation can be used when another database is used for the validation through linkage methods (see Using linked electronic data to validate algorithms for health outcomes in administrative databases, J Comp Eff Res. 2015;4:359-66). In some situations, there is no access to a resource to provide data for comparison. In this case, indirect validation may be an option, as explained in the textbook Applying quantitative bias analysis to epidemiologic data (Lash T, Fox MP, Fink AK. Springer-Verlag, New York, 2009).

Structural validation of the database with internal logic checks should also be performed to verify the completeness and accuracy of variables. For example, one can investigate whether an outcome was followed by (or preceded by) appropriate exposure or procedures, or whether a certain variable has values within a known reasonable range.
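Internal logic checks of this kind are straightforward to automate. The sketch below flags chronologically impossible events and out-of-range values in a single record; the field names (`birth_date`, `death_date`, `event_date`, `weight_kg`) and the plausible weight range are assumptions made for this example and would need to be adapted to the database at hand.

```python
from datetime import date

# Assumed plausible range for body weight in kg; adjust to the study population
PLAUSIBLE_WEIGHT_KG = (0.3, 400.0)


def structural_checks(record):
    """Return a list of problems found in one record (a dict)."""
    problems = []
    birth = record.get("birth_date")
    death = record.get("death_date")
    event = record.get("event_date")
    weight = record.get("weight_kg")

    if birth and event and event < birth:
        problems.append("event before birth")
    if death and event and event > death:
        problems.append("event after death")
    if weight is not None:
        low, high = PLAUSIBLE_WEIGHT_KG
        if not low <= weight <= high:
            problems.append("weight out of range")
    return problems


suspect = {
    "birth_date": date(2000, 5, 1),
    "event_date": date(1990, 1, 1),
    "weight_kg": 900,
}
print(structural_checks(suspect))
# ['event before birth', 'weight out of range']
```

Tabulating the frequency of each flagged problem across the whole database gives a quick indication of which variables need closer inspection before analysis.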

While the positive predictive value is more easily measured than the negative predictive value, a low specificity is more damaging than a low sensitivity when considering bias in relative risk estimates (see A review of uses of health care utilization databases for epidemiologic research on therapeutics; J Clin Epidemiol. 2005;58(4):323-37).
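A small numerical illustration shows why imperfect specificity tends to be more damaging than imperfect sensitivity for a rare outcome. The sketch uses the standard relation between true and observed risk under non-differential outcome misclassification; the risks, sensitivity and specificity values are hypothetical.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Expected observed risk: true positives plus false positives."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)


def observed_risk_ratio(risk_exposed, risk_unexposed, sensitivity, specificity):
    """Expected observed risk ratio when sensitivity and specificity are
    the same in exposed and unexposed (non-differential misclassification)."""
    return (
        observed_risk(risk_exposed, sensitivity, specificity)
        / observed_risk(risk_unexposed, sensitivity, specificity)
    )


# Hypothetical true risks of 2% and 1%, i.e. a true risk ratio of 2
low_sensitivity = observed_risk_ratio(0.02, 0.01, sensitivity=0.50, specificity=1.00)
low_specificity = observed_risk_ratio(0.02, 0.01, sensitivity=1.00, specificity=0.98)
print(round(low_sensitivity, 2))  # 2.0
print(round(low_specificity, 2))  # 1.33
```

Missing half of all true cases leaves the risk ratio intact, whereas a 2% false-positive rate, which looks small, adds more false positives than there are true cases in the unexposed group and pulls the estimate substantially towards the null.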

For databases routinely used in research, documented validation of key variables may have been done previously by the data provider or other researchers. Any extrapolation of a previous validation study should, however, consider the effect of any differences in prevalence and inclusion and exclusion criteria, the distribution and analysis of risk factors, as well as subsequent changes to health care, procedures and coding, as illustrated in Basic methods for sensitivity analysis of biases (Int J Epidemiol. 1996;25(6):1107-16).