The STROBE reporting guideline for writing up observational studies in epidemiology

Erik von Elm; Douglas G. Altman; Matthias Egger; Stuart J. Pocock; Peter C. Gøtzsche; Jan P. Vandenbroucke

doi:10.1234/equator/1010101

12c. Statistical methods – missing data

What to write

Explain how missing data were addressed.

Explanation

Missing data are common in observational research. Questionnaires posted to study participants are not always filled in completely, participants may not attend all follow-up visits, and routine data sources and clinical databases are often incomplete. Despite its ubiquity and importance, few papers report in detail on the problem of missing data^1,2. Investigators may use any of several approaches to address missing data. We describe some strengths and limitations of various approaches in the section “Missing data: problems and possible solutions”. We advise that authors report the number of missing values for each variable of interest (exposures, outcomes, confounders) and for each step in the analysis. Authors should give reasons for missing values if possible, and indicate how many individuals were excluded because of missing data when describing the flow of participants through the study (see also item 13). For analyses that account for missing data, authors should describe the nature of the analysis (e.g., multiple imputation) and the assumptions that were made (e.g., missing at random; see “Missing data: problems and possible solutions”).
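As an illustration of the reporting advice above, the following minimal sketch tabulates the number of missing values per variable and the number of individuals a complete-case analysis would exclude. The records, variable names, and helper functions are hypothetical and exist only for this example; `None` is assumed to mark a missing value.

```python
# Hypothetical study records; None marks a missing value.
records = [
    {"id": 1, "smoking": "never",   "bmi": 24.1, "outcome": 0},
    {"id": 2, "smoking": None,      "bmi": 27.5, "outcome": 1},
    {"id": 3, "smoking": "current", "bmi": None, "outcome": 0},
    {"id": 4, "smoking": "former",  "bmi": 22.8, "outcome": None},
    {"id": 5, "smoking": None,      "bmi": None, "outcome": 1},
]
variables = ["smoking", "bmi", "outcome"]


def missing_counts(rows, names):
    """Return the number of missing (None) values for each variable."""
    return {name: sum(1 for row in rows if row[name] is None) for name in names}


def complete_cases(rows, names):
    """Keep only rows with an observed value for every listed variable."""
    return [row for row in rows if all(row[name] is not None for name in names)]


counts = missing_counts(records, variables)
kept = complete_cases(records, variables)
excluded = len(records) - len(kept)

for name in variables:
    print(f"{name}: {counts[name]} of {len(records)} missing")
print(f"excluded from complete-case analysis: {excluded} of {len(records)}")
```

In a real report the same counts would be given for each step of the analysis, since different models may require different sets of variables.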

Missing data: problems and possible solutions

A common approach to dealing with missing data is to restrict analyses to individuals with complete data on all variables required for a particular analysis. Although such ‘complete-case’ analyses are unbiased in many circumstances, they can be biased and are always inefficient³. Bias arises if individuals with missing data are not typical of the whole sample. Inefficiency arises because of the reduced sample size for analysis.

Using the last observation carried forward for repeated measures can distort trends over time if persons who experience a foreshadowing of the outcome selectively drop out⁴. Inserting a missing category indicator for a confounder may increase residual confounding². Imputation, in which each missing value is replaced with an assumed or estimated value, may lead to attenuation or exaggeration of the association of interest and, without the use of the more sophisticated methods described below, may produce standard errors that are too small.
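The distortion introduced by carrying the last observation forward can be seen in a small constructed example. The sketch below assumes a hypothetical weight-loss study in which two of four participants begin to regain weight and drop out after the second visit; all numbers are invented for illustration.

```python
def locf(series):
    """Replace each None with the most recent observed value
    (last observation carried forward)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def visit_means(rows):
    """Mean across participants at each visit."""
    return [sum(column) / len(column) for column in zip(*rows)]


# Hypothetical true weight change (kg) at visits 0 to 3.
true_change = [
    [0.0, -2.0, -4.0, -6.0],  # keeps losing weight, stays in the study
    [0.0, -2.0, -4.0, -6.0],
    [0.0, -2.0, -1.0, 0.0],   # regains weight, drops out after visit 1
    [0.0, -2.0, -1.0, 0.0],
]

# What the investigators observe: dropouts have no later measurements.
observed = [
    [0.0, -2.0, -4.0, -6.0],
    [0.0, -2.0, -4.0, -6.0],
    [0.0, -2.0, None, None],
    [0.0, -2.0, None, None],
]

true_means = visit_means(true_change)
locf_means = visit_means([locf(series) for series in observed])

print("true mean change by visit:", true_means)
print("LOCF mean change by visit:", locf_means)
```

Because the dropouts are frozen at their best value, the carried-forward means suggest continued weight loss at the later visits that is larger than the true average.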

Rubin developed a typology of missing data problems, based on a model for the probability of an observation being missing^3,5. Data are described as missing completely at random (MCAR) if the probability that a particular observation is missing does not depend on the value of any observable variable(s). Data are missing at random (MAR) if, given the observed data, the probability that observations are missing is independent of the actual values of the missing data. For example, suppose younger children are more prone to missing spirometry measurements, but that the probability of missing is unrelated to the true unobserved lung function, after accounting for age. Then the missing lung function measurement would be MAR in models including age. Data are missing not at random (MNAR) if the probability of missing still depends on the missing value even after taking the available data into account. When data are MNAR, valid inferences require explicit assumptions about the mechanisms that led to missing data.
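The three mechanisms can be contrasted with a small simulation loosely modelled on the spirometry example. Everything in this sketch is an assumption made for illustration: the age range, the linear age trend in `expected_fev1`, the noise level, and the missingness probabilities are hypothetical, not values from any study.

```python
import random
from statistics import mean

rng = random.Random(12345)  # fixed seed so the sketch is repeatable


def expected_fev1(age):
    """Hypothetical age trend for lung function (litres)."""
    return 0.5 + 0.2 * age


# Complete data: (age, true FEV1) for 20,000 hypothetical children aged 6 to 15.
children = []
for _ in range(20000):
    age = rng.randint(6, 15)
    children.append((age, expected_fev1(age) + rng.gauss(0, 0.3)))


def apply_missingness(data, p_missing):
    """Set FEV1 to None with probability p_missing(age, fev1)."""
    return [(age, None if rng.random() < p_missing(age, fev1) else fev1)
            for age, fev1 in data]


def crude_mean(data):
    """Complete-case mean: ignore records with missing FEV1."""
    return mean(fev1 for _, fev1 in data if fev1 is not None)


def age_standardised_mean(data):
    """Average observed age-specific means over the full age distribution."""
    by_age = {}
    for age, fev1 in data:
        by_age.setdefault(age, []).append(fev1)
    total = 0.0
    for values in by_age.values():
        observed = [v for v in values if v is not None]
        total += len(values) / len(data) * mean(observed)
    return total


mechanisms = {
    # MCAR: missingness unrelated to anything.
    "MCAR": lambda age, fev1: 0.3,
    # MAR: depends only on observed age.
    "MAR": lambda age, fev1: 0.6 if age < 10 else 0.1,
    # MNAR: depends on the unobserved value itself, even given age.
    "MNAR": lambda age, fev1: 0.8 if fev1 < expected_fev1(age) else 0.1,
}

true_mean = crude_mean(children)
results = {}
for name, rule in mechanisms.items():
    incomplete = apply_missingness(children, rule)
    results[name] = (crude_mean(incomplete), age_standardised_mean(incomplete))
    print(f"{name}: crude {results[name][0]:.3f}, "
          f"age-adjusted {results[name][1]:.3f}, truth {true_mean:.3f}")
```

Under these assumptions the crude complete-case mean is close to the truth for MCAR, is too high for MAR until age is taken into account, and remains too high for MNAR even after adjusting for age.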

Methods to deal with data missing at random (MAR) fall into three broad classes^3,6: likelihood-based approaches⁷, weighted estimation⁸ and multiple imputation^6,9. Of these three approaches, multiple imputation is the most commonly used and flexible, particularly when multiple variables have missing values¹⁰. Results using any of these approaches should be compared with those from complete-case analyses, and important differences discussed. The plausibility of assumptions made in missing data analyses is generally unverifiable. In particular, it is impossible to prove that data are MAR, rather than MNAR. Such analyses are therefore best viewed in the spirit of sensitivity analysis (see items 12e and 17).
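To make one of these classes concrete, the sketch below shows a very simple form of weighted estimation and compares it with the complete-case result, as recommended above. It assumes that missingness depends only on an observed age group (MAR) and estimates each group's response probability by its observed fraction; in practice that probability would usually come from a regression model. The data and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (age_group, outcome) pairs; None marks a missing outcome.
data = [
    ("young", 1.0), ("young", None), ("young", None), ("young", 3.0),
    ("old", 4.0), ("old", 5.0), ("old", 6.0), ("old", None),
]

# Count total and observed records per age group.
totals = defaultdict(int)
observed = defaultdict(int)
for group, value in data:
    totals[group] += 1
    if value is not None:
        observed[group] += 1

# Inverse probability weight = 1 / (estimated probability of being observed).
weights = {group: totals[group] / observed[group] for group in totals}

complete = [(group, value) for group, value in data if value is not None]

# Complete-case mean treats every observed record equally.
complete_case_mean = sum(value for _, value in complete) / len(complete)

# Weighted mean up-weights groups in which more outcomes are missing.
ipw_mean = (sum(weights[group] * value for group, value in complete)
            / sum(weights[group] for group, _ in complete))

print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"inverse-probability-weighted mean: {ipw_mean:.2f}")
```

The two estimates differ because the younger group, which has lower outcome values here, is under-represented among complete cases; a difference of this kind is what authors are asked to report and discuss.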

Examples

“Our missing data analysis procedures used missing at random (MAR) assumptions. We used the MICE (multivariate imputation by chained equations) method of multiple multivariate imputation in STATA. We independently analysed 10 copies of the data, each with missing values suitably imputed, in the multivariate logistic regression analyses. We averaged estimates of the variables to give a single mean estimate and adjusted standard errors according to Rubin’s rules”¹¹.
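The pooling step mentioned in this example, Rubin's rules, can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `pool_rubin` and the ten estimates and standard errors are hypothetical stand-ins for the results of fitting the same model to ten imputed data sets.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances (squared
    standard errors) using Rubin's rules; return (estimate, standard error)."""
    m = len(estimates)
    pooled = mean(estimates)          # average of the m estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log odds ratios from 10 imputed data sets, each with SE 0.2.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48, 0.50, 0.56, 0.44, 0.51, 0.49]
variances = [0.2 ** 2] * 10

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled_estimate:.3f}")
print(f"pooled standard error: {pooled_se:.4f}")
```

The pooled standard error is larger than the average per-imputation standard error because the between-imputation term carries the extra uncertainty due to the missing values.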

Training

The UK EQUATOR Centre runs training on how to write using reporting guidelines.

Discuss this item

Visit this item’s discussion page to ask questions and give feedback.

References

1. Tooth L. Quality of reporting of observational longitudinal research. American Journal of Epidemiology. 2005;161(3):280-288. doi:10.1093/aje/kwi042

2. Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. American Journal of Epidemiology. 1991;134(8):895-907. doi:10.1093/oxfordjournals.aje.a116164

3. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley; 2002. doi:10.1002/9781119013563

4. Ware JH. Interpreting incomplete data in studies of diet and weight loss. New England Journal of Medicine. 2003;348(21):2136-2137. doi:10.1056/nejme030054

5. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592. doi:10.1093/biomet/63.3.581

6. Schafer JL. Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC; 1997. doi:10.1201/9781439821862

7. Lipsitz SR, Ibrahim JG, Chen MH, Peterson H. Non-ignorable missing covariates in generalized linear models. Statistics in Medicine. 1999;18(17–18):2435-2448. doi:10.1002/(sici)1097-0258(19990915/30)18:17/18<2435::aid-sim267>3.0.co;2-b

8. Rotnitzky A, Robins J. Analysis of semi-parametric regression models with non-ignorable non-response. Statistics in Medicine. 1997;16(1):81-102. doi:10.1002/(sici)1097-0258(19970115)16:1<81::aid-sim473>3.0.co;2-0

9. Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley; 1987. doi:10.1002/9780470316696

10. Barnard J, Meng XL. Applications of multiple imputation in medical studies: From AIDS to NHANES. Statistical Methods in Medical Research. 1999;8(1):17-36. doi:10.1177/096228029900800103

11. Chandola T, Brunner E, Marmot M. Chronic stress at work and the metabolic syndrome: Prospective study. BMJ. 2006;332(7540):521-525. doi:10.1136/bmj.38693.435301.80

Citation

For attribution, please cite this work as:

von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The STROBE reporting guideline for writing up observational studies in epidemiology. The EQUATOR Network guideline dissemination platform. doi:10.1234/equator/1010101
