In a researcher’s perfect world, everyone asked to participate in a study would say yes, no one would drop out along the way, and all items on a questionnaire would be answered.
Alas, researchers don’t operate in a perfect world. The information they collect often has holes, and they come up against a common challenge in their pursuit of answers to research questions: missing data.
Data can be missing for a number of reasons. Study participants may not answer a certain survey question because they don’t see how the question applies to them. This would be the case if a survey asked, “In what year did you get married?” and the respondent is single. Or study participants may simply refuse to answer. This sometimes occurs in response to the question, “What is your income?”
Data are also called “missing” when people decline to take part in a study or drop out along the way. Take, for example, a study looking at return to work among injured workers. Some injured workers may not take part because they don’t want to “make waves,” or may later withdraw from a study because their situation has changed (e.g. they feel better or worse, or have competing demands on their time).
Some studies are based not on information collected by the researcher, but on administrative information collected by a public agency. These data, too, can be incomplete. At the Institute for Work & Health, for example, many studies rely on claims data from the Workplace Safety and Insurance Board. A claim file could be missing information — say, on marital status or employment start date — that is relevant to a study.
Does missing data matter?
Missing data may or may not be a problem. Most important is whether the data are missing at random. If there is a pattern to the missing information, then drawing wrong conclusions is much more likely. For example, the results of a workers’ compensation study could be skewed if those who refuse to take part largely come from a vulnerable group like recent immigrants, or if most of those who drop out do so because they have recovered. In other words, information that is consistently missing from one group is a problem.
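This bias can be made concrete with a small simulation. The sketch below uses entirely hypothetical numbers (recovery times drawn from a bell curve) to show how losing records at random leaves an estimate roughly intact, while systematically losing recovered workers inflates it:

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical data: recovery times (in weeks) for 10,000 injured workers.
times = [random.gauss(12, 4) for _ in range(10_000)]

# Missing at random: every worker has the same 30% chance of dropping out.
mcar = [t for t in times if random.random() > 0.3]

# Not missing at random: workers who recover quickly (under 10 weeks)
# are far more likely to drop out of the study.
mnar = [t for t in times if not (t < 10 and random.random() < 0.7)]

print(f"true mean recovery time:     {mean(times):.1f} weeks")
print(f"after random dropout:        {mean(mcar):.1f} weeks")   # close to true
print(f"after recovered-only dropout:{mean(mnar):.1f} weeks")   # overstated
```

Random dropout shrinks the sample but not the answer; patterned dropout shifts the answer itself, which is why the pattern matters more than the amount.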
The impact of missing data also depends in part on the research question. If a study is looking at the relationship between health and socioeconomic status — as measured by income — then missing income information could be an issue. This is especially the case if people refuse to provide the information because of what their incomes are (e.g. on the high and low ends of the scale). If a study is looking at the relationship between health and marital status, then missing income information may not be important.
How much information is missing is also a factor. If two per cent is missing, then sound conclusions are likely still possible. The same can’t be said if 20 per cent is missing.
What can be done about missing data?
Researchers don’t necessarily call it quits when information is missing. They deal with the problem in a number of ways. Indeed, whole books have been written about missing data in the field of statistical analysis.
Researchers might simply discard any record (e.g. questionnaire or claim file) that is missing information. Or they might “fill in” the missing data using what are called imputation, weighting or model-based procedures. These procedures are complicated. Each has its place, and none is perfect. Therefore, researchers need to be very clear in the “limitations” section of their studies about what information is missing and how that may affect results.
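The two simplest options above — discarding incomplete records and filling in the holes — can be sketched on a toy survey. The records and incomes below are invented for illustration; real imputation is usually more sophisticated than a simple average:

```python
from statistics import mean

# Hypothetical survey records; None marks a respondent who declined
# to report income.
records = [
    {"id": 1, "age": 34, "income": 52_000},
    {"id": 2, "age": 45, "income": None},   # declined to answer
    {"id": 3, "age": 29, "income": 38_000},
    {"id": 4, "age": 51, "income": 75_000},
    {"id": 5, "age": 38, "income": None},   # declined to answer
]

# Option 1: discard any record with a hole (complete-case analysis).
complete = [r for r in records if r["income"] is not None]

# Option 2: mean imputation -- fill each hole with the average of the
# observed incomes. Simple, but it understates variability and is
# biased if the data are not missing at random.
fill = mean(r["income"] for r in complete)
imputed = [
    {**r, "income": r["income"] if r["income"] is not None else fill}
    for r in records
]

print(len(complete))  # 3 records survive the discard
print(fill)           # 55000: the value used to fill the holes
```

Discarding keeps only three of five records, while imputation keeps all five at the cost of inventing two values — the trade-off that makes a clear “limitations” section essential either way.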
Source: At Work, Issue 56, Spring 2009: Institute for Work & Health, Toronto