Validity and reliability are important concepts in research. The everyday use of these terms provides a sense of what they mean (for example, your opinion is valid; your friends are reliable). In research, however, their use is more complex.
Suppose you hear about a new study showing depression levels among workers declined during an economic downturn. You learn that this study used a new questionnaire to ask workers about their mental health over a number of years. You decide to take a closer look at the strength of this new questionnaire. Was it valid? Was it reliable?
To assess the validity and reliability of a survey or other measure, researchers need to consider several distinct types of each.
Ensuring the validity of measurement
At the outset, researchers need to consider the face validity of a questionnaire. That is, to a layperson, does it look like it will measure what it is intended to measure? In our example, would the people administering and taking the questionnaire think it a valid measure of depression? Do the questions and range of response options seem, on their face, appropriate for measuring depression?
Researchers also need to consider the content validity of the questionnaire; that is, do its questions cover the full range of what it is intended to measure? Researchers often rely on subject-matter experts to help determine this. In our case, the researchers could turn to experts in depression to check their questions against the known symptoms of depression (e.g., depressed mood, sleeping problems and weight changes).
When a questionnaire measures something abstract, researchers also need to establish its construct validity. This refers to the questionnaire’s ability to measure the abstract concept adequately. In this case, the researchers could have given a questionnaire on a similar construct, such as anxiety, to see if the two sets of results were related, as one would expect. Or they could have given a questionnaire on a contrasting construct, such as happiness, to see if the results moved in the opposite direction.
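As a rough illustration of the comparison described above, the sketch below correlates depression scores with scores on a related and a contrasting questionnaire. The scores are made up for six hypothetical workers, and the Pearson correlation is one common choice of statistic, not necessarily the one any particular study would use.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up total scores for six workers (illustration only).
depression = [4, 9, 15, 7, 20, 12]
anxiety = [5, 8, 14, 9, 18, 11]      # similar construct
happiness = [22, 17, 9, 19, 5, 13]   # contrasting construct

r_similar = pearson_r(depression, anxiety)
r_contrasting = pearson_r(depression, happiness)

print(f"depression vs anxiety:   r = {r_similar:.2f}")
print(f"depression vs happiness: r = {r_contrasting:.2f}")
```

A clearly positive correlation with anxiety and a clearly negative one with happiness would be consistent with the pattern one would expect if the questionnaire captures depression.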
It may sometimes be appropriate for researchers to establish criterion validity; that is, the extent to which the measurement tool is able to produce accurate findings when compared to a “gold standard.” In this case, the gold standard would be clinical diagnoses of depression. The researchers could see how their questionnaire results relate to actual clinical diagnoses of depression among the workers surveyed.
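One way to make this comparison concrete is to count how often the questionnaire agrees with the clinical diagnosis. The sketch below uses made-up scores and diagnoses for eight hypothetical workers, and the cutoff of 10 is an assumption chosen purely for illustration. It reports sensitivity (the share of diagnosed workers the questionnaire flags) and specificity (the share of undiagnosed workers it does not flag).

```python
# Made-up questionnaire scores and clinical diagnoses (True = diagnosed).
scores = [4, 9, 15, 7, 20, 12, 3, 11]
diagnosed = [False, False, True, False, True, True, False, False]

CUTOFF = 10  # hypothetical threshold: scores at or above this are flagged

flagged = [score >= CUTOFF for score in scores]
pairs = list(zip(flagged, diagnosed))

true_pos = sum(1 for f, d in pairs if f and d)
false_neg = sum(1 for f, d in pairs if not f and d)
true_neg = sum(1 for f, d in pairs if not f and not d)
false_pos = sum(1 for f, d in pairs if f and not d)

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)

print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
```

With these invented numbers the questionnaire flags every diagnosed worker but also flags one worker who was not diagnosed.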
Ensuring the reliability of measurement
Researchers also need to consider the reliability of a questionnaire. Will they get similar results if they repeat the questionnaire soon afterward, while conditions have not changed? In our case, if the questionnaire were administered to the same workers a second time shortly after the first, the researchers would expect to find similar levels of depression. If the levels haven’t changed, the “repeatability” of the questionnaire would be high. This is called test-retest reliability.
Another aspect of reliability concerns internal consistency among the questions. Do similar questions give rise to similar answers? In our example, if two questions are related to amount of sleep, the researchers would expect the responses to be consistent.
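The column does not name a statistic for internal consistency; one commonly used option is Cronbach's alpha, sketched below under that assumption. The responses are made up: five hypothetical respondents answering three related items on a 0 to 3 scale.

```python
from statistics import variance

# Made-up responses: one row per respondent, one column per item.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [0, 1, 0],
    [3, 2, 3],
]

def cronbach_alpha(rows):
    """Cronbach's alpha using sample variances of items and of total scores."""
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # regroup answers by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha rises toward 1 as respondents answer the related items more consistently; what counts as an acceptable value depends on the field and the purpose of the measure.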
Researchers also look at inter-rater reliability; that is, would different individuals assessing the same thing score the questionnaire the same way? For example, if two different clinicians administer the depression questionnaire to the same patient, would the scores they give be relatively similar?
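When raters assign categories rather than numeric scores, agreement between two raters is often summarized with Cohen's kappa, which adjusts raw agreement for the agreement expected by chance. The column does not specify a method, so the choice of kappa here is an assumption, and the ratings below are made up for eight hypothetical patients.

```python
from collections import Counter

# Made-up severity ratings from two clinicians for the same eight patients.
clinician_a = ["none", "mild", "mild", "moderate", "severe", "mild", "none", "moderate"]
clinician_b = ["none", "mild", "moderate", "moderate", "severe", "mild", "none", "mild"]

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning categories to the same cases."""
    n = len(ratings_a)
    # Proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

kappa = cohen_kappa(clinician_a, clinician_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance.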
If our depression researchers were sloppy in ensuring the validity or reliability of their questionnaire, it could undermine the credibility of their study’s overall results. Although you can never prove reliability or validity conclusively, results will be more accurate if the measures in a study are as reliable and valid as possible.
Source: At Work, Issue 84, Spring 2016: Institute for Work & Health, Toronto [This column updates a previous column describing the same term, originally published in 2007.]