Making the Most of Your Survey Items: Item Analysis
By Louis Rocconi, Ph.D.
Hi, blog world! My name is Louis Rocconi, and I am an Associate Professor and Program Coordinator in the Evaluation, Statistics, and Methodology program at The University of Tennessee, and I am MAD about item analysis. In this blog post, I want to discuss an often overlooked tool to examine and improve survey items: Item Analysis.
What is Item Analysis?
Item analysis is a set of techniques used to evaluate the quality and usefulness of test or survey items. While item analysis techniques are frequently used in test construction, they are helpful when designing surveys as well. Item analysis focuses on individual items, whereas scale-level statistics such as Cronbach’s alpha summarize the entire set of items. Item analysis techniques can be used to identify how individuals respond to items and how well items discriminate between those with high and low scores. Item analysis can be used during pilot testing to help choose the best items for inclusion in the final set. While there are many methods for conducting item analysis, this post will focus on two: item difficulty/endorsability and item discrimination.
Item Difficulty/Endorsability
Item difficulty, or item endorsability, is simply the mean, or average, response (Meyer, 2014). For test items that have a “correct” response, we use the term item difficulty, which refers to the proportion of individuals who answered the item correctly. However, when using surveys with Likert-type response options (e.g., strongly disagree, disagree, agree, strongly agree), where there is no “correct” answer, we can think of the item mean as item endorsability, or the extent to which the highest response option is endorsed. We often divide the mean, or average response, by the maximum possible response to put endorsability on roughly the same scale as difficulty (i.e., with an upper bound of 1; the lower bound is 0 only if the lowest response option is coded 0).
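To make the calculation concrete, here is a minimal Python sketch of endorsability as the item mean divided by the maximum possible response. The response data, the 1-to-4 coding, and the function name are all made up for illustration; they are assumptions, not taken from any real survey.

```python
from statistics import mean

# Hypothetical pilot data: each row is one respondent, each column one item,
# coded 1 (strongly disagree) to 4 (strongly agree).
responses = [
    [4, 2, 1],
    [3, 2, 1],
    [4, 3, 2],
    [4, 1, 1],
    [3, 2, 1],
]
MAX_RESPONSE = 4  # assumed highest response code


def endorsability(rows, item, max_response):
    """Return the item mean divided by the maximum possible response."""
    return mean(row[item] for row in rows) / max_response


values = [endorsability(responses, i, MAX_RESPONSE) for i in range(3)]
for number, value in enumerate(values, start=1):
    print(f"Item {number}: endorsability = {value:.2f}")
```

With these made-up responses, the first item is endorsed by nearly everyone, the second sits in the middle of the scale, and the third is rarely endorsed.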
A high difficulty value (i.e., close to 1) indicates an item that is too easy, while a low difficulty value (i.e., close to 0) suggests an overly difficult item or an item that few respondents endorse. Typically, we are looking for difficulty values between 0.3 and 0.7. Allen and Yen (1979) argue this range maximizes the information a test provides about differences among respondents. While Allen and Yen were referring to test items, surveys with Likert-type response options generally follow the same recommendations. An item with low endorsability indicates that people are having a difficult time endorsing the item or selecting higher response options such as strongly agree, whereas an item with high endorsability indicates that the item is easy to endorse. Very high or very low values for difficulty/endorsability may indicate that we need to review the item. Examining the proportion of respondents who chose each response option is also useful because it shows how frequently each response category was used. If a response category is not used or is selected by only a few respondents, this may indicate that the item is ambiguous or confusing.
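The two checks described above, the 0.3 to 0.7 guideline and the response option proportions, could be sketched in Python as follows. The item responses and function names are hypothetical, and the cutoffs are simply the guideline values from this post passed in as adjustable defaults.

```python
from collections import Counter


def option_proportions(item_responses, options):
    """Proportion of respondents choosing each response option."""
    counts = Counter(item_responses)
    n = len(item_responses)
    return {option: counts.get(option, 0) / n for option in options}


def flag_endorsability(value, low=0.3, high=0.7):
    """Label an endorsability value against the 0.3 to 0.7 guideline."""
    if value < low:
        return "hard to endorse: review item"
    if value > high:
        return "easy to endorse: review item"
    return "within the suggested range"


# Hypothetical responses to one item, coded 1 to 4.
item = [4, 3, 4, 4, 3, 4, 4, 4, 3, 4]

props = option_proportions(item, options=(1, 2, 3, 4))
unused = [option for option, p in props.items() if p == 0]
status = flag_endorsability(sum(item) / len(item) / 4)

print(props)
print("Unused options:", unused)
print("Status:", status)
```

In this made-up example nobody selected the two lowest options and the item is flagged as easy to endorse, which is the kind of pattern that would prompt a closer look at the wording.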
Item Discrimination
Item discrimination is a measure of the relationship between scores on an item and the overall score on the construct the survey is measuring (Meyer, 2014). It measures the degree to which an item differentiates individuals who score high on the survey from those who score low on the survey. It also shows whether an item is positively or negatively correlated with the total score. We can think of item discrimination as how well an item is tapping into the latent construct. Discrimination is typically measured using an item-total correlation to assess the relationship between an item and the overall score. Pearson’s correlation and its variants (e.g., the point-biserial correlation) are the most common, but other types of correlations, such as biserial and polychoric correlations, can be used.
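As a rough illustration, an item-total correlation using Pearson’s correlation could be computed in Python like this. The data are invented, the total score here is assumed to be the simple sum of all items (including the item being examined), and the third item is deliberately left un-reversed to show what a negative value looks like.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either variable has no variance.
    return sxy / sqrt(sxx * syy)


def item_total_correlations(rows):
    """Correlate each item with the total score (sum of all items)."""
    totals = [sum(row) for row in rows]
    n_items = len(rows[0])
    return [pearson([row[i] for row in rows], totals) for i in range(n_items)]


# Hypothetical data: rows are respondents, columns are items coded 1 to 4.
# Item 3 is negatively worded and has NOT been reverse coded.
responses = [
    [1, 2, 4],
    [2, 2, 3],
    [3, 3, 3],
    [3, 4, 2],
    [4, 4, 1],
]

discrimination = item_total_correlations(responses)
for number, r in enumerate(discrimination, start=1):
    print(f"Item {number}: item-total correlation = {r:.2f}")
```

Because each item contributes to its own total in this sketch, the correlations are somewhat inflated, especially with few items. A common alternative is to remove the item from the total before correlating.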
Meyer (2014) suggests selecting items with positive discrimination values between 0.3 and 0.7 and items that have large variances. When the item-total correlation exceeds 0.7, it suggests the item may be redundant. A content analysis or expert review panel could be used to help decide which items to keep. A negative discrimination for an item suggests that the item is negatively related to the total score. This may suggest a data entry error, a poorly written item, or that the item needs to be reverse coded. Whatever the case, negative discrimination is a flag to let you know to inspect that item. Items with low discrimination tap into the construct poorly and should be revised or eliminated. Very easy or very difficult items can also cause low discrimination, so it is good to check whether that is a reason as well. Examining discrimination coefficients for each response option is also helpful. We typically want to see a pattern where lower response options (e.g., strongly disagree, disagree) have negative discrimination coefficients and higher response options (e.g., agree, strongly agree) have positive discrimination coefficients, with the magnitude of the coefficients highest at the ends of the response scale. We would look for the opposite pattern if the item is negatively worded.
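One way to sketch these checks in Python is shown below. The option-level coefficient is computed here as the correlation between a 0/1 indicator for choosing an option and the total score, which is one reasonable reading of the description above rather than a definitive implementation. The item responses, total scores, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify(r, low=0.3, high=0.7):
    """Label an item-total correlation using the 0.3 to 0.7 guideline."""
    if r < 0:
        return "negative: check data entry, wording, reverse coding"
    if r < low:
        return "low: revise or eliminate"
    if r > high:
        return "high: possibly redundant"
    return "acceptable"


def option_discrimination(item_responses, totals, options):
    """Correlate a chose-this-option indicator (0/1) with the total score."""
    result = {}
    for option in options:
        indicator = [1 if value == option else 0 for value in item_responses]
        result[option] = pearson(indicator, totals)
    return result


# Hypothetical item responses (coded 1 to 4) and total scores per respondent.
item = [1, 1, 2, 2, 3, 3, 4, 4]
totals = [8, 10, 12, 13, 15, 16, 19, 20]

by_option = option_discrimination(item, totals, options=(1, 2, 3, 4))
for option, r in by_option.items():
    print(f"Option {option}: discrimination = {r:.2f}")
print(classify(pearson(item, totals)))
```

In this made-up example the two lower options have negative coefficients, the two higher options have positive ones, and the largest magnitudes fall at the ends of the scale. Every option must be chosen by at least one respondent, otherwise its indicator has no variance and the correlation is undefined.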
Conclusion
Item difficulty/endorsability and item discrimination are two easy techniques researchers can use to help improve the quality of their survey items. These techniques can easily be implemented alongside other statistics, such as internal consistency reliability.
___________________________________________________________________
References
Allen, M. & Yen, W. (1979). Introduction to measurement theory. Wadsworth.
Meyer, J. P. (2014). Applied measurement with jMetrik. Routledge.
Resources
I have created some R code and output to demonstrate how to implement and interpret an item analysis.
The Standards for Educational and Psychological Testing