Measuring the User Experience

Annotated Bibliography of Rating Scale Literature

From the 2009 Mini-UPA Conference

by Joe Dumas and Tom Tullis
Originally posted May 25, 2009

Albert, W., and Dixon, E. (2003). Is This What You Expected? The Use of Expectation Measures in Usability Testing. Proceedings of Usability Professionals Association 2003 Conference, Scottsdale, AZ, June 2003.

Introduced the concept of collecting ratings both before and after a task.

Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation of the System Usability Scale. International Journal of Human-Computer Interaction, 24(6), 574-594.

Thorough analysis of SUS.

Brooke, J. (1996). SUS: A Quick and Dirty Usability Scale. In: P.W. Jordan, B. Thomas, B.A. Weerdmeester & I.L. McClelland (Eds.), Usability Evaluation in Industry. London: Taylor & Francis, 189-194.

The key article on the System Usability Scale. Also available at http://www.usabilitynet.org/trump/documents/Suschapt.doc
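
For readers who have not scored the SUS before, here is a minimal sketch (in Python) of the standard scoring procedure Brooke describes: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0-100 score. The function name and example responses below are illustrative only.

    def sus_score(responses):
        """Compute a System Usability Scale score from ten item responses.

        responses: ten integers from 1 (strongly disagree) to 5 (strongly
        agree), in the order the SUS items are presented. Odd-numbered
        items are positively worded; even-numbered items are negatively
        worded.
        """
        if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
            raise ValueError("SUS requires ten responses on a 1-5 scale")
        total = sum((r - 1) if i % 2 == 1 else (5 - r)
                    for i, r in enumerate(responses, start=1))
        return total * 2.5  # final score ranges from 0 to 100

    # Example: a fairly positive set of responses scores 80.0
    print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))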

Chen, C., Lee, S., & Stevenson, H. W. (1995). Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science, 6, 170-175.

Japanese and Chinese students were more likely to choose the midpoint of scales than North American students.

Chin, J. P., Diehl, V. A., & Norman, K. (1988). Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of ACM CHI 1988 (Washington, DC), pp. 213-218.

QUIS post-test questionnaire - requires a $750 license - www.lap.umd.edu/QUIS/index.html

Cox, E. P. (1980). The optimal number of response alternatives for a scale: A review. Journal of Marketing Research, 17, 407-422.

Recommends 5-9 levels in a rating scale.

Dumas, J. (1998). Usability Testing Methods: Subjective Measures: Part II - Measuring Attitudes and Opinions. Common Ground: The Newsletter of the Usability Professionals' Association, October issue, 4-8.

Rules for constructing rating scales.

Hassenzahl, M., Beu, A. & Burmester, M. (2001). Engineering joy. IEEE Software, 70-76.

Presents scales for measuring affective reactions to products.

Hornbæk, K., & Law, E. (2007). Meta-analysis of correlations among usability measures. In Proceedings of CHI 2007 (pp. 617-626). San Jose, CA: ACM.

A study of the relationships between measures such as time, errors, ratings, etc. Limited to data from published studies, not industry tests.

Kirakowski, J., & Corbett, M. (1993). SUMI: The Software Usability Measurement Inventory. British Journal of Educational Technology, 24, 210-212.

One of the original studies of SUMI - http://sumi.ucc.ie/ - license fee of 500 Euros.

Lewis, J. R. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23(1), 78-81.

One of the first usability rating questionnaires: a three-item questionnaire with reliability and validity data.

Lewis, J. R. (2002). Psychometric evaluation of the PSSUQ using data from five years of usability studies. International Journal of Human-Computer Interaction, 14, 463-488.

A study of the Post-Study System Usability Questionnaire (PSSUQ), a 19-item post-test set of ratings.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.

The original article on the Likert scale.

McGee, M. (2004). Master Usability Scaling: Magnitude Estimation and Master Scaling Applied to Usability Measurement. Proceedings of CHI 2004, ACM Press, 335-342.

Using magnitude estimation to rate the "usability" of products. In this study, the users' concept of usability is determined by having users rate adjectives.

Ostrom, T. M., & Gannon, K. M. (1996). Exemplar generation: Assessing how respondents give meaning to rating scales. In Schwarz, N. & Sudman, S. (Eds.), Answering questions: Methodology for determining cognitive and communicative processes in survey research. San Francisco: Jossey-Bass, 293-318.

Explores the bipolar nature of rating scales.

Presser, S. & Schuman, H. (1980). The measurement of a middle position in attitude surveys. Public Opinion Quarterly, 70-85.

Study shows the value of having a middle position in a rating scale.

Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1-15.

"Taken together, the results reported above suggest that rating scales with 7, 9, or 10 response categories are generally to be preferred."

Rich, A. & McGee, M. (2004). Expected Usability Magnitude Estimation. Proceedings Of The Human Factors And Ergonomics Society 48th Annual Meeting, 912-916.

An empirical study of before and after task ratings using magnitude estimation.

Sauro, J. & Dumas, J. (2009). Comparison of Three One-Question, Post-Task Usability Questionnaires. Proceedings of CHI 2009, 1599-1608.

A study of three post-task rating scales - Likert, SMEQ, and magnitude estimation. Likert and SMEQ were best.

Sauro, J. & Lewis, J. R. (2009). Correlations among Prototypical Usability Metrics: Evidence for the Construct of Usability. Proceedings of CHI 2009, 1609-1618.

An analysis of data from 90 industry usability tests. Argues for a construct of usability based on the correlations between measures.

Schwarz, N., Knauper, B., Hippler, H., Noelle-Neumann, E., & Clark, L. (1991). Rating scales: Numeric values may change the meaning of scale labels. Public Opinion Quarterly, 55, 570-582.

Study showing why you should avoid a -3 to +3 numeric scale.

Teague, R., DeJesus, K., & Nunes-Ueno, M. (2001). Concurrent vs. Post-Task Usability Test Ratings. Proceedings of CHI 2001, 289-290.

Study of ratings given during and after a task.

Tedesco, D. & Tullis, T. (2006). A Comparison of Methods for Eliciting Post-Task Subjective Ratings in Usability Testing. Proceedings of the Usability Professionals Association (UPA) 2006 Conference, 1-9.

A study of several post-task rating scales including 4 versions of Likert scales. Recommends one of them as the most reliable and sensitive scale to use.

Tullis, T. & Albert, W. (2008). Measuring the User Experience. Morgan Kaufmann.

See Chapter 6 - Self-reported Metrics - also www.measuringux.com

Tullis, T. & Stetson, J. (2004). A Comparison of Questionnaires for Assessing Website Usability. Proceedings of the Usability Professionals Association (UPA) 2004 Conference, 7-11.

A study of several popular post-test questionnaires. SUS comes out best.

Zijlstra, F. R. H., & van Doorn, L. (1985). The Construction of a Scale to Measure Subjective Effort. Technical Report, Delft University of Technology, Department of Philosophy and Social Sciences.

Early study of the Subjective Mental Effort Questionnaire (SMEQ), a single-item difficulty rating scale.


Comments? Contact Tom@MeasuringUX.com.
