An Empirical Comparison of On-Screen Keyboards
Thomas Tullis, Ellen Mangan, and Richard Rosenbaum
Human Factors and Ergonomics Society 51st Annual Meeting, Baltimore, MD, October 1-5, 2007
Eight different designs for an on-screen keyboard were compared. Some of the factors manipulated included QWERTY vs. alphabetic layout, key-caps appearance vs. none, and mouseover feedback or not. We found that a design featuring two complete QWERTY keyboards (upper and lower case) yielded the best performance, but the designs that looked the most like a standard keyboard were preferred.
On-screen keyboards might be used in a variety of situations where it is not possible, practical, or even wise to use a normal keyboard. Perhaps the most widely known situation is on a PDA that does not have a keyboard. Tablet PCs are another example. But even with desktop PCs there are cases where an on-screen keyboard might be more appropriate. One example is someone who has difficulty using a normal keyboard due to a disability or injury. A less widely known situation is one where the security of the PC in use might be compromised, and the user needs to log in to a remote system to check email or perform other transactions. This might particularly be true when using a public PC.
So-called "keyboard sniffers" record the keystrokes typed on a PC for future analysis. These can be small hardware devices that are installed between the keyboard and CPU or they can be "spyware" software that records keystrokes in the background. In either case, the user may not be aware of the existence of the keyboard sniffer. By analyzing the log of keystrokes from a keyboard sniffer, someone might be able to detect repeated patterns which represent a user ID and password being used to access a remote system. One way to thwart a keyboard sniffer is to use an on-screen keyboard for the entry of very sensitive items, such as a password for entry to a remote system. In this case, the sensitive information is actually entered via clicks at certain coordinates on the screen rather than via the keyboard, so the keyboard sniffer has nothing to record.
There have been a few previous studies investigating certain types of on-screen keyboards (e.g., MacKenzie & Zhang, 1999; Sears et al., 2001; Zhai et al., 2001), but none have compared a broad range of on-screen keyboards to each other.
Methodology of the Study
The goals of the study were to assess user performance as well as subjective preference for a variety of on-screen keyboards in the entry of a password. We conducted it as an online study on our company's Intranet in order to get a large number of participants. We let the participants choose the password they would use for the study from a set that we provided. This was done to reduce the variability in the passwords used and to prevent the use of extremely simple (and non-secure) passwords. The passwords that we provided all met standard criteria for highly secure passwords, meaning that they contained all of the following types of characters:
Specifically, all of the passwords we provided contained two randomly selected English words with one upper-case letter each, two digits, and one or two special characters. The following passwords were provided for the participants to choose from:
The On-screen Keyboards
As shown in Figures 1 through 8, we created eight prototypes of on-screen keyboards that differed in layout, visual appearance, ordering of the characters, and visual feedback. Two of the on-screen keyboards (Conditions 6 and 7) included a random feature. It was included for the possible security advantage of having the mouse clicks occur in slightly different locations on each login.
Figure 1: Condition 1: Alphabetic layout, no mouseover feedback, no key-caps.
Figure 2: Condition 2: Larger QWERTY layout, no mouseover feedback, key-caps.
Figure 3: Condition 3: Same as Condition 1 but with mouseover highlighting of the characters.
Figure 4: Condition 4: Two smaller QWERTY keyboards, mouseover highlighting, no key-caps.
Figure 5: Condition 5: Same as Condition 2 but with no key-caps.
Figure 6: Condition 6: Same as Condition 2 but with random shuffling of the top row of keys.
Figure 7: Condition 7: Same as Condition 8 but with a split keyboard where the two halves separate a random amount from each other.
Figure 8: Condition 8: Smaller QWERTY layout, no mouseover feedback, key-caps.
Figure 9: Control condition used in study.
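The two random features (Conditions 6 and 7) can be illustrated with a minimal Python sketch. This is not the code used in the study: the contents of the top row, the pixel range for the gap, and the function names are all assumptions made for illustration.

```python
import random

# Hypothetical top row; the study does not list the exact keys it contained.
TOP_ROW = list("1234567890")


def shuffled_top_row(rng):
    """Return a randomly ordered copy of the top row (the Condition 6 idea)."""
    row = TOP_ROW[:]
    rng.shuffle(row)
    return row


def split_gap(rng, min_gap=0, max_gap=80):
    """Pick a random gap, in pixels, between the two keyboard halves
    (the Condition 7 idea). The 0-80 pixel range is an assumption."""
    return rng.randint(min_gap, max_gap)


rng = random.Random()
print(shuffled_top_row(rng))  # key order changes from login to login
print(split_gap(rng))         # separation changes from login to login
```

In both cases the click coordinates for a given character vary between logins, which is the property the random feature was meant to provide.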
Each participant was randomly assigned to a condition set that consisted of the 8 on-screen keyboards randomly ordered so that any practice or fatigue effects would be distributed across the conditions. This was a within-subjects design, with each participant "logging in" four times with each on-screen keyboard condition.
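One way to produce such a randomized presentation order is sketched below in Python. It assumes a fresh random permutation is drawn for each participant, whereas the study may have used predefined condition sets. The function names are hypothetical, and the Control condition is not included.

```python
import random

CONDITIONS = list(range(1, 9))  # the eight on-screen keyboard conditions
LOGINS_PER_CONDITION = 4        # each keyboard was used for four logins


def condition_order(rng):
    """Return the eight conditions in a random order for one participant."""
    order = CONDITIONS[:]
    rng.shuffle(order)
    return order


def trial_schedule(rng):
    """Return (condition, attempt) pairs: four consecutive logins with
    each keyboard, with the keyboards presented in random order."""
    return [
        (condition, attempt)
        for condition in condition_order(rng)
        for attempt in range(1, LOGINS_PER_CONDITION + 1)
    ]


schedule = trial_schedule(random.Random())
print(len(schedule))  # 32 on-screen keyboard logins per participant
```

Because each participant sees a different order, any practice or fatigue effect is spread across the conditions rather than concentrated on whichever keyboard happens to come first or last.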
Their first task was to select the password they would use for the rest of the study from a drop-down list displaying the six "strong security" passwords listed earlier. They then had to type that password. The program checked to make sure that the user truly had typed the password correctly, including letter case. If not, they were given error feedback and not allowed to proceed until they typed the password correctly. This was to ensure that the user understood the proper way to enter the password. Their chosen password was displayed to them on all subsequent pages below the area where they had to enter it. After this initial check, no feedback was given to the users about the accuracy of their entries (either in the control condition or the on-screen keyboard conditions).
Next, users were taken to the control condition, for normal entry of their chosen password four times via the keyboard. Users then attempted to log in four times with each on-screen keyboard, and after each condition they were taken to a rating page. Here they were asked to rate how easy they thought that on-screen keyboard was to use on a 5-point scale.
For each condition we automatically recorded how long it took for each login attempt, the actual password entered on each attempt, and whether it was correct. After each condition we recorded the user-provided rating of the ease/difficulty of using that on-screen keyboard.
Since data from the participants were recorded at various points during their completion of the study, we have data from different numbers of participants at these various points, as follows:
In analyzing the data, we adopted the following rules for excluding some data from the analyses:
Figure 10 shows the mean time that each login attempt took the participants for each of the on-screen keyboards as well as the Control condition.
Figure 10. Mean time per login attempt. Error bars represent the 95% confidence interval.
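For readers who want to reproduce this kind of summary, the following Python sketch computes a mean and a 95% confidence interval from a list of login times. It uses a normal approximation, which is an assumption: the paper does not state how its intervals were computed, and a t-based interval would be wider for small samples. The sample values are invented for illustration and are not study data.

```python
import math
import statistics


def mean_ci95(times):
    """Return (mean, lower, upper) for a 95% confidence interval on the
    mean, using a normal approximation to the sampling distribution."""
    n = len(times)
    mean = statistics.fmean(times)
    sem = statistics.stdev(times) / math.sqrt(n)   # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.975)     # about 1.96
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Invented login times in seconds, for illustration only.
example_times = [18.0, 20.0, 22.0, 19.0, 21.0]
mean, lower, upper = mean_ci95(example_times)
print(f"mean={mean:.1f}s, 95% CI=({lower:.2f}, {upper:.2f})")
```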
As expected, users were significantly faster entering the password via the normal keyboard (Control condition) than any of the on-screen keyboards. In fact, they were about twice as fast with the normal keyboard. All of the on-screen keyboards yielded somewhat similar times, with means ranging from 18 to 22 seconds. The two on-screen keyboards with a random feature (Conditions 6 and 7) took slightly longer than any of the others. Conditions 2, 4, and 8 yielded slightly shorter times than some of the other on-screen keyboards.
Figure 11 shows the average time for each condition across the four login attempts. All of the on-screen keyboard conditions showed a significant practice effect, with the greatest improvement in time coming between the first and second logins. The Control condition showed almost no practice effect.
Figure 11. Time per login attempt for each trial and each condition.
Figure 12 shows the error rate for each of the conditions. These data represent the percentage of login attempts in which the users entered a password that did not exactly match their chosen password, including letter-case errors, substitution errors, etc. It appears that case errors were the most common type (e.g., "Bagel_OCean38" instead of "Bagel_Ocean38").
Figure 12. Error rates in password entry for the conditions. Error bars represent the 95% confidence interval for the mean.
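The error measure described above (any entry that does not exactly match the chosen password) and the distinction between case errors and other errors can be sketched in Python as follows. The function names and category labels are hypothetical and are not taken from the study's software.

```python
def classify_entry(entered, target):
    """Classify one login attempt against the chosen password."""
    if entered == target:
        return "correct"
    if entered.lower() == target.lower():
        return "case_error"   # right characters, wrong letter case
    return "other_error"      # substitution, omission, insertion, etc.


def error_rate(entries, target):
    """Percentage of attempts that do not exactly match the target."""
    wrong = sum(1 for entered in entries if entered != target)
    return 100.0 * wrong / len(entries)


target = "Bagel_Ocean38"
attempts = ["Bagel_Ocean38", "Bagel_OCean38", "Bagel_Ocean38", "Bagel_Ocean83"]
print([classify_entry(a, target) for a in attempts])
print(error_rate(attempts, target))
```

Here the second attempt is the case error quoted in the text, and the fourth is an invented transposition used only to show the "other" category.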
Overall, the error rates were relatively high. This could have been due to several factors:
The error rates for all of the on-screen keyboards were significantly higher than for normal entry via the keyboard (the Control condition). However, the error rate for one on-screen keyboard, Condition 4, was significantly lower than the error rates for all of the other on-screen keyboards.
There were two subjective ratings given to each on-screen keyboard: one immediately after using it and another at the end of the study after they had used all of the keyboards. The mean ratings given immediately after using each keyboard are shown in Figure 13.
Figure 13. Mean ratings of ease of use, on a 5-point scale, given immediately after using each on-screen keyboard. Higher values are better. Error bars represent the 95% confidence interval for the mean.
Conditions 2 and 8, which are the two normal on-screen QWERTY keyboards (larger and smaller), received significantly better ratings than all other on-screen keyboards except Condition 4. Conditions 6 and 7, which are the two with a random feature, received significantly worse ratings than most of the other on-screen keyboards.
As often happens in usability studies, the performance data and subjective ratings show quite different patterns. Since the time data for the various on-screen keyboards did not show differences of any practical significance, we have focused more on the error data as the primary performance measure. It is clear from the error data that Condition 4 came out significantly better than all of the other on-screen keyboards. This condition used two complete QWERTY keyboards, one for upper case and one for lower case, stacked vertically. Key-cap appearances were not used, but visual highlighting on mouseover was used. The preference data, on the other hand, show clearly that the users preferred the two on-screen keyboards that look the most like a traditional keyboard (Conditions 2 and 8). This is perhaps due to familiarity. Interestingly, though, Condition 4 (which yielded the best performance data) received the next highest ratings (after Conditions 2 and 8).
Overall, we believe Condition 4 is the best alternative. We also believe that the following general conclusions can be drawn from this study:
MacKenzie, I.S., and S.X. Zhang (1999). The design and evaluation of a high-performance soft keyboard. Proceedings of CHI'99: ACM Conference on Human Factors in Computing Systems, p. 25-31.
Sears, A., J.A. Jacko, J. Chu, and F. Moro (2001). The role of visual search in the design of effective soft keyboards. Behaviour and Information Technology, 20(3), p. 159-166.
Zhai, S., B.A. Smith, and M. Hunter (2001). Performance optimization of virtual keyboards. Human-Computer Interaction.