Shore '00: Student HCI Online Research Experiments

University of Maryland

A Comparison of Voice Controlled and Mouse Controlled Web Browsing

Introduction

Information contained on the World Wide Web is inaccessible to many people. The web is primarily a visual medium that requires a keyboard and mouse to navigate, and this disenfranchises several types of users. People who lack the motor skills to use a keyboard and mouse find navigation troublesome. Visually impaired users cannot read the display. People who do not have access to an Internet-capable computer have difficulty even accessing the World Wide Web, and those who temporarily cannot use a traditional web browser (for example, because their eyes or hands are occupied or because they are not near their computer) are at a minimum inconvenienced.

Speech recognition and generation technologies offer a potential solution to these problems by augmenting the capabilities of a web browser. A voice browser is a web browser with at least one of the following capabilities:

  • Can render web pages in an audio format (speech generation)
  • Can interpret spoken input for navigation (speech recognition)

A number of voice browsers are on the market, and more are under development. Conversational Computing's Conversa is a web browser that accepts speech input but renders pages in the traditional visual manner [18]. The Home Page Reader, from IBM, renders web pages in audio format but accepts commands only via the keyboard's number pad [20]. PipeBeach is a system that offers both audio rendering of web pages and speech input. LIASON, from Siemens, Inc., is a system designed for use while driving an automobile [25]. Systems specifically designed to accommodate telephone-based browsing include Lucent's PhoneBrowser, Siemens' DICE, and 1-800-Hypertext [1,4,25]. Other systems are application-specific. VADAR, from BBN, allows users to track shipments over the World Wide Web, while Talk'n'Travel, also from BBN, is an interface for commercial-travel websites that allows users to access flight and train schedules [20]. The GALAXY project at MIT is a system that will access the web to find information in response to a user's queries [21].

The handicapped stand to gain much from such products. A web browser that can render web pages in audio format will be of obvious use to the blind, and navigating by voice obviates the need for keyboard and mouse navigation. Additionally, people whose eyes and hands are otherwise engaged may still be able to conveniently access the web. For example, someone will be able to get directions via the web while driving their car.

Voice browsers open up new possibilities for bringing the content of the web to a larger segment of the population. A voice browser potentially makes the telephone capable of Internet access. Since the number of households with telephones is far greater than the number of households with Internet-capable computers, it stands to reason that the number of people with Internet access will increase greatly once voice browsers become widely available. Moreover, telephones are far less expensive than computers, so voice browsing will help open up the World Wide Web to low-income users. Also, sales of wireless telephones are flourishing; voice browsing may enable the owners of such phones to browse the web wirelessly from virtually any location.

There are several challenges facing voice browsers. First, a web page rendered with voice output is inherently a temporal medium. In a visually presented web page, many different images, tables and the like can be presented on the screen at the same time, in a spatial format that is quickly and effectively processed by the human perceptual system. Spoken text, however, can only be presented one word at a time. While some research has gone into using multiple, simultaneous, non-speech sounds, reading of screen contents can only occur in a sequential, linear fashion.

Second, formulating speech commands and processing speech output consume the user's short-term and working memory and conflict with tasks such as planning and problem solving that depend on the same forms of memory. Visual information is processed in a separate system, permitting parallel operations. A study by Karl et al. noted that subjects had more difficulty memorizing symbols when commands to manipulate those symbols were issued by voice than when commands were issued via the keyboard or mouse [11].

Third, there is the inevitable recognition error involved in speech recognition systems. Recognition error refers to the situation in which the user speaks one word but the system chooses another as the best match. After nearly 30 years of research in the area of natural language recognition, the best systems remain relatively unsophisticated. A recent system boasts a recognition rate of 93% with a vocabulary of 1000 words, and even this requires background lexical and syntactic knowledge [8]. While users tend to view recognition error as a sign of immature technology, some researchers believe that recognition error is inevitable [3].
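To give a rough sense of how a recognition rate such as the 93% figure cited above could compound over longer utterances, the following minimal sketch computes the chance that every word of a command is recognized. It assumes, purely for illustration, that the rate applies per word and that errors on different words are independent; neither assumption is taken from [8], and the function name is hypothetical.

```python
def whole_command_accuracy(per_word_accuracy: float, words: int) -> float:
    """Probability that all words of a command are recognized correctly.

    Illustrative assumption (not from the cited work): each word is
    recognized independently with the same per-word accuracy.
    """
    if not 0.0 <= per_word_accuracy <= 1.0:
        raise ValueError("per_word_accuracy must be between 0 and 1")
    if words < 1:
        raise ValueError("a command must contain at least one word")
    return per_word_accuracy ** words


if __name__ == "__main__":
    # Compare a short command with progressively longer utterances.
    for n in (1, 2, 5, 10):
        print(n, round(whole_command_accuracy(0.93, n), 3))
```

Under these assumptions a two-word command would be recognized in full about 86% of the time and a five-word utterance about 70% of the time, which suggests why the length of a spoken command matters.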

Research has not proven the effectiveness of speech recognition as a general-purpose input mode. A study by Van Buskirk and LaLomia had subjects complete tasks involving navigation in a graphical user interface (GUI). In half the tasks subjects used spoken commands and in the other half subjects used a keyboard and mouse. They found that voice navigation took approximately twice as long as traditional navigation [23]. Earlier studies produced inconsistent and conflicting results [6,12,24].

Speech input can be useful in certain situations. Research into multimodal interfaces indicates a distinct advantage to using speech as an input mechanism. The aforementioned study by Karl et al. showed that using speech to issue commands to a word processing application, while using the keyboard for text entry and the mouse for direct manipulation, significantly reduced task time. Similarly, Mignot et al. showed that the addition of spoken commands to direct manipulation (via a touchpad) greatly improved the task performance times of their subjects [13].

Two common threads run through both of these experiments. First, in both tasks, the number of commands that could be issued via spoken input was relatively small. Second, the users spoke very short sentences. For example, in the Karl study, speech input was used only for commands such as "File Open" or "Save". Even the Van Buskirk and LaLomia study, which demonstrated a significant performance decrease associated with voice navigation, noted that "the best tasks for speech input were tasks in which the user has to issue brief commands using a small vocabulary".

There is some theoretical and experimental justification for this. A study by Poock [17], cited in the Karl paper, demonstrated a clear advantage for issuing commands by voice over issuing commands via keyboard. Oviatt [15] showed that the number of disfluencies made by the user is proportional to both the length of spoken commands and the lack of structure in the input format. A speech disfluency is any type of unnatural disruption in normal speech, such as a repetition, filled pause (e.g. "umm"), self-correction, or false start. Oviatt claims that long sentences lead to more complicated plans for formulating input and that these more complicated plans are more prone to errors. Also, if the input grammar is unstructured, users have more options in formulating their input, which leads to more disfluencies.

Oviatt's focus on the types of errors and the ease with which they can be corrected reflects the current trend in speech input research. This research suggests that error detection and correction is the crucial factor in determining task completion times. Danis and Karat found that when using speech-recognition systems, the types of errors users commit are fundamentally different from errors committed with other input styles [3]. This tends to confuse users who are not accustomed to recognizing and correcting such errors. A study by Karat et al. noted that "when subjects made errors in keyboard-mouse text entry, they tended to correct the error within a few words of having made it. In contrast, some subjects made specific mention of not being as aware of when a misrecognition had occurred and needing to 'go back to' a proofreading stage for the speech tasks" [10]. This same study noted that subjects made almost four times as many errors using speech recognition for transcription tasks as they did when using keyboard and mouse. A follow-up study [7] investigated user strategies for correcting errors. It identified two common strategies: spiral depths, where users re-dictate misinterpreted words, and cascades, where misrecognition (frequently of commands) caused additional errors, which needed to be corrected before the original error could be dealt with. Similar effects are noted in [16].

The characteristics of web navigation may be advantageous to voice-controlled navigation. A small number of commands can provide the navigation functionality common to visual browsers, and current technology is effective for small vocabularies. Web navigation commands are typically short, such as "go back", "follow link", "refresh", and "read next frame." Short commands such as these have a very high degree of structure; in fact, there is almost no grammar to speak of, as each command maps to exactly one combination of sounds. (Some voice browsers, notably IBM's Home Page Reader and telephone browsers, avoid spoken input altogether by using the number pad for input.)

There are potential limitations, though, that may reduce the utility of voice browsing. Although there is a small set of commands, users must typically speak the text of the link (i.e., the text that a mouse user would click) to follow the link. A web page author can use virtually any string of characters to represent a link, which creates a potentially unlimited universe of valid voice commands with very little structure. Not all link text consists of valid English words; such links must be spelled out letter by letter for the speech recognition system to recognize them properly. The error rates associated with such a large, complicated, unstructured command set may be quite high.

The central question we sought to answer is whether navigating the World Wide Web by voice is a viable alternative to traditional mouse-based navigation. Would it produce results similar to those found in the Van Buskirk and LaLomia study (slower) or those found by Karl et al. (faster)? Based on the literature and our experience, we hypothesized that speech navigation would be noticeably slower than navigating with the mouse, but not quite twice as slow.

Our experiment also attempted to determine whether numbered links are more helpful than text links as navigational aids. We hypothesized that, due to the simplicity of the spoken commands when using the numbered links, navigation with voice and numbered links would be faster and less error-prone than navigation with voice and text links. Finally, we anticipated that users would appreciate the voice control capability because of its flexibility and novelty, and that this would be reflected in higher subjective satisfaction ratings for the voice methods.

We limited our experiment to three common web navigation patterns: the hierarchical menu, the linear slide show, and a two-dimensional panning map (no zooming). We used a single voice browser product, Conversa, which renders pages visually and supports voice as well as the traditional "point and click" technique with a mouse. The user typically traverses links by speaking the text of the hyperlink (i.e., the text that a traditional user could click with a mouse). Image maps, links containing text that is not English, and links in densely packed regions of many links are assigned a number (sequentially in a top-to-bottom, left-to-right manner), and the user speaks that number to follow the link.
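The link-addressing scheme described above can be sketched roughly as follows. This is an illustration of the general idea only, not Conversa's actual implementation: the Link class, its fields, the speakable flag (standing in for "text is English and the region is not crowded"), and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Link:
    """A hyperlink as laid out on the page (hypothetical model)."""
    href: str
    text: Optional[str]      # visible link text; None for an image-map region
    x: int                   # left edge, in pixels
    y: int                   # top edge, in pixels
    speakable: bool = True   # False if the text is not English or the area is crowded


def build_voice_commands(links: List[Link]) -> Dict[str, str]:
    """Map each spoken command to the href it should follow.

    Links that cannot be spoken by their text are numbered in reading
    order: top-to-bottom first, then left-to-right. All other links are
    addressed by their lower-cased text.
    """
    commands: Dict[str, str] = {}
    numbered = [link for link in links if link.text is None or not link.speakable]
    reading_order = sorted(numbered, key=lambda link: (link.y, link.x))
    for number, link in enumerate(reading_order, start=1):
        commands[str(number)] = link.href
    for link in links:
        if link.text is not None and link.speakable:
            commands[link.text.lower()] = link.href
    return commands
```

With such a table, a recognized utterance such as "2" or "home" becomes a single dictionary lookup, which keeps the numbered commands short and highly structured.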

Conversa does not require or support user-specific speech recognition training. It provides a limited set of preferences to customize the tool. For speech recognition, the user can adjust for the speaker's voice pitch (male, female or child) and speech recognition precision (from lenient to strict in 5 increments). It is positioned as a mass-market product for use by both experts and novices.



Department of Computer Science: Direct questions and comments to the student editorial team