UNIVERSAL USABILITY IN PRACTICE




TELEPHONE BASED ACCESS TO THE WEB – SPEECH RECOGNITION

Irina Ceaparu (irina@cs.umd.edu)
Department of Computer Science
University of Maryland, College Park, MD 20742 USA
April 2001

 

       Introduction

       The WWW provides a huge world of information to which as many people as possible should have access. This includes people who cannot use, or do not have access to, a networked computer.

       Recent advances in wireless communication and speech recognition have made it possible to access the web from any place, at any time, by using only a phone. Some possible applications are browsing the web, getting stock quotes, verifying flight schedules, getting maps and directions, checking email, etc.

       Connecting to the Internet

       A voice user agent must perform audio rendering and must also provide user input mechanisms that control hyperlink selection, form field entry, and form submission.  Perhaps the simplest known terminal device that supports audio browsing is the common analog telephone. Telephones have, in fact, long supported automated information and transaction systems, known in the telecommunications industry as Interactive Voice Response (IVR) systems. 
       There is more than one way to connect to the Internet using a mobile phone.
       The most common technology used is WAP (Wireless Application Protocol) with its WML (Wireless Markup Language). WAP is used to deliver text and limited graphics to a small phone display.
       But phones can also offer voice access to the web. In 1999, AT&T, IBM, Lucent and Motorola founded the VoiceXML Forum to promote speech technologies that make the Internet accessible by voice. The VoiceXML Forum aims to drive the market for voice-enabled Internet services through the creation of a common specification based on existing Internet standards. VoiceXML, a markup language for voice applications based on eXtensible Markup Language (XML), is expected to revolutionize the Internet industry by providing voice access to Web content and services.
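
       As a rough illustration of what a VoiceXML document looks like, the sketch below builds a small spoken menu with Python's standard XML library. The element names (vxml, menu, prompt, choice) come from VoiceXML; the prompt wording, the function name build_voice_menu and the example.com URLs are assumptions made up for this example, not part of any real service.

```python
import xml.etree.ElementTree as ET


def build_voice_menu(prompt_text, choices):
    """Return a minimal VoiceXML menu document as a string.

    `choices` maps a phrase the caller may say to the document that
    should be fetched next (the URLs here are hypothetical).
    """
    vxml = ET.Element("vxml", version="1.0")
    menu = ET.SubElement(vxml, "menu")

    # The prompt is what the voice browser speaks to the caller.
    prompt = ET.SubElement(menu, "prompt")
    prompt.text = prompt_text

    # Each choice pairs a spoken phrase with the next document to load.
    for phrase, url in choices.items():
        choice = ET.SubElement(menu, "choice", next=url)
        choice.text = phrase

    return ET.tostring(vxml, encoding="unicode")


document = build_voice_menu(
    "Say stocks, flights, or email.",
    {
        "stocks": "http://example.com/stocks.vxml",
        "flights": "http://example.com/flights.vxml",
        "email": "http://example.com/email.vxml",
    },
)
print(document)
```

       A voice browser that loads such a document would speak the prompt, listen for one of the phrases, and then fetch the matching document, much as a visual browser follows a hyperlink.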

       Advantages of phone (voice) access to the web:

  • Easy to use: Unlike a computer interface, a voice interface needs no keyboard, no mouse, no screen, freeing users from these barriers to access and action. It requires no training. It is accessible to anyone with a telephone.

  • Access from anywhere: Voice is mobile—information can be sent and retrieved from anywhere. Since customers can have access at anytime from anywhere, voice makes it possible to use time more effectively. Fast and efficient, voice frees users from not only the desktop, but even the laptop.

      Disadvantages of WAP phone access to the web:

  • Small screens: For web phones, there's an incredibly small viewing area; palmtops are barely better.

  • Speed of access: All devices have slow access.

  • Limited or fragmented availability: Wireless web access is sporadic in many areas and entirely unavailable in other areas.

  • Awkward input: Palm's Graffiti, touchtone pads, or even tiny QWERTY keyboards are awkward for any amount of writing, even a short email.

  • Price: Many technology limitations are being addressed by higher-end devices and services. But the entry price for a good wireless web palmtop with decent display, keyboard, and speed is easily $700 to $900, not including monthly access.

  • Lack of user habit: It takes some patience and overcoming the learning curve to get the hang of it -- connecting, putting in an address, typing. Users just aren't used to the idea and protocol yet.

 

      The two main approaches to browsing the web through speech are voice browsers and screen readers.
      Voice browsers allow voice-driven navigation, some with both voice-in and voice-out, and some allowing telephone-based web access.
      Screen-readers are used to allow navigation of the screen presented by the operating system, using speech or Braille output.

        Voice Browsers

      Voice Browsers offer the promise of allowing everyone to access Web-based services from any phone, making it practical to access the Web any time and anywhere, whether at home, on the move, or at work. It is common for companies to offer services over the phone via menus traversed using the phone's keypad. Voice Browsers offer a great fit for the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet. Users will be able to choose whether to respond by a key press or a spoken command. Voice interaction holds the promise of naturalistic dialogs with Web-based services.
      Voice browsers allow people to access the Web using speech synthesis, pre-recorded audio, and speech recognition. This can be supplemented by keypads and small displays. Voice may also be offered as an adjunct to conventional desktop browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen, for instance in automobiles where hands/eyes free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.
      Voice browsers offer the potential to expand the reach of the Web beyond the desktop or laptop and offer information in ways that most content authors have never imagined. Examples include telephone access to web pages; browsers for the visually impaired; hands-free web surfing while driving a car; reading and language instruction for children and adult learners; and intelligent alarm clocks that parse the day's news and present summaries upon verbal request.

       Examples of voice browsers:

  • ConversaWeb
    Voice-activated browser allowing spoken selection of links using "saycons".

  • webHearit
    A telephone-based tool using the telephone keypad as an interface to navigate suitably configured pages.

  • SpeecHTML
    A subscription service from Vocalis, allowing a participating site to provide telephone access using voice commands.

  • TelWeb
    An experimental telephone-based browser allowing access to any site using voice and dialled commands.

      

       Screen Readers

       A screen reader is software that works together with a speech synthesizer to read aloud everything contained on a computer screen, including icons, menus, text, punctuation, and control buttons.
       Without screen readers, computers would be practically impossible for visually impaired people to use. But as with any assistive technology, screen readers can benefit many more people than the audience they were originally intended for. Potential users of screen readers include students of pronunciation, people learning languages with an orthography different from that of their native language, and people learning to read.
       Screen readers can provide pronunciation models for students working to improve their pronunciation of a foreign language. The software can pronounce unfamiliar words for a student and also gives students a way to check their own pronunciation against a model. Although most screen readers rely on imperfect speech synthesis software, most still produce native-like pronunciation, with rhythm and intonation being the weakest aspects of synthesized speech.
      An even bigger benefit is possible for learners of languages whose orthography differs from their native writing system. For example, students learning Chinese often have difficulties with the huge number of characters that one must learn in order to read Chinese. For these students, Chinese-enabled screen readers could read authentic web content aloud, allowing them to access authentic resources in the target language despite their difficulties with the writing system.

      Examples of screen readers:

  • ASAW from Microtalk:
    (DOS, Windows 95/98/ME) speech.

  • HAL from Dolphin:
    (DOS, Windows 95/98/ME and NT) speech and Braille.

  • JAWS For Windows from Freedom Scientific
    (DOS, Windows 95/98/ME and NT) speech and Braille.

  • Lookout from Choice Technology
    (Windows 95/98/ME)

  • OutSpoken from Alva:
    (Windows 95/98/ME, Macintosh) speech and Braille.


         Recommendations

       Much of the WWW’s power comes from the fact that it presents information in a variety of formats while also organizing that information through hypertext links.
       To create resources that can be used by the widest spectrum of potential visitors rather than an idealized "average," Web page designers should apply "universal design" principles. Universal design techniques can be applied in the design of packaging, software, appliances, transportation systems and many other products and services. Following universal design principles in creating a Web resource ensures that all Internet users can get to the information at a Web site regardless of their abilities, their disabilities, or the limitations of their equipment and software.
      Current cell-phone-enabled Web sites are carefully designed to meet the requirements of the cell service provider, since cellular bandwidth is minimal by Web standards: a site not carefully designed for audio use will swamp a cell connection running at 16 kbps or even 4 kbps.
      Music, for example, doesn't travel well at all, and even voice may be distorted at times, as any cell phone user can attest.
      Graphics may be either impossible or very limited by both bandwidth and display capabilities.
      Navigation is also limited by the type of input devices available, typically a numeric keypad and two or three simple pushbuttons. In this sort of environment, simply reading a visual site aloud, which Cascading Style Sheets Level 2 (CSS2) and above make technically possible, is impractical. A visual page has too much information on it for this approach to work in most cases. The navigation elements may or may not be suitable for keypad selection; most often they are not.
      Most aural sites will therefore have to be redesigned from the ground up to operate in parallel with a visual site. This need not require any more complexity from the Web server than recognizing the user agent or browser that requests a given URL and serving the version of the site that suits the capabilities of the user and of his or her tool for accessing the Web.
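
      The server-side selection described above can be sketched in a few lines. This is only an illustration: the marker strings and the variant names ("aural", "wml", "visual") are assumptions invented for the example, and a real server would need a maintained list of the user agents it actually sees.

```python
# Hypothetical substrings that identify a class of browser in the
# User-Agent header; real deployments would need their own list.
VARIANT_MARKERS = (
    ("aural", ("voicexml", "voicebrowser")),
    ("wml", ("wap", "wml")),
)


def choose_site_version(user_agent):
    """Pick which parallel version of the site to serve."""
    ua = user_agent.lower()
    for variant, markers in VARIANT_MARKERS:
        if any(marker in ua for marker in markers):
            return variant
    # Anything unrecognized gets the ordinary visual site.
    return "visual"


for header in (
    "ExampleVoiceBrowser/1.0",
    "ExamplePhone WAP/1.1",
    "Mozilla/4.0 (compatible; MSIE 5.5)",
):
    print(header, "->", choose_site_version(header))
```

      Falling back to the visual site for unknown agents keeps the rule simple, at the cost of serving an unsuitable page to a voice browser that is not on the list.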

 

       Guidelines

       Web pages should never be designed for certain types of browsers, but for all (potential) uses of information. All web documents should be equally accessible to visual and voice browsers.

       Design guidelines:

1. Code for content, not form

HTML tags should serve the logical organization of the information content of the page, being used as structural and not formatting elements. For example, heading tags should be used to denote section headings, instead of enlarging text.

2. Attach ALT attribute to each image

Screen readers speak the content of the ALT attribute of each <IMG> tag, so images with ALT text make pages friendlier to screen readers.

3. Avoid using tables for formatting

Table-formatted pages are hard for screen readers and voice browsers to process. Use the block-level formatting directives of CSS instead.

4. Avoid using frames

Screen readers have difficulties with framesets. Use the <NOFRAMES> tag to provide content that can be read.

5. Use Aural Style Sheets

Aural Style Sheets are part of the Cascading Style Sheets, Level 2 [CSS2] specification, and provide for a level of control in "spoken" text roughly analogous to that for displayed/printed text. The use of an aural style sheet allows the author to specify characteristics of the spoken text such as volume, pitch, speed, and stress, indicate pauses and insert audio "icons" (sound files) and show how certain phrases, acronyms, punctuation, and numbers should be voiced.

6. Use audible horizontal and vertical markers

Explore the ability of the computer to generate a tone of a particular pitch and duration, or some other sound, when certain patterns of horizontal and vertical spaces occur in the document. These tones or sounds could be user defined through screen reader macros.

7. Use font alerts

         Relate font size to vocal pitch or an audible tone. For example, the bigger the font, the lower the pitch of the voice or the lower the tone used to represent that font.
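
Several of the design guidelines above (structural headings, ALT text, tables, frames) can be checked mechanically. The following is a minimal sketch using Python's standard HTML parser; the class name AccessibilityAudit, the wording of the messages, and the sample page are assumptions for illustration, and the checks are far from a complete accessibility test.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class AccessibilityAudit(HTMLParser):
    """Collect a heading outline and flag a few guideline violations."""

    def __init__(self):
        super().__init__()
        self.outline = []    # (level, heading text) in document order
        self.problems = []   # human-readable warnings
        self._level = None   # level of the heading being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in HEADINGS:
            self._level = int(tag[1])
            self._text = []
        elif tag == "img" and "alt" not in attributes:
            self.problems.append(
                "img without alt: %s" % attributes.get("src", "?"))
        elif tag == "table":
            self.problems.append(
                "table found: check it is not used for layout")
        elif tag == "frameset":
            self.problems.append(
                "frameset found: provide noframes content")

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


page = """
<h1>Flight schedules</h1>
<img src="logo.gif">
<img src="map.gif" alt="Route map">
<h2>Departures</h2>
<table><tr><td>layout</td></tr></table>
"""

audit = AccessibilityAudit()
audit.feed(page)
audit.close()

for level, title in audit.outline:
    print("  " * (level - 1) + title)
for problem in audit.problems:
    print("warning:", problem)
```

The heading outline is the kind of structure a voice browser could read out as a table of contents, which is only possible when headings are marked up as headings rather than as enlarged text.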

       Navigation guidelines

1. Support reading of hypertext links

Users could hear a list of links and then say the number of the link they want to follow. A more sophisticated voice browser would allow the users to say a few words to indicate which link they are interested in. The browser could use simple template matching rules to select a matching link.

2. Easily navigate between windows

Support strategies that allow users to quickly move and listen to information anywhere in the document (search the document for the next sentence, paragraph or title).
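
The link selection described in the first navigation guideline can be sketched as follows. This is an illustrative assumption rather than how any particular voice browser works: it supposes the recognizer hands back plain text (with spoken numbers already converted to digits), and the function name select_link and the sample links are made up for the example.

```python
def select_link(links, utterance):
    """Return the URL the user most likely asked for, or None.

    `links` is a list of (label, url) pairs in the order they were
    read aloud; `utterance` is the recognized text.
    """
    words = utterance.lower().split()

    # Case 1: the user said the number of the link.
    if len(words) == 1 and words[0].isdigit():
        index = int(words[0]) - 1
        if 0 <= index < len(links):
            return links[index][1]
        return None

    # Case 2: the user said a few words; pick the label that shares
    # the most words with the utterance (a very simple template match).
    best_url, best_score = None, 0
    for label, url in links:
        score = len(set(label.lower().split()).intersection(words))
        if score > best_score:
            best_url, best_score = url, score
    return best_url


links = [
    ("Stock quotes", "/stocks"),
    ("Flight schedules", "/flights"),
    ("Check email", "/email"),
]

print(select_link(links, "2"))
print(select_link(links, "check my email"))
print(select_link(links, "weather"))
```

When no label shares a word with the utterance the function returns None, which a browser could treat as a cue to read the list of links again.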

       Speech guidelines

1. Make available a word pronunciation dictionary

In certain cases, the speech synthesizer is unable to correctly pronounce certain words or symbols found in a document. Most DOS screen readers do give the user the ability to change the pronunciation of a word or other groups of characters. This utility was added to screen readers when speech synthesizers were not very good at pronouncing all words accurately.

2. Make necessary exceptions in the period pause rule

Currently, most screen readers pause briefly at a period in order to enhance grammatical clarity. There are, however, conditions where these pauses are not desirable: in a table of contents where periods are used to separate the title from the page number, or in a number containing one or more periods, such as a dollar amount or a number designating a discrete document section.

3. Create a standard for speech grammars

None of the recognized media descriptors enables the specification of a speech recognition grammar.
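
The first two speech guidelines can be illustrated with a short sketch: a pronunciation dictionary applied before text reaches the synthesizer, and a period rule that skips the pause inside numbers and dot leaders. The dictionary entries, the function names and the exact exception rules are assumptions chosen for the example, not the behavior of any existing screen reader.

```python
import re

# Hypothetical user-editable dictionary: written form -> spoken form.
PRONUNCIATIONS = {
    "WWW": "world wide web",
    "IVR": "interactive voice response",
}


def apply_pronunciations(text, dictionary):
    """Replace whole words found in the dictionary with their spoken form."""
    if not dictionary:
        return text
    # Longest keys first so a longer entry wins over a shorter prefix.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")
    return pattern.sub(lambda match: dictionary[match.group(1)], text)


def pause_positions(text):
    """Return the indexes of periods that should be read as a pause."""
    positions = []
    for match in re.finditer(r"\.", text):
        i = match.start()
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if before.isdigit() and after.isdigit():
            continue  # inside a number such as 12.50 or section 3.2
        if before == "." or after == ".":
            continue  # dot leaders, as in a table of contents
        positions.append(i)
    return positions


print(apply_pronunciations("The WWW and IVR systems", PRONUNCIATIONS))

sentence = "It costs $12.50. See section 3.2 now."
print([sentence[:i + 1] for i in pause_positions(sentence)])
print(pause_positions("Introduction....5"))
```

Both rules are deliberately conservative; abbreviations such as "e.g." would still trigger a pause and would need their own dictionary entries or exceptions.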

 

       Websites

      Nuance
      www.nuance.com

Nuance develops, markets and supports a voice interface software platform that makes the information and services of enterprises, telecommunications networks and the Internet accessible from any telephone. Nuance is also driving the creation of the Voice Web and delivering software for V-Commerce (voice-enabled e-commerce) services and applications.

      Dragon Systems, Inc.
      www.dragonsys.com

Dragon Systems is a worldwide leader in PC and Mac speech recognition. Its best-known products are Dragon Dictate, which uses the discrete speech model (the best solution for persons with difficulty in language processing or in fluid speech), and Dragon NaturallySpeaking, which uses the continuous speech model.

      Lernout & Hauspie
      www.lhs.com/voicexpress

Lernout & Hauspie products are based on speech recognition technology developed by Kurzweil, a major pioneer in speech recognition. The current L&H product line is called VoiceXpress, which extends natural language support to include the Microsoft Office suite, plus Internet Explorer.

      IBM
      www.software.ibm.com/speech/

IBM is a major player in the speech recognition field. Its discrete speech product, IBM VoiceType, was a major competitor of Dragon Dictate. Its current product line, ViaVoice Millennium, focuses on the continuous speech model.

      Interactive Telesis Inc.
      www.interactivetelesis.com

Interactive Telesis builds and hosts voice-recognition applications, but unlike NetByTel, it doesn't focus specifically on Web-based e-commerce applications. Its customer list includes university records offices and accounting firms. The types of applications it builds range from voice-mail access to interactive voice response programs, in which users punch buttons on touch-tone phones to input commands.

       World Wide Web Consortium
       www.w3.org/Voice

The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.
The site offers a series of guidelines that support voice access to the web: speech grammars and synthesis, pronunciation lexicon, natural language representation.

       Web usability
       http://trace.wisc.edu/world/web/

The site includes numerous links to resources on the web about universal accessibility and usability. The links are grouped by sections: accessible web site guidelines; web access tools; disability and web use; discussion forums; organizations, projects and technologies addressing web access, etc.

       

       Conclusions

       Speech technology contributes to universal design by creating access mechanisms that appeal to people both with and without disabilities. The state of the art in speech technology, however, limits the ability of speech interfaces to create truly universal access. The development of futuristic technologies must be carefully explored, considering the user, the task, the context and the technology through human-computer interaction and human factors engineering methods, tools and techniques.

 

       Resources

1. Kynn Bartlett - Web Authoring Strategies for Voice Browsers
http://www.hwg.org/opcenter/w3c/voicebrowsers.html

2. W3C - Voice
http://www.w3.org/Voice

3. W3C - Alternative Web Browsing
http://www.w3.org/WAI/References/Browsing

4. Gregg Vanderheiden - Making screen readers work more effectively on the web
http://trace.wisc.edu/archive/screen_readers/screen.htm

5. Terry Thompson - Choosing and Using Speech Recognition Technology
http://www.dcp.ucla.edu

6. Rajeev Agarwal - Voice Browsing the Web for Information Access
http://www.w3.org/Voice/1998/Workshop/RajeevAgarwal.html

7. Ben Shneiderman - Designing the User Interface
Addison-Wesley, 1998

8. Daryle Gardner-Bonneau - Human Factors and Voice Interactive Systems
Kluwer Academic Publishers, 1999