
Audio Menus and Interfaces

Audio menus and interfaces are media that let people interact with computers through voice or speech in order to start an automated service or process. In the past, the prospect of controlling a machine by “talking” to it (as the characters in Star Trek did) was considered a fantasy and appeared mostly in sci-fi movies. In recent years, however, as the technology has advanced, audio interfaces have become more and more common in our lives.

Why voice user interfaces?

Several reasons lie behind the (relatively) recent development of voice user interfaces. When users’ hands and eyes are busy, such as when they are driving or testing equipment, audio menus are useful because speech is the only channel left for making a choice. Also, many mobile phones and hand-held devices have small screens, which makes displaying all the contents of a menu nearly impossible, so audio menus are helpful here. Audio menus are also essential in telephone interfaces: the options cannot be displayed for the caller to see, so they have to be read out. Furthermore, if the user is vision-impaired or has a physical disability, audio menus are more appropriate and beneficial, since a graphical user interface requires users to view the options and then select one with directional buttons.

Some voice user interfaces

Audio menus and voice user interfaces appear in many places in our lives. Many 3G-generation handphones had a built-in voice recognition feature that allowed users to give simple commands, such as “call Tom”, by speech. Voice recognition in smartphones has since become more and more advanced, and allows us to interact with them more effectively. Earlier this month, Apple released its new iPhone 4S, and one of its features is Siri, an application that serves as a “smart personal assistant”. This marks a new development in voice user interfaces, as well as in artificial intelligence. Siri is now an integrated part of the iOS 5 operating system.

Recently, Google Chrome has also released an add-on called Voice Search, which allows you to search the web by speaking.

Another example of a voice user interface is the Interactive Voice Response (IVR) system, a technology that automates interactions with telephone callers and is used in the customer service hotlines of many companies. It allows users to interact with the company’s database via a telephone keypad or via speech recognition, so that they can service their own inquiries by following the IVR dialogues. The system can respond with pre-recorded or dynamically generated audio to direct users on how to proceed.

IVR is also used in the audio guides of various museums around the world.

General structure of a speech-enabled system

Typically, a speech-enabled system consists of four components:

  1. A speech recognizer, which converts acoustical input into text strings.
  2. A speech analyzer to extract the meaning of the recognized text.
  3. A dialogue controller that performs predefined actions based on the extracted commands.
  4. A speech synthesizer (or speech generator) to generate acoustical output from the system’s answer.
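
The four components above can be pictured as a chain of functions, each feeding the next. The sketch below is only an illustration under heavy assumptions: there is no real audio processing, the “recognizer” receives a ready-made transcript, the “synthesizer” returns text instead of sound, and all function names and the two supported commands (“call” and “search”) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def recognize(audio: str) -> str:
    """Stand-in speech recognizer: the 'audio' is already a transcript here."""
    return audio.strip().lower()


@dataclass
class Command:
    action: str
    argument: Optional[str] = None


def analyze(text: str) -> Command:
    """Stand-in speech analyzer: first word is the action, the rest its argument."""
    words = text.split()
    if not words:
        return Command("unknown")
    return Command(words[0], " ".join(words[1:]) or None)


def control(command: Command) -> str:
    """Dialogue controller: map an extracted command to a predefined action."""
    handlers: Dict[str, Callable[[Optional[str]], str]] = {
        "call": lambda arg: f"Calling {arg}." if arg else "Who should I call?",
        "search": lambda arg: f"Searching for {arg}." if arg else "What should I search for?",
    }
    handler = handlers.get(command.action)
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(command.argument)


def synthesize(answer: str) -> str:
    """Stand-in speech synthesizer: returns the text that would be spoken."""
    return f"[spoken] {answer}"


def handle_utterance(audio: str) -> str:
    """Run one utterance through all four stages in order."""
    return synthesize(control(analyze(recognize(audio))))


if __name__ == "__main__":
    print(handle_utterance("Call Tom"))  # [spoken] Calling tom.
```

Keeping the stages separate means a recognition error can be told apart from a command the dialogue controller simply does not support.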

Evaluation

Advantages

Voice user interfaces have many advantages. For example, audio guides in museums have been successful because they allow users to control the pace while conveying the curator’s enthusiasm or the author’s emotion. Also, as discussed earlier, when users’ hands and eyes are busy, or when users have a disability, a voice user interface has a clear advantage over a graphical user interface. Moreover, studies have confirmed that when harsh or cramped conditions preclude the use of a keyboard (e.g. in underwater or rescue operations), voice user interfaces perform better than graphical interfaces.

Disadvantages

Speech recognition faces various obstacles:

  • Background noise can reduce recognition accuracy, so that the input the system registers differs from what the user actually said.
  • Cognitive load on users is higher than with a graphical interface, because as each option is read out, users need to compare it with their goal and place it on a scale between “zero match” and “exact match”.
  • Recognition errors might occur for similar-sounding words (such as dime/time or Houston/Austin).
  • Recognition is unstable across changing users, environments, and time. People from different regions may pronounce certain words differently, so the system might not be able to recognize them.

Without well-functioning speech recognition, the meaningful string might not be properly extracted, which might lead to the system being unable to produce the desired output.

Also, when users are driving, a voice user interface might become a distraction that keeps them from concentrating on the road, compromising their safety. Studies have shown that speaking is more demanding of users’ working memory than hand/eye coordination is. Speech draws on limited resources, while hand/eye coordination is processed elsewhere in the brain, allowing a higher level of parallel processing. In other words, multi-tasking is more difficult to achieve while speaking.

Furthermore, the slow nature of speech output means that reading the options out to users takes longer than displaying them and letting users scan them. This might lead to longer task completion times, which translates to lower productivity.

Another problem is that the ephemeral nature of speech can strain users’ short-term memory: they might forget the options that were read out earlier, and the system might have to repeat them. This, again, leads to longer task completion times and lower productivity.
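
As a rough back-of-the-envelope illustration of this point, the time needed to present a menu can be estimated from its word count and a presentation rate. The rates used below (150 words per minute for speech output, 300 for visually scanning a list) and the three menu options are illustrative assumptions only, not measured values.

```python
from typing import List

# Assumed, illustrative rates in words per minute (not measurements).
SPEECH_WPM = 150
SCAN_WPM = 300

# Hypothetical menu options.
OPTIONS: List[str] = [
    "check your account balance",
    "transfer money",
    "speak to an operator",
]


def seconds_to_present(options: List[str], words_per_minute: float) -> float:
    """Estimate how long it takes to present all options at the given rate."""
    total_words = sum(len(option.split()) for option in options)
    return total_words * 60.0 / words_per_minute


if __name__ == "__main__":
    print(seconds_to_present(OPTIONS, SPEECH_WPM))  # 4.0
    print(seconds_to_present(OPTIONS, SCAN_WPM))    # 2.0
```

Under these assumed rates the spoken menu takes twice as long, and the gap grows with every option added, because speech is strictly sequential while a displayed list can be skimmed.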

Some Design Considerations

With these advantages and disadvantages in mind, designers might have to consider a few things when designing audio menus and interfaces:

  • Complex menus should be avoided to help lower the cognitive load on users. The number of choices should be limited to about 3-4 to avoid memorization problems.
  • A mechanism to repeat the options must be implemented, as users might forget the first few options that were read out.
  • Users should be able to accept an option immediately after it is read out (if they feel that it is an exact match for what they want), in order to boost productivity and save their time. An option to increase the speed at which options are read out might also be implemented to meet this requirement.
  • If the system is going to be used in a noisy environment, an audio interface might not be advisable because of the possible interference.
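
The first three guidelines can be sketched as a small simulated menu loop. This is a minimal sketch, not a real IVR implementation: prompts and replies are plain strings, an empty reply stands for silence, and the names MAX_CHOICES, build_prompts and run_menu, as well as the limit of three read-through rounds, are hypothetical choices made for the example.

```python
from typing import Iterable, List, Optional, Tuple

MAX_CHOICES = 4  # assumption taken from the "about 3-4" guideline above


def build_prompts(options: List[str]) -> List[str]:
    """Turn the options into short spoken prompts, refusing over-long menus."""
    if len(options) > MAX_CHOICES:
        raise ValueError(f"offer at most {MAX_CHOICES} options per menu")
    prompts = [f"For {name}, say {index}." for index, name in enumerate(options, start=1)]
    prompts.append("To hear the options again, say repeat.")
    return prompts


def run_menu(
    options: List[str], replies: Iterable[str], max_rounds: int = 3
) -> Tuple[Optional[str], List[str]]:
    """Read prompts one by one; the user may answer after any prompt.

    `replies` holds one user response per prompt played ("" means silence).
    Returns the chosen option (or None) and the list of prompts played.
    """
    transcript: List[str] = []
    reply_iter = iter(replies)
    prompts = build_prompts(options)
    for _ in range(max_rounds):
        for prompt in prompts:
            transcript.append(prompt)
            reply = next(reply_iter, "").strip().lower()
            if reply == "repeat":
                break  # restart the read-out from the first option
            if reply.isdigit() and 1 <= int(reply) <= len(options):
                # Accept immediately instead of reading the remaining prompts.
                return options[int(reply) - 1], transcript
        # Silence through a whole round also leads to a repeat.
    return None, transcript


if __name__ == "__main__":
    menu = ["billing", "technical support", "opening hours"]
    choice, played = run_menu(menu, ["", "repeat", "", "2"])
    print(choice)       # technical support
    print(len(played))  # 4
```

In the usage example the caller asks for a repeat after the second prompt and then accepts option 2 as soon as it is read out, so only four prompts are played in total.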

Since IVR systems are commercially oriented and are normally used as a means of communication between companies and their customers, they must be as user-friendly as possible, so that anyone can use them without prior knowledge. A possible set of guidelines for IVR systems can be found here.

Conclusion

In conclusion, audio menus and interfaces are becoming more and more widely used in real life. Their development and application for users with disabilities have helped people who are paralyzed, bedridden, or injured to broaden the horizons of their lives, and these benefits are clearly very rewarding. However, these interfaces also have their disadvantages, most notably in the voice recognizer. It will be a long time before voice user interfaces completely replace the mainstream player, the graphical user interface, but as voice recognition technology develops, audio menus and interfaces will pose a big challenge to traditional interfaces. For now, they should be applied where appropriate, and care must be taken to ensure that performance is genuinely improved over other interaction strategies.

References:

B. Shneiderman, C. Plaisant (2010). Designing the User Interface.

G. Fiedler, P. Schmidt. Developing Interactive Voice Response Interfaces for Large Information Systems.
