Voice technologies something to shout about

Embedded speech recognition, developing in tandem with the growth of distributed speech recognition (DSR) systems, will be the next important class of voice technologies in Asia, said Steve Chambers, vice president of Worldwide Marketing at SpeechWorks International Inc.

Unlike text-to-speech and automatic speech recognition software, which resides on the server, embedded technologies reside in devices such as mobile phones and PDAs (personal digital assistants). They enable users to use voice to interact with the device, for example, to pull out the calendar, e-mail or other information, said Chambers, who was speaking at a seminar on “Voice: Technologies, Applications and Services”, which was held in Singapore last week.

With DSR, such embedded speech recognition systems can be integrated with speech recognition software residing on the server. Because the device tends to be closer to the source of the speech, as when a person talks into a mobile phone, voice capture will be more accurate, said Chambers.

This paves the way for new and enhanced voice applications.

The European Telecommunications Standards Institute (ETSI), which created a new working group last year to consider standards for DSR client-server protocols, gave the example of a user dictating meeting notes directly into a voice-enhanced mobile handset. By the time he returns to the office, the draft text is ready on his PC for editing.

Other factors in favour of speech deployments in Asia include the difficulty of language input using the keypad, and the prevalence of code mixing in natural speech.

In other voice-related developments, Chambers said new technologies would also emerge to overlay video onto speech. For example, a user will be able to speak to the PDA and retrieve visual information. “The technology will be multi-modal, involving wireless, audio in, audio out, and tapping on the PDA,” he said.

Chambers gave the example of a user asking the PDA for directions. The device will then be able to communicate with a server to retrieve a map.

“Voice will be used to pull rich media into devices,” said Chambers. “We’re looking at these emerging in late 2002 or in 2003.”

Currently, there are speech recognition deployments in the automotive sector, where car controls can be operated using voice. Other examples of voice deployments include Federal Express, which uses voice applications to give customers its rates; Yahoo, which uses a text-to-speech system enabling subscribers to dial in and have their e-mail read out to them; and E*Trade, which uses a natural language system for stock trading.

Commenting on the Asian opportunity, Alex Leung, CEO of InfoTalk, noted that in China, for example, there are 250 million phones but only six million PCs connected to the Internet. “With the wireless explosion, the ratio can only go up,” he said. “The mobile and handheld are still the easiest ways to access information,” he added, noting that not everyone knows how to use a computer.

The fact that speech recognition is taking off now, after being the stuff of fiction for decades, is also due to the confluence of two factors – greater efficiency in speech technology, and faster processor speeds, said Leung. “According to Moore’s law, processors are becoming faster every year. At the same time, speech technology, the processing of speech algorithms, becomes more efficient. Every two years, the need for processing power decreases exponentially,” he pointed out.

“This is one of the major sparks for commercialization. Commercialization becomes viable,” he said.

ETSI has noted that by using a client/server approach in combination with the latest recognition systems, DSR will deliver the price/performance levels and access flexibility that will begin to make voice access to applications practicable and affordable.

According to the ETSI Web site, a DSR system overcomes problems such as the degradation in the performance of speech recognition systems due to transmission over mobile channels. The degradation is due to both the low bit-rate speech coding and channel transmission errors.

A DSR system overcomes this by eliminating the speech channel and instead using an error-protected data channel to send a parameterized representation of the speech, which is suitable for recognition. The processing is distributed between the terminal and the network. The terminal performs the feature parameter extraction, or the front-end of the speech recognition system. These features are transmitted over a data channel to a remote “back-end” recognizer.

The end result is that the transmission channel does not affect the recognition system performance and channel invariability is achieved.
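
The division of labour described above can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the ETSI front-end: the frame size, the single log-energy feature per frame, the CRC-32 checksum (which only detects errors and does not correct them) and the threshold-based "recognizer" are all hypothetical stand-ins chosen for brevity, as are the function names. The point is only that the terminal sends compact features with error protection over a data channel, rather than coded speech over a voice channel.

```python
import math
import struct
import zlib

FRAME_SIZE = 80  # samples per frame (assumption: 10 ms of audio at 8 kHz)


def extract_features(samples):
    """Terminal side (front-end): reduce each full frame to one log-energy value."""
    features = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        energy = sum(s * s for s in frame) / FRAME_SIZE
        features.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return features


def pack_features(features):
    """Terminal side: serialize the features and append a CRC-32 checksum."""
    payload = struct.pack(f"<{len(features)}f", *features)
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_features(packet):
    """Server side: verify the checksum, then recover the feature values."""
    payload = packet[:-4]
    (crc,) = struct.unpack("<I", packet[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupted packet")
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


def recognize(features):
    """Server side (back-end): placeholder that labels each frame by energy."""
    return ["speech" if f > 0.0 else "silence" for f in features]
```

In use, the terminal would call extract_features and pack_features on captured audio, and the server would call unpack_features and recognize on the received bytes. A packet damaged in transit is rejected at the checksum step instead of being passed to the recognizer as distorted input, which is the sense in which the recognizer is shielded from channel errors.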