Humans have been communicating using speech far longer than they have used the written word. But as humanity spread to the four corners of the earth, writing took over as the primary means of disseminating information.
The advent of the telephone, and later television, brought speech back to the forefront but, in the blink of an evolutionary eye, the Internet inundated the communication world with billions of pages of text, thus creating an information medium stand-off.
As it integrates itself into the Internet, voice will potentially never be dethroned again. Though we are still years from the Star Trek world of voice-activated talking computers, technology makers are fast developing the means to seamlessly meld the spoken and written word into a multidimensional Internet.
Today we use voice-enabled applications and voice-recognition technology for a variety of tasks but, for the most part, the technology is just a means of accelerating the information gathering process rather than a complete replacement for human interaction. Take directory assistance, for example. Today we call 411 and are asked whether we want to speak in English or French, which city we require and whether the number is business or residential. At that point, having narrowed down the possibilities, a human takes over since voice-enabled technology is not accurate enough.
Bob Miele, director of product management with Phonetic Systems Inc. in Burlington, Mass., said increased accuracy can be achieved when the technology is not searching for actual words and names but rather phonemes, the phonetic portions that comprise a word. Thus the name Brown, though a single syllable when spoken, is really three phonemes: brr, oww, and nn. Miele is confident in his own company’s technology. “What we would propose is to remove the operator and let our software…perform a match.”
But the list of surnames is prodigious (just check out any big city phone book) and the variety of English accents almost limitless. Thus, the potential number of ways to mispronounce Thiruppatankurunathan could confuse current voice recognition technology.
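The phoneme idea Miele describes can be illustrated with a rough sketch. What follows is not Phonetic Systems' software, whose workings are not described here, but a minimal example under stated assumptions: the directory entries, their phoneme spellings (written in the informal "brr, oww, nn" style used above) and the edit-distance scoring are all hypothetical, chosen only to show how a slightly misheard name can still land on the closest listing.

```python
def phoneme_distance(a, b):
    """Edit distance between two phoneme sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a phoneme from a
                            curr[j - 1] + 1,      # add a phoneme from b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical directory: the phoneme spellings are illustrative only.
DIRECTORY = {
    "Brown": ["brr", "oww", "nn"],
    "Crown": ["krr", "oww", "nn"],
    "Bray": ["brr", "ay"],
    "Green": ["grr", "ee", "nn"],
}


def best_match(heard):
    """Return the directory name whose phonemes are closest to what was heard."""
    return min(DIRECTORY, key=lambda name: phoneme_distance(heard, DIRECTORY[name]))


if __name__ == "__main__":
    # The final phoneme was misheard as "mm" instead of "nn".
    print(best_match(["brr", "oww", "mm"]))
```

Scoring on phonemes rather than spellings means a caller does not need to say a name the way it is written, only closely enough that no other listing scores better.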
i say tomato
Voice-enabled applications have their work cut out. At their core is recognition technology.
An example of how difficult it is to create foolproof voice recognition could be had by sitting down with a Brooklynite, an Australian, a Brit and a Newfoundlander. You then start to see what the technology is up against. The Brooklynite might get a blank stare as he asks for directions to the “boids” and “toitles” at the zoo. The Aussie might receive the same blank response ordering a “sex pack of beer.”
In creating its voice-enabled technology to surf the Internet, Redmond, Wash.-based Conversay Computing Corp. had to deal with these issues.
“You have to have a different language model for people who have significantly different accents,” said Ora Williamson, general manager of embedded solutions at Conversay. She said the company has different versions of its voice-enabled surfing software for the American and British markets. For the software, Conversa Web, to be able to understand the variances in accents, a sampling of hundreds of native speakers was required.
Steve Chambers, vice-president of world-wide marketing for Boston-based SpeechWorks, said even after installing a voice-enabled call centre solution, for example, you can have unforeseen problems. After all, technology is only as useful as its user makes it.
“You never know how people are going to react to certain prompts until you say them,” he explained. When SpeechWorks implemented a flight information solution for United Airlines, they were surprised when customers said “Chicago” when asked for the name of their destination state. A human call centre operator would instantly catch the mistake, but a machine is looking for an expected response and is ill prepared when it doesn’t get it.
Or consider the Hewlett-Packard centre that had callers asking for information on the Sony Trinitron. “We never would have known that the 1-800 number was one digit off from Sony, unless we had done a pilot,” Chambers said.
the voice Web cometh
The voice Web will be bigger than the world wide Web, predicted Bruce Eibsvik, vice-president of sales and marketing of Voice Genie Technologies Inc. in Toronto.
He gave two reasons for this. First, there is not, and may never be, 100 per cent adoption of the PC. Second, the wireless environment is expanding at an incredible rate. Eibsvik does not envision the voice Web as a group of people using the telephone to surf the Internet, but rather an environment with voice services. “We will use the phone much the way we use it today, not as a Web surfing tool but a communications tool and a transactions tool.”
The wireless world is driving the growth of an entire area of voice-enabled applications as users want access from all of their peripherals, from cell phones to PDAs to a talking car.
“What the voice Web is doing is extending the Web to the mobile environment,” Eibsvik said.
Tom Houy, manager of client systems marketing for IBM Corp.’s voice systems in West Palm Beach, Fla., agrees. “The reality of it is voice will be the interface of choice for almost every mobile device out there.”
“The voice interface is very nice in that it gets rid of what most people consider as menu hell,” he explained. Instead of poking and prodding your PDA to get what you want, you will just talk to it.
When is my next appointment? What is Tom’s number?
“It will be significantly faster,” Houy said. “The reason that people would choose voice over everything else is [simply] ease of use.”
did you get that?
With a variety of financial institutions moving toward voice-enabled on-line stock trading, the potential for mistakes is huge, especially in the wireless world. Not to mention the nightmare of drivers screaming down a highway while yelling into their cell phone, “You #@!& computer! I said buy 1,000 shares of Verizon not Horizon.”
Landlines are simply less prone to error than their wireless counterparts. “The (wireless) system is tuned very differently because you have to make assumptions. For example, when you say ‘Boston’ if the ‘B’ was dropped it would sound exactly like Austin. Well one artefact of a cell phone is you do have some dropped packets and you might lose that first syllable,” Chambers explained.
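Chambers' Boston and Austin example can be sketched in a few lines. This is only an illustration of the kind of assumption he describes, not SpeechWorks' actual tuning: the city list, the phoneme spellings and the `allow_dropped_onset` switch are hypothetical. The point is that a wireless-tuned matcher has to entertain the possibility that the first sound never arrived, which turns one clear answer into two candidates that need confirming.

```python
# Hypothetical city list; the phoneme spellings are illustrative only.
CITIES = {
    "Boston": ["b", "aw", "s", "t", "ih", "n"],
    "Austin": ["aw", "s", "t", "ih", "n"],
    "Denver": ["d", "eh", "n", "v", "er"],
}


def candidates(heard, allow_dropped_onset):
    """Return every city consistent with the phonemes that were heard.

    With allow_dropped_onset set (the assumed wireless tuning), a city also
    counts as a candidate if what was heard matches it minus its first phoneme.
    """
    matches = []
    for city, phonemes in CITIES.items():
        if heard == phonemes:
            matches.append(city)
        elif allow_dropped_onset and heard == phonemes[1:]:
            matches.append(city)
    return matches


if __name__ == "__main__":
    heard = ["aw", "s", "t", "ih", "n"]
    print(candidates(heard, allow_dropped_onset=False))  # landline assumption
    print(candidates(heard, allow_dropped_onset=True))   # wireless assumption
```

When more than one candidate comes back, an application would presumably ask the caller to confirm ("Did you say Boston or Austin?") rather than guess.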
OnStar, a wholly-owned subsidiary of General Motors Corp., is offering its virtual advisor platform in some GM cars. According to Ed Chrumka, manager of advanced technology for the Troy, Mich.-based company, the key to success was to crawl before walking. The technology has no screen or display, just a three-button module. He said the company worked closely with linguists and what he called human factor engineers to make sure OnStar had an extremely easy-to-use conversational interface.
The technology is speaker independent and allows for continuous speech, meaning users do not have to train the system to their voice.
“It doesn’t recognize a specific person’s voice; it recognizes the grammars that are used,” he explained.
The majority of the horsepower is located off-board at the OnStar servers, which are cell connected to the car using a proprietary data protocol. A user can contact the virtual advisor to find news, weather, sports and stock quotes. There is also an e-mail reader, letting drivers listen to messages while they are caught in traffic.
“It is interesting to note that only a very small percentage of people actually want to surf the ‘net or do e-mail in their vehicle but that small population of users use it a lot, every day, so as a percentage of usage it is actually pretty high,” he said.
The company plans on adding more features including real-time traffic updates and potentially creating partnerships with financial institutions in order to supply real-time financial transaction information to subscribers.
“We also install a specific algorithm on board that helps the voice recognition performance reach an optimum level,” he said. This helps reduce the effects of ambient noise and screaming kids.
“[The application] is not just accommodating background noise but also allowing people to give answers in a very wide variety of ways so you can say yes, yeah, OK, that’s right,” said Paula Skokowki, vice-president of marketing at General Magic Inc. The Sunnyvale, Calif.-based company developed the speech application for the OnStar portal.
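Accepting "yes, yeah, OK, that's right" as the same answer amounts to mapping many phrasings onto one meaning. A minimal sketch of that idea follows. It assumes the recognizer has already turned speech into text, and the phrase lists and the `interpret` function are hypothetical, not General Magic's grammar.

```python
# Hypothetical phrase lists; a real grammar would be far larger.
AFFIRMATIVE = {"yes", "yeah", "ok", "that's right"}
NEGATIVE = {"no", "nope", "that's wrong"}


def interpret(utterance):
    """Map a recognized phrase to "yes", "no", or None if it is not understood."""
    text = utterance.strip().lower().replace("\u2019", "'")  # straighten curly apostrophes
    text = text.rstrip(".!?")                                 # ignore trailing punctuation
    if text in AFFIRMATIVE:
        return "yes"
    if text in NEGATIVE:
        return "no"
    return None


if __name__ == "__main__":
    for phrase in ["Yeah", "OK", "That's right.", "Nope", "maybe"]:
        print(phrase, "->", interpret(phrase))
```

Returning `None` for anything outside the lists gives the application a clear signal to ask the question again instead of acting on a guess.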
As the various forms of voice-enabled technologies begin to converge and overlap, creating a standard protocol is extremely important. That is where VoiceXML comes in.
“The biggest thing happening in the wireless Internet is the development of open-standards voice XML,” Eibsvik said.
“VoiceXML offers interoperability and that is really I think what will unleash the voice Web market.”
This means that programmers can write an application once and run it anywhere and access it from any phone, he explained. Instead of silos of separate information we will have the ability to voice link all these applications, he said.
“We have been using the phone for a hell of a long time, I think [the voice Web] is really going to revolutionize the phone.
“It is slowly happening in bits and pieces but essentially what you are going to see is the eradication of the dial pad,” he predicted.
“The concept of dialling phone numbers is going to go away.”