Microsoft pitches voice spec

Microsoft Corp. has welcomed a new addition to its server family: the Speech Server.

Running on Windows Server 2003, the first public beta of Speech Server will ship with Beta 3 of Microsoft’s Speech Application SDK (Software Development Kit) in what signals speech technology’s return to the corporate agenda.

Due for release to manufacturing before mid-2004, the product will include a text-to-speech engine from SpeechWorks International Inc., Microsoft’s own speech-recognition engine, and a telephony interface manager. The offering will also include middleware that is being designed in partnership with Santa Clara, Calif.-based Intel and Dallas-based Intervoice to connect the Microsoft product to an enterprise telephony infrastructure.

But it is the server’s SALT (Speech Application Language Tags) voice browser that sets Microsoft apart from the standards crowd.

Rather than adhering to VXML (Voice XML) – the current W3C standard for developing speech-based telephony applications – Speech Server is compatible only with applications that use the specifications developed by the SALT Forum, of which Microsoft is a founding member.

The SALT Forum has submitted its specifications to a W3C working group, but they are far from becoming a standard.

“The process could take years,” admitted James Mastan, director of marketing for the speech technologies group at Redmond, Wash.-based Microsoft.

The SALT specification was originally targeted at the multimodal market for browsing the Web on handheld devices. The theory was that users required multiple ways to interface with smaller devices and that voice would be chief among them, but the market for multimodal handhelds has not materialized.

Microsoft executives believe the SALT-based Speech Server is ideally suited to call centers where the cost of using live operators is becoming prohibitive.

An InStat/MDR research report stated that live agents cost US$1 to $5 per call as opposed to 20 cents for a speech-recognition system.

“This is not a desktop solution but an enterprise application,” said Elizabeth Herrell, an analyst at Forrester Research Inc. in Santa Clara, Calif.

Bill Meisel, a principal at TMA Associates Inc., a leading speech technology research company based in Tarzana, Calif., said enterprise voice adoption will increase due to Microsoft’s market influence. Yet, because Speech Server will compete directly with established VXML applications, Microsoft’s actions will make speech technology adoption a more complex exercise for the enterprise, according to Meisel.

Competing speech technology vendor IBM is a case in point. Big Blue supports VXML and the W3C standard, according to Gene Cox, director of mobile solutions at Armonk, N.Y.-based IBM Corp.

Cox said significant VXML applications already exist in the enterprise at companies such as AT&T Corp., General Motors Corp.’s OnStar division, and Sprint PCS.

“VXML conforms to all W3C royalty-free policies. But SALT is like Internet Explorer; it is free as long as you buy Windows,” Cox said.

The debate over which technology to use will not be fought out at the customer level, said Forrester’s Herrell, but rather by developers.

“Customers just want a solution that works. Developers will decide which platform to use based on its quality, and for that, it is too early to tell,” Herrell said.

VXML is a separate language that developers must learn, TMA’s Meisel said. For VXML to support Web-based applications, such as those residing in call centers, VXML must connect with the back-end servers.

“People are using J2EE to drive VXML applications and to provide a more standard interface to Web services,” Meisel said.

Irvine, Calif.-based NewportWorks Inc., an information service provider for the real estate industry, is one example of an IBM customer that will be hard to shift away from Voice XML. According to CEO Ken Stockman, the company could not exist without Voice XML. NewportWorks aggregates the data from the MLS (Multiple Listing Service), uses IBM’s WebSphere Speech Server to convert the listings for voice access, and sells the service to real estate agencies. The MLS data arrives in one of three ways: through a real estate industry XML gateway called RETS (Real Estate Transaction Standard), as a flat file sent via FTP, or through a SQL view that the MLS creates.

“Our service couldn’t exist without speech technology. The economics of a call center don’t work,” said Stockman, who added that the company has not investigated SALT-based technologies. Regardless, NewportWorks has thrown its support behind Java, which provides most of the heavy lifting for its solution.

A Java application layer manages the interaction between client and data and dynamically generates the VXML, while IBM WebSphere handles the calls.
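The pattern described here, in which a Java layer builds the VXML for each request, can be sketched roughly as follows. This is a minimal illustration, not NewportWorks’ or IBM’s code: the `ListingVxml` class, the `Listing` fields, and the prompt wording are all hypothetical, and the markup assumes the `form`, `block`, and `prompt` elements of VoiceXML 2.0.

```java
import java.util.List;

// Illustrative sketch only: ListingVxml and Listing are hypothetical names,
// not NewportWorks or IBM code.
final class ListingVxml {

    // Hypothetical listing record; the fields are assumptions for the example.
    static final class Listing {
        final String address;
        final int bedrooms;
        final int priceUsd;

        Listing(String address, int bedrooms, int priceUsd) {
            this.address = address;
            this.bedrooms = bedrooms;
            this.priceUsd = priceUsd;
        }
    }

    // Escape the characters that would break XML text content.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Build one VoiceXML document that reads each listing aloud.
    static String render(List<Listing> listings) {
        StringBuilder vxml = new StringBuilder();
        vxml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        vxml.append("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n");
        vxml.append("  <form id=\"listings\">\n");
        vxml.append("    <block>\n");
        if (listings.isEmpty()) {
            vxml.append("      <prompt>No listings matched your search.</prompt>\n");
        }
        for (Listing listing : listings) {
            vxml.append("      <prompt>")
                .append(escapeXml(listing.address))
                .append(", ")
                .append(listing.bedrooms)
                .append(" bedrooms, ")
                .append(listing.priceUsd)
                .append(" dollars.</prompt>\n");
        }
        vxml.append("    </block>\n");
        vxml.append("  </form>\n");
        vxml.append("</vxml>\n");
        return vxml.toString();
    }
}
```

A call to `ListingVxml.render(...)` returns the document as a string, which a servlet could then write to its HTTP response for the voice browser to fetch. Generating the markup per request is what lets the same Java code sit in front of any of the data sources the company uses.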

Stockman said the learning curve on VXML for developers was negligible. Microsoft, on the other hand, argues that Web developers don’t want to learn a new language. Instead, they want SALT tag plug-ins for existing Web-based applications.

According to Intervoice, the argument may be resolved through tools such as its Invision, which allows a developer to generate VXML automatically and may be able to generate SALT code in the future.