Today, well-engineered speech recognition systems achieve high customer satisfaction and high returns on investment in many customer service areas, including stock trading, flight information, catalog ordering and directory assistance. Although speech automation's potential has become widely recognized, few IT organizations have had the means to build or maintain speech systems, relying instead on expensive services from speech engine vendors or specialist system integrators. One major impediment to speech development efforts was removed when the industry adopted open standards and Web technologies familiar to mainstream IT organizations. However, a larger obstacle still remains: speech development methodologies and tools must improve to address the unique demands of voice user interfaces before mainstream enterprises can reliably deliver high quality speech systems at a reasonable cost.
The First Step: Open Speech Standards
The earliest development approaches required programming in the application program interface (API) specific to each speech recognition engine. This approach burdened developers with low-level, recognition engine-specific details such as exception handling and resource management. Moreover, the proprietary nature of these APIs restricted the flexibility with which enterprises could deploy applications. Most software components had to be sourced from a single vendor and had to be deployed in a single location, and the resulting applications could not be easily ported to other platforms.
The advent of voice languages such as VoiceXML and SALT contributed to a Web-based development process. These languages allow a distribution of responsibilities in a speech system between a voice browser, which performs the speech recognition function, and a server application, which contains the application logic and user interface behavior (expressed in the voice language). As a result, application developers no longer concern themselves with speech engine API calls, but instead are responsible for generating documents that can be executed by the voice browser.
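As a rough illustration of this division of labor, the following Python sketch shows a server application generating a small VoiceXML 2.0 document for a voice browser to execute. The function name, form id, field name and both URLs are hypothetical placeholders, not part of any particular product. The elements used (vxml, form, field, prompt, grammar, filled, submit) come from the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # namespace defined by VoiceXML 2.0


def build_confirmation_form(submit_url: str, grammar_url: str) -> str:
    """Return a VoiceXML 2.0 document that asks for a confirmation number.

    submit_url and grammar_url are hypothetical endpoints supplied by the caller.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "reconfirm"})

    # One field collects one data item; the voice browser plays the prompt
    # and listens using the referenced SRGS grammar.
    field = ET.SubElement(form, "field", {"name": "confirmation_number"})
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Do you have your confirmation number?"
    ET.SubElement(field, "grammar",
                  {"src": grammar_url, "type": "application/srgs+xml"})

    # Once the field is filled, hand the result back to the server application.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit",
                  {"next": submit_url, "namelist": "confirmation_number"})

    return ET.tostring(vxml, encoding="unicode")
```

The server never calls a recognition engine here. It only emits markup, and the voice browser handles recognition, prompting and the HTTP submission.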
VoiceXML (Voice Extensible Markup Language) is a standard endorsed by the World Wide Web Consortium (W3C) for speech application development. The first specification was released in March 2000 by the VoiceXML Forum (www.voicexml.org/), an industry body that now has 375 member companies, including IBM, Nuance, Motorola and AT&T. The latest version, VoiceXML 2.0, became a W3C recommendation in March 2004. VoiceXML voice browsers are already available through dozens of vendors; in all, a hundred or so vendors provide compliant products. Commercial VoiceXML deployments have been estimated in the thousands.
SALT is a newer standard, proposed by the SALT Forum (www.saltforum.org/), and is somewhat competitive with VoiceXML. The intent of SALT is to facilitate multimodal applications, allowing spoken interfaces to be used in conjunction with a keyboard and a display screen, so that Web pages can be accessed by different client devices. However, SALT can also be used to build voice-only applications, and one of its targets is to simplify speech application development. The major proponent of SALT is Microsoft, but many companies support both SALT and VoiceXML, including Intel, Cisco, HP and ScanSoft. Only a few SALT voice browsers are currently available. The most prominent is Microsoft's Speech Server, which has attracted developer interest due to its integration with Microsoft's .NET framework. To date, SALT has few publicly announced commercial deployments.
VoiceXML is a larger language that contains its own procedural and transport elements. In contrast, SALT is a lightweight extension to existing markup languages, most notably HTML and XHTML. SALT tags are embedded within the HTML DOM (document object model) event and scripting environment, a model familiar to Web developers. Dialog flow is managed by combining SALT elements with DOM object properties, methods and events. This programming approach is well-suited to multimodal applications because visual and speech elements on a Web document are peers. VoiceXML, on the other hand, has constructs designed specifically for speech-only interfaces, such as dialogs with predefined execution flows.
Despite the competition, SALT supports various W3C standards associated with the VoiceXML standard, including SRGS, the W3C speech recognition grammar specification; SSML, the W3C language for controlling TTS (text-to-speech) pronunciation, emphasis and intonation; and ECMAScript, the scripting language specification. Moreover, SALT has been submitted to the W3C's Voice Browser working group, and some of its concepts may be incorporated into the next VoiceXML standard.
VoiceXML and SALT are both presentation layer languages that deliver a number of benefits. First, they are associated with a Web development model familiar to most programmers. Second, they support flexible deployment architectures--the voice browser and server application can be co-located or separated, and can be managed by the same or different entities. Third, they offer the prospect of application portability across different vendor platforms.
Much More Is Needed For High Usability
Despite these benefits, developing speech applications remains a complex undertaking. Industry estimates for delivering a customer-facing speech application of moderate complexity range from 3,000 to 6,000 person hours (including requirements analysis, dialog design, coding, source system integration, audio processing, testing and tuning), and first-time efforts can be considerably longer.
Building a highly usable speech system with existing VoiceXML and SALT tools is costly, slow and difficult. Most tools implement a development model similar to that used for creating a workflow application or a touch-tone menu tree. The developer is provided a palette of dialog components and a canvas on which these components can be sequenced with some transitional logic. The dialog components encapsulate all of the prompts, grammars and presentation code (VoiceXML or SALT) required to collect a particular type of data item, such as a date, dollar amount or credit card number.
Unfortunately, dialog components are usually too atomic--they process a single question and answer containing a single data item. To implement an application of any sophistication, the developer has to manually write new components to handle more complex responses (such as user utterances that contain multiple pieces of information), as well as code the logic for any "off-topic" response; that is, a response that does not directly answer the question posed. For example, consider the following conversation whereby a caller attempts to reconfirm his or her flight details with a human agent:
Agent: Do you have your confirmation number?
Caller: Um, no, but I'm flying out of Dallas on Friday.
[The caller does not provide the confirmation number as requested, but rather gives some details about the flight.]
Agent: OK, departing from Dallas. Are you leaving on Friday, January 28th or Friday, February 4th?
[The agent passively confirms the recognized departure airport and then attempts to clarify the actual departure date.]
Caller: I think my wife made the reservation for the fourth.
Agent: OK, Friday, February 4th. And around what time is the flight?
[The agent realizes that the date alone is not sufficient to retrieve the reservation and asks for the approximate time.]
Caller: 10:30
Agent: Is that a.m. or p.m.?
[The caller response is incomplete, so the agent asks a follow-up question.]
The above example illustrates that user responses in a speech application are much more varied and less structured than in a visual application. Callers may respond in many different ways due to differences in their objectives, the information they have at hand, their level of understanding and their interaction style. To achieve high usability, a speech application must be able to guide callers toward a desired outcome while allowing them latitude in their responses, including the following:
* Callers may provide information in an arbitrary order of their own choosing;
* Callers may use superfluous words in their responses;
* Callers may provide multiple pieces of information in a single spoken utterance;
* Callers may provide--in a single utterance--only a subset of information requested by the application;
* Callers may clarify or correct the application's interpretation of information they have provided; and
* Callers may modify earlier responses in subsequent utterances.
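Several of the behaviors listed above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production recognizer: the vocabularies, slot names and function names are hypothetical, and simple word matching stands in for a real speech grammar. It ignores superfluous words, accepts several data items in one utterance in any order, and lets a later utterance overwrite an earlier answer.

```python
import re

# Hypothetical vocabularies; a real system would use recognition grammars.
CITIES = ("dallas", "boston", "chicago")
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")


def interpret(utterance: str) -> dict:
    """Extract whichever itinerary slots the caller happened to mention."""
    slots = {}
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word in CITIES:
            slots["departure_city"] = word.title()
        elif word in WEEKDAYS:
            slots["departure_day"] = word.title()
        # Any other word ("um", "no", "but", ...) is treated as filler.
    return slots


def merge(state: dict, new_slots: dict) -> dict:
    """Combine earlier answers with new ones; newer values replace older ones."""
    updated = dict(state)
    updated.update(new_slots)
    return updated
```

Applied to the caller's reply in the dialog above, interpret("Um, no, but I'm flying out of Dallas on Friday.") yields both the departure city and the departure day, even though the question asked was about a confirmation number.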
Speech applications present a new user interaction model--one significantly distinct from the graphical user interface (GUI) model well known to all computer users. A voice user interface (VUI) requires specialized design and implementation expertise. An effective interface is critical for success in any speech application and call center system. Inexperienced callers must find the VUI intuitive. The VUI should employ natural and flexible strategies to accept information and to guide callers along the call. It should collect information in a fast and efficient manner by avoiding repetitive or lengthy prompts.
For any customer service call, there might be a straightforward path the developer hopes callers will take. In reality, callers take a multitude of different paths, because they have different goals, different information at hand, different levels of comprehension, or different interaction styles. At each point in the conversation, the caller may answer the current question, or may stray from the direct path by reviewing previous responses, starting another train of thought or jumping to another part of the application. As a result, the richer the desired user experience, the more paths the developer must provide.
Current development tools facilitate the construction of a call path, but still require each path to be manually designed and configured. This approach is not practical for anything more than the simplest interactions, as the number of paths quickly becomes unmanageable. Furthermore, to improve usability, the developer must add, alter or remove paths by hand, which is untenable from a maintenance perspective.
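To give a sense of how quickly the paths multiply, consider a simplified counting model. It is an illustrative assumption, not a figure from any study: the caller supplies each of n data items exactly once, in any order, with one or more items per utterance. The number of distinct paths is then the number of ordered groupings of n items, which the hypothetical helper below computes.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def ordered_groupings(n: int) -> int:
    """Count the ways n items can be delivered over successive utterances.

    Choose which k of the n items arrive in the first utterance, then
    count the ways to deliver the remaining n - k items.
    """
    if n == 0:
        return 1
    return sum(comb(n, k) * ordered_groupings(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for items in range(1, 7):
        print(items, ordered_groupings(items))
```

Under this model, three data items already allow 13 paths, four allow 75, and six allow 4,683. Corrections, clarifications and off-topic responses are not counted and would add more.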
Changing The Equation: A New Approach To Speech Development
A better approach is to drive application development at the conversation level, which shields the programmer from the complexity of designing and implementing every possible call path. In this approach, the development tool would provide a set of services that model the conversation skills commonly encountered in customer service calls, and would construct the call paths accordingly.
One example of a conversation skill is disambiguation, which is the act of determining a single interpretation among two or more plausible interpretations derived from the caller's response. Using current tools, disambiguation would be manually implemented by inserting after each existing dialog an additional dialog that asks the caller to select one value among a set of ambiguous results. By contrast, a tool that understands the concept of disambiguation could automatically generate the disambiguation call path whenever multiple interpretations arise. A more complex conversation skill is goal-seeking behavior, the ability to process the caller's response in the context of the objectives of the conversation. In the previous flight reconfirmation example, this skill allowed the agent to understand the caller's departure airport and date even though the question asked was actually a request for a confirmation number. A development tool that is aware of goal-seeking behavior could automatically construct the numerous possible call paths when preconfigured with an objective, such as obtaining a flight itinerary.
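The two skills described above might be sketched as follows. All names here are hypothetical, and the logic is far simpler than a real dialog manager. The next question is chosen from the state of the conversation and a preconfigured objective rather than from a hand-drawn path, and a disambiguation question is generated whenever a slot holds more than one plausible value. The reference date of Monday, January 24, 2005 used in the usage note is an assumption chosen so that the candidate Fridays match the example dialog; the article does not state a year.

```python
from datetime import date, timedelta

# Hypothetical objective: the slots needed to retrieve a flight itinerary.
GOAL_SLOTS = ("departure_city", "departure_date", "departure_time")


def upcoming_weekday(today: date, weekday: int, count: int = 2) -> list:
    """Return the next `count` dates on or after `today` that fall on
    `weekday` (Monday is 0, Friday is 4)."""
    first = today + timedelta(days=(weekday - today.weekday()) % 7)
    return [first + timedelta(weeks=i) for i in range(count)]


def describe(value) -> str:
    """Render a slot value for a spoken prompt."""
    if isinstance(value, date):
        return f"{value:%A, %B} {value.day}"
    return str(value)


def next_prompt(state: dict) -> str:
    """Choose the next question from the goal and the current state."""
    for slot in GOAL_SLOTS:
        value = state.get(slot)
        label = slot.replace("_", " ")
        if value is None:
            return f"What is your {label}?"       # goal-seeking: ask what is missing
        if isinstance(value, list):               # several plausible interpretations
            options = " or ".join(describe(v) for v in value)
            return f"Did you mean {options}?"     # generated disambiguation
    return "Thank you, let me look up your reservation."
```

With the city known and "Friday" expanded by upcoming_weekday(date(2005, 1, 24), 4) into two candidate dates, next_prompt produces a disambiguation question. Once the caller picks one date, the same function moves on to the departure time without any additional path being authored.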
By recognizing and codifying these and many other common conversation skills, a speech development tool would allow developers to implement rich and natural conversations with minimal effort. This approach achieves great savings in development cost and complexity for demanding customer-facing systems.
Open standards such as VoiceXML and SALT are necessary components for the mainstream adoption of speech automation systems. These standards offer a Web-based development model that is already familiar to IT organizations. However, they are not sufficient. Current speech development tools still leave too much of the hard work to the developer: conversation skills and other elements of the voice interface paradigm, such as goal-seeking behavior, flexible recognition, navigation, clarification and correction, must be reinvented and implemented for every speech system. Given the relative newness of the speech paradigm, these requirements can prove overwhelming to the developer. Speech tools and platforms will have to better facilitate the implementation of high-usability capabilities before enterprises can consistently deliver high-quality customer service through their speech systems.
by Patrick Nguyen
Voxify
Patrick Nguyen is the chief technology officer of Voxify, which creates automated agents with the ability to handle advanced customer service calls for call centers. He began his software development career at Australia's Telstra Research Labs. Patrick has also worked for McKinsey & Company, and he has an MBA from MIT's Sloan School and a B.S. in Electrical Engineering from the University of Melbourne.