Trends: Gesture- and Speech-Recognition Software

By: MARK MERRITT
Publication: Presentations
Date: Sunday, October 1, 2000

Imagine creating an entire speech by merely reading it to your computer. Or changing PowerPoint slides with a wave of your hand instead of the click of a remote. Or creating an entirely interactive presentation, complete with 3D virtual "objects" that you can smell. This sort of technological wizardry is no longer the dreamy stuff of science fiction. The next generation of voice- and gesture-recognition technology is making it possible and could soon open a new frontier in the presentation world. You don't have to ditch your laptop and remote just yet, but it wouldn't hurt to start preparing for the day when you will.

When the original "Star Trek" TV series began airing in the 1960s, a businessperson called the show's producers to ask how the doors on the show opened by themselves, because he wanted to install the mechanism in his house. The producers told him their secret was probably too impractical for him to consider: They had two people backstage who pulled on the door panels to open and close them as needed. Disappointed but undaunted, the businessperson saw an opportunity and proceeded to invent a real-world way to do the same thing. Now, half the doors in the world open automatically, no one thinks twice about it and the person who made this science-fiction dream come true is no longer just a businessperson, but a rich businessperson.

Perhaps it is a sign of our limited imagination that so many machines we invent look and act like devices science-fiction writer Isaac Asimov described. We shouldn't be surprised then that a number of people in this world are trying to figure out how to do what science-fiction characters have been doing for decades: talk to computers and other machines, have them "listen" and have them do what we want, the way we want, without being so annoyingly literal about everything. It just so happens that many of the people who are trying to narrow this communication gap between people and their machines are doing it in ways that may soon benefit many professionals, including presenters and public speakers.

Speak and spell

A computer that can recognize and respond to the human voice -- otherwise known as a voice- or speech-recognition system -- has been a concept for decades, of course, and in some spheres such computers are already being used on a regular basis. Every time you dial for telephone information and the voice asks for your city and state, a voice-recognition system digitizes your reply and immediately displays it on an operator's screen. Many large companies use similar systems in their customer-service departments to screen and direct phone traffic. And even though most Macintosh users don't like or use it much, Mac's Speakable Items has been around since the early 1990s, allowing people to open and close applications using voice commands.

For many years, however, the Holy Grail of speech recognition was the accurate translation of speech to text. A disarmingly simple concept, speech-to-text translation has proven to be a technologically vexing endeavor. And, judging from the September issue of Macworld, it still is. In Macworld's Feedback section, a reader pokes fun at his Windows-based voice-recognition software thusly: "Eye am using a new ViaVoice my IBM speech program. It works quite well as you can see by the water. I think you for an Fiat's Stanton article that made by the Senate in." To which Macworld responds: "I'm happy two here that you are using ViaVoice sucks S fully. I myself use beach-recognition software for Windows, which is at least a generation a head of Max. Queerly, speech recognition is the technology of today and 2 Maura!"

And yesterday, for that matter. In 1984, IBM demonstrated a 5,000-word vocabulary-recognition system that claimed a 95-percent accuracy rate for dictation, but it never caught on. "You had to speak incredibly slowly, which was totally unacceptable, and nobody liked it at all," recalls David Barnes, IBM senior product manager and a ViaVoice evangelist.

But now it works -- really

Since then, faster computer processors, less-expensive memory and better microphones have made continuous speech-to-text translation possible, but not entirely free of snafus. IBM's dictation product, ViaVoice -- the one lampooned by the Macworld reader -- supposedly has a vocabulary of 64,000 words (the average U.S. citizen's vocabulary is about 5,000 words). For the program to work well, however, users have to spend considerable time "teaching" the program their own personal speech patterns and vocabulary quirks. Barnes, who has been an on-the-road presenter for more than 15 years and given nearly 2,000 presentations, regularly uses the software to dictate his presentations. He reports few problems. "I just kick back in my chair, look out the window, picture the audience and start talking," he says. Then again, not many people have the opportunity (or incentive) to use ViaVoice as frequently as Barnes.

According to Barnes, though, most people -- particularly professional speakers -- can expect to use the software with few problems. "The person is required to speak good English, period," says Barnes, although it helps to speak "good English" in calm, even tones. If you happen to speak with an accent, you may encounter a few extra challenges, but he claims ViaVoice is remarkably adept at understanding nonnative English speakers. Barnes says he was recently in Thailand to show the product and, although he didn't understand most of the English spoken there, ViaVoice did. How? In addition to matching vocabulary, ViaVoice listens for likely combinations of words and calculates the odds of certain words appearing next to each other.
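
The idea of calculating the odds of one word following another can be illustrated with a toy word-pair (bigram) model. The sketch below is not IBM's code and makes no claim about how ViaVoice works internally; the tiny training text, the function names and the add-one smoothing are all assumptions chosen for illustration. It simply counts how often each word follows another, then uses those counts to choose among words that sound alike.

```python
from collections import Counter

# Toy training text (an assumption); a real system would learn from a far
# larger body of text plus the user's own dictation.
CORPUS = (
    "i am happy to hear that you are using the program . "
    "i am happy to hear from you . "
    "we hear that you are here . "
    "there are two programs here ."
).split()

# How often each word appears, and how often each pair appears side by side.
unigram_counts = Counter(CORPUS)
bigram_counts = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB_SIZE = len(unigram_counts)


def bigram_probability(previous, word):
    """Estimate the odds of `word` following `previous` (add-one smoothing)."""
    pair_count = bigram_counts[(previous, word)]
    return (pair_count + 1) / (unigram_counts[previous] + VOCAB_SIZE)


def pick_word(previous, candidates):
    """Choose the sound-alike candidate most likely to follow `previous`."""
    return max(candidates, key=lambda w: bigram_probability(previous, w))
```

With this toy data, `pick_word("to", ["here", "hear"])` chooses "hear" and `pick_word("are", ["hear", "here"])` chooses "here," because those are the pairs the model has seen. A model trained on too little text, or on someone else's speech habits, will make exactly the kind of sound-alike mistakes the Macworld reader lampooned.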

Machines you can talk to

Today, these syntactical guesses may occasionally produce phrasings that are hilariously wrong. But that hasn't stopped IBM from taking the future of voice-to-computer interaction seriously. Although consumers won't see these devices for a while, IBM has already developed some products that are bringing the science-fiction future closer to the mainstream. Barnes himself already has a personal digital assistant (PDA) that is completely voice-controlled. When he asks what his next appointment is, it tells him in a computerized voice. The device also reads back memos and e-mail messages. And at this year's Olympics, IBM will be demonstrating a voice-controlled coffee machine that people can walk up to and say, "Double-shot decaf mocha. Skim milk," and out it will pour.

According to Barnes, within two years you will be able to use a cell phone to dictate a message and e-mail it. You will also be able to have a two-person meeting transcribed in real time with 100-percent accuracy. To test the viability of products for the consumer market, IBM has even built a mock house that is completely voice-activated. The engineers walk in and turn on the lights, television, everything, by speaking. And with the advent of Bluetooth technology, at some point everything will be able to connect to everything else, so the house wouldn't just respond to commands, but "think" for itself -- for instance, as you run out of coffee beans, your coffee maker and refrigerator could, in a sense, confer, then contact an online grocer to order more beans.

A cure for illegible handwriting

Another company trying to broaden the applications for voice-recognition software is IQMax in Charlotte, N.C. Founded last year, this new company has bet its livelihood on the potential popularity of voice-to-text software for PDAs, such as Handspring and Palm Pilot. IQMax is targeting its technology toward medical professionals who constantly need their dictated notes transcribed. Although the product is still in its beta phase and does not yet have a name, the idea is to develop a device into which a doctor can speak and immediately have her spoken notes translated to text. The product won't just save time or replace human transcribers with machines -- it could also save lives, says Allen Thomas, vice president of marketing and business development. "It's been calculated that there are close to 90,000 cases a year where a patient is injured, or even dies, because of a mix-up involving incorrect or illegible notes," he explains. Assuming doctors who don't mumble and software that doesn't hear "cyanide" when the prescription calls for Sudafed, theoretically at least some of these injuries and deaths could be avoided.

Next slide, please

Another company investing considerable money and effort into speech recognition is Lernout & Hauspie (L&H), Burlington, Mass. L&H's premier product is VoiceExpress, which transcribes text from talk, as other products do, but can also be used for command-and-control applications. This means the software must be able to differentiate between general speaking and direct commands. To distinguish one form of speech from another, the menu function includes 20 or 30 common commands for such programs as Microsoft Word or QuarkXPress. The user can designate use of those commands or can program his own commands.
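
One simple way to picture the split between dictation and commands is a lookup table: if an utterance exactly matches a known command phrase, run the command; otherwise, treat it as text to transcribe. The sketch below is a guess at the general shape of such a scheme, not a description of VoiceExpress. The command phrases, action names and the `interpret` function are all hypothetical.

```python
# Hypothetical command table: spoken phrase -> name of the action to run.
DEFAULT_COMMANDS = {
    "new paragraph": "insert_paragraph_break",
    "bold that": "apply_bold",
    "save document": "save_file",
}


def interpret(utterance, commands=DEFAULT_COMMANDS):
    """Classify one recognized utterance as a command or as dictated text."""
    # Normalize case, spacing and trailing punctuation before the lookup.
    phrase = " ".join(utterance.lower().split()).rstrip(".!?")
    if phrase in commands:
        return ("command", commands[phrase])
    # Anything that is not an exact command phrase is passed through as text.
    return ("dictation", utterance)


# Users can extend the table with their own commands, as the article notes.
my_commands = dict(DEFAULT_COMMANDS)
my_commands["next slide"] = "advance_slide"
```

Here `interpret("Save document.")` is treated as a command, while `interpret("Please save the document before lunch")` is passed through as dictation. Exact matching keeps ordinary sentences from triggering commands by accident, at the cost of making the user remember the precise phrases.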

"Although people still see [speech-recognition technology] as a fantasy, a lot of this is here now and can help now," says Hank Pokigo, L&H senior product manager. According to Pokigo, VoiceExpress contains commands for the entire Microsoft Office suite, although the range of the commands is limited. For instance, you can dictate a PowerPoint presentation using VoiceExpress and expect the text to be transcribed relatively accurately. But if you say, "Bullet point. Insert text. Swipe right. Insert animation. Use cash-register sound effect," the result won't necessarily be quite what you envision, although improvements to the command portion of the program are being made, Pokigo says.

Gesture recognition: The next frontier

Presenters will likely use speech-recognition capabilities in the future to create presentations or have notes quickly transcribed, but burgeoning gesture-recognition technology may eventually change the way people deliver presentations altogether.

Whereas speech recognition requires you to use your voice, gesture-recognition technology responds to body motions, interpreting hand gestures and other movement as programmable information. For some time, animation specialists and designers have been using the attached-bodysuit technique to make animated characters look as though they are moving naturally within virtual worlds. The disadvantage of this technique is that the technology is expensive, cumbersome and extremely complex. The next generation of gesture-recognition technology may change that by enabling people to interact with digital environments in ways that are far more practical and economical.

Goodbye, remote

Picture yourself sitting in an auditorium, 10 feet from the stage, watching presentation visuals shift from a slide to animation, back to a slide, then to a video clip -- but the presenter doesn't have a remote in his hand and no one is backstage advancing the program. The transitions are seamless; the program seems to be running itself. Sound like something out of "Futurama"? Well, it is, sort of. But new gesture-recognition technologies could make it possible for you to run a wireless, hands-off presentation like this sooner than you may think.

Cybernet in Ann Arbor, Mich., has been doing research and development in virtual-simulation technology for more than a decade, primarily for such government agencies as NASA and the U.S. Army. Through its efforts to create a more intimate and seamless virtual-combat experience, the company has serendipitously created a gesture-recognition device that translates hand gestures into PowerPoint commands.

Who's changing the slides?

For a long time, combat simulators required the participant to be "tagged" -- that is, connected to a program through helmets and bodysuits. Cybernet's latest research has focused on getting rid of the suit to make virtual worlds more realistic. That means creating software that can "read" a person's body movements and sync them with his virtual surroundings. A by-product of this research is the development of an interface that enables the user to control a PowerPoint presentation wirelessly with simple hand gestures.

"You can make a standard 'come here' gesture to advance a slide or a 'move away' motion to retract it," explains Greg Emery, Cybernet senior vice president. In fact, you can run an entire presentation without touching anything -- no remotes, no buttons, nothing.

"It's just software and a [tracking] camera, so there's not much of a limit to where this can be used," says Charles Cohen, vice president of research and development. Cohen usually uses the program when he gives his own talks, and audience members tend to react with a mixture of confusion and curiosity. "It usually takes everyone a few minutes before they ask who the hell is changing the slides," he says.

As with IQMax, Cybernet hasn't yet put a trade name on its technology, because the company normally doesn't develop and launch products on its own -- it licenses its products to others. The company is looking for a partner to help develop and distribute a marketable PowerPoint gesture-recognition product and would eventually like to see it installed as a permanent feature of all presentation packages.

Nokia's pod-phone

Aiming to create a much broader audience for gesture-recognition technology are the self-proclaimed "experience engineers" at Digital Tech Frontier (DTF) in Tempe, Ariz. The DTF folks envision a world in which people read holographic digital newspapers and shop at virtual grocery stores where people can see -- and even touch -- the merchandise. For now, however, the company must content itself with devising creative virtual interfaces for companies that are willing to pay for the service, such as cell phone giant Nokia Communications.

Nokia recently wanted to display its newest product, the Nokia 7100, to the European market, and it wanted to use cutting-edge technology to complement what it considered to be a cutting-edge phone. This led DTF to create what it calls a gesture-recognition interface (GRI) pod -- think of it as an extremely interactive kiosk.

Smells like ... opportunity

The GRI pod is a large, cocoon-shaped booth, with an opening on each side. The viewer leans back against a support and faces a built-in monitor. For the Nokia demonstration, a virtual Nokia phone bounced across the monitor; the viewer had to grab the phone before the demonstration would continue. "We designed [the pod] so [Nokia executives] would understand that gesture recognition won't do anything if you just sit there," explains Scott Jochim, DTF creative director. By the same token, he says, "They can interact with virtual objects in the real world, with no helmets, no gloves or suits. But just like reality, you have to turn on the phone before anything happens." When Nokia execs did grab the phone, the presentation started. DTF added "rumbles" in the seat to coincide with the presentation's sound and, for an extra dose of reality, olfactory sensors worked in tandem with the DVD presentation to create the smell of things onscreen. At one point, the screen showed leaves cascading from a tree, and the pod-sitter could smell fresh-cut flowers.

Fun for the whole company

The presentation was so successful that DTF and others are thinking about other ways they can use this technology for marketing or demonstration purposes.

"[The Nokia people] were stunned. Many executives compared it to the 'Back to the Future' ride at Universal Studios," says Jochim, laughing. "We were giving a presentation -- nothing by any means entertaining -- but comparing it to a multimillion-dollar ride showed that they were given information they remembered, while having fun," he says.

Such visually elaborate technology may seem a bit extravagant to advertise some phones, but Jochim stresses that using this method can add fun to a standard business meeting and is not as expensive as you might think. The actual motion-recognition technology costs about $500; the rest is creative labor.

Shooting fireballs? No problem

As for the technology's applicability in the rest of the professional world, Jochim says it wouldn't be difficult to jazz up a regular podium presentation using DTF's technology. For instance, a presenter could easily be chroma-keyed onto a screen. Then video could be run across the screen, so the person onstage could look as if she is floating around onscreen, pointing things out, much the way weatherpeople do on television. "It would definitely make presentations more dynamic and entertaining," says Jochim. "Let's say I wanted to raise my hand and have fireballs shoot out to some key points on screen. That would be easy."

%$#!@&*#!

Practically speaking, only a handful of viable applications exist for voice- and gesture-recognition at the moment. But the technology is certain to evolve and, given the direction of current research and development, it isn't difficult to predict that presenters and companies hungry for something different will soon discover and embrace the possibilities. Indeed, voice- and gesture-recognition technology is simply the next level of evolution in our never-ending quest to erase the barriers between people and their machines. Many of these technologies are still in their infancy and, as they mature, science-fiction writers are going to have to think long and hard about what might replace them. For now, though, whenever people interact with technology, there are still some choice words and gestures that it's just as well our computers don't understand.