Tuesday, March 19, 2013

Voice Recognition: A Status Report

Siri, Dragon, Samsung, Microsoft Windows, … , the list goes on and on. Nearly everyone in the computer device game recognizes that as the hardware gets smaller and smaller (and onscreen keyboards shrink proportionately) more reliable voice recognition will be a game changer. It has to be done: In the near future, smart phones, or a part of them anyway, will shrink again and be strapped to your wrist à la Dick Tracy. Google Glass, in which the display is projected on your glasses, is perhaps the ultimate example of shrinking displays, at least until contact lenses become computer displays.

In a quite different domain, coming as soon as the next decade or so, many new kinds of devices will call out for voice recognition—perhaps literally. Think of your automobile, your dishwasher, the lights in your home, things like that.

But the bottom line today is that most voice recognition is crappy. Even Microsoft's own speech recognition engineers admit its failings. Speech recognition at Microsoft has in fact been relegated to the Ease of Access section of the Windows Control Panel, downplaying it and implying, without quite saying so, that it is aimed at people unable to use a keyboard. Yet Willi, who is not at all handicapped (unless one chats with his wife), has dictated literally hundreds of thousands of words, in some cases complete books. He hasn't used a keyboard for years. He is dictating this blog post using Microsoft's speech recognition, which is considerably better than you might expect, both for giving commands and for dictating words, though it still has a way to go. He has also used both Siri and Dragon to some extent (enough to know that they're not much good, yet).

As to speed, there's no question in Willi's mind that he can dictate faster, using Microsoft's technology, than the fastest typist, assuming there is a prepared text for both, and that includes corrections. But that's not really the point; when dictating without a prepared text, one can only speak as fast as one can think, and for most of us that's not very fast. Another factor in speech dictation that is often overlooked is that what one dictates is often more natural and smooth than what one types on a keyboard. This can be important for writers.

To get one's head around the current status of speech recognition it is important to understand a few basics:

First, the distinction between speaking commands and dictation. Commands are far easier to recognize for the simple reason that there are a limited number of them from which to pick, so that when you speak, it is a relatively simple task to determine which you mean; speech recognition is a statistical game. It is a pretty simple job to discriminate between "Lights on" and "Lights off".
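To make the "limited number from which to pick" idea concrete, here is a minimal sketch in Python. It is a toy under loud assumptions: a real recognizer scores audio, not text, and the command list and the best_command helper are invented for illustration. It only shows why a small closed set makes the choice easy: even a garbled guess lands closest to one command.

```python
from difflib import SequenceMatcher

# Hypothetical command set, made up for this illustration.
COMMANDS = ["lights on", "lights off", "open door", "close door"]

def best_command(heard, commands=COMMANDS):
    """Return (command, score) for the command most similar to the heard text.

    The score is difflib's similarity ratio between 0.0 and 1.0. It stands in
    for the statistical score a real recognizer would compute from audio.
    """
    scored = [
        (SequenceMatcher(None, heard.lower(), command).ratio(), command)
        for command in commands
    ]
    score, command = max(scored)
    return command, score

if __name__ == "__main__":
    # A slightly garbled guess still falls closest to one command in the set.
    print(best_command("lites off"))
```

With only four candidates, a rough match is enough. With every word in the language as a candidate, the same rough match would have many near neighbors to confuse it with.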

Dictation is a far more difficult animal because when you say something the realm of possibilities includes every word in the language you're using (and yes, you can use languages other than English; it's a global world now in the computer business). In the English language there are, give or take, 200,000 words. There are more, really, if one counts medical terms and other scientific esoterica. And then, with dictation there's the added difficulty of distinguishing what you mean when you say a homophone such as "to", which could be interpreted as "to", "too", or "two", depending on the context. So you see this is not a simple proposition. Yet surprising progress has been made in this direction.
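The "depending on the context" part can be sketched in a few lines of Python. This is a toy, not how any particular product does it: the sample text is made up, the function names are invented for illustration, and real systems learn from vastly more text. The idea is simply to count which spelling has been seen next to the neighboring words and pick the most common one.

```python
from collections import Counter

# Tiny made-up sample text, purely for illustration.
SAMPLE = (
    "i want to go home . i have two dogs . that is too much . "
    "she went to work . we ate two apples . it is too late ."
)

# The spellings that all sound alike.
HOMOPHONES = ["to", "too", "two"]

def bigram_counts(text):
    """Count how often each pair of adjacent words occurs in the text."""
    words = text.split()
    return Counter(zip(words, words[1:]))

def pick_spelling(prev_word, next_word, counts, candidates=HOMOPHONES):
    """Choose the spelling seen most often between prev_word and next_word.

    On a tie (including no evidence at all) the first candidate wins.
    """
    def score(word):
        return counts[(prev_word, word)] + counts[(word, next_word)]
    return max(candidates, key=score)

if __name__ == "__main__":
    counts = bigram_counts(SAMPLE)
    print(pick_spelling("have", "dogs", counts))
    print(pick_spelling("is", "late", counts))
```

The neighbors do the work here: "have ... dogs" has only ever been seen around "two", and "is ... late" only around "too". When the neighbors have never been seen, this sketch has nothing to go on, which is one reason dictation needs so much more data than commands.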

The next consideration is ambient noise: speech in a quiet room is vastly easier to recognize than speech in a work cubicle, an automobile, on the street, or in a bar (in roughly that order of difficulty, unless you inhabit quiet bars, as does Willi when he is able).

Finally, the quality of the microphone that is trying to listen to what you're saying is often important, depending upon the other three considerations listed above. Willi, at this moment, is in a quiet room and is being listened to by the simple but profound little microphone built into the edge of his computer, a Microsoft Surface Pro. Yes, there are noise-canceling microphones; Willi has tried them; they don't help a lot.

These are the basic parameters with which to evaluate the practicalities of speech recognition. To sum up the current status, one should take away the thought that more work is needed on the software, which is a complicated matter in itself, and even more work is needed on some sort of noise-filtering hardware. Ambient noise remains the single biggest roadblock to practical, everyday use of this technology.

One does not always have a quiet room to work in. And it is when one is "on the go" that one most needs this technology. Willi thought about a throat mic, as was used in airplanes in WWII, where the noise from the engines was overwhelming. Perhaps a new such device could be made as cool-looking throat ornamentation: a leather choker encrusted with fake diamonds. It would take sound directly from your larynx. Never underestimate the benefits of fashion to sell an idea.
