The goal of this project is to implement a text input method combining speech recognition with multitouch gestures.
Motivation
The major problem with speech recognition is that it makes errors. Consequently, most of the engineering that goes into automatic speech recognition (ASR) systems revolves around techniques for minimizing the error rate.
Many of the errors made by ASR systems fall into two broad classes:
ModelingErrors: the correct transcription is actually known to the system, but because the acoustic or language models used are not optimal, an incorrect transcription is assigned a higher score.
SearchErrors: the correct transcription would be assigned a higher score than the system's best hypothesis, but because of the inexact nature of the search algorithm, the system is not able to find it.
For this reason, many state-of-the-art ASR systems work in multiple passes. The first pass uses an inexact model to generate a word lattice, which is a compact representation of the entire space of sentences considered by the system. A later pass then performs a more exact search over this lattice, using a more powerful but less efficient language model.
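As a rough illustration of the multi-pass idea (this is not the project's code), the Python sketch below builds a tiny word lattice, picks the best path using first-pass scores alone, and then picks again after adding a score from a second model. The edge scores, the `toy_lm` function and all helper names are invented for the example. Paths are enumerated exhaustively only to keep the sketch short; a real decoder would use dynamic programming over the lattice instead.

```python
from collections import defaultdict

# Toy lattice: (from_node, to_node, word, first-pass log score).
# All values are made up for illustration.
EDGES = [
    (0, 1, "recognize", -1.0),
    (0, 1, "wreck a nice", -0.8),
    (1, 2, "speech", -0.7),
    (1, 2, "beach", -0.6),
]

def all_paths(edges, start, end):
    """Yield (words, total_score) for every path from start to end."""
    outgoing = defaultdict(list)
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))
    stack = [(start, [], 0.0)]
    while stack:
        node, words, total = stack.pop()
        if node == end:
            yield words, total
            continue
        for dst, word, score in outgoing[node]:
            stack.append((dst, words + [word], total + score))

def toy_lm(words):
    """Hypothetical second-pass model: a log-score bonus for one known phrase."""
    return 1.0 if words == ["recognize", "speech"] else 0.0

def first_pass_best(edges, start, end):
    """Best path using only the first-pass scores stored on the edges."""
    return max(all_paths(edges, start, end), key=lambda p: p[1])[0]

def second_pass_best(edges, start, end, lm):
    """Best path after adding the second model's score to each full path."""
    return max(all_paths(edges, start, end), key=lambda p: p[1] + lm(p[0]))[0]
```

With these made-up scores, the first pass prefers "wreck a nice beach", while the second pass recovers "recognize speech" from the same lattice. The correct sentence was in the lattice all along, which is the situation the manual "third pass" described below is meant to exploit.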
This project implements an easy and intuitive way for humans to perform their own "third pass search" over a word lattice, using either a touchscreen or a pointing device. This involves using gestures to "pull apart" incorrect sections of a sentence and choose alternative transcriptions. The interaction technique is as follows:
- Dictate a sentence - the single best recognition result is displayed on-screen (probably in a standard window though it could be overlaid on the desktop)
- "Pull out" a section of the result, either by dragging fingers apart (iPhone-style) or by dragging a single pointer away from the centerline. The best result splits into a "cloud" of words around the point where the drag action initiated. A double-click will also expand a region to the maximum possible size.
- Click on the sequence of words you wish to choose instead of the original. These will change color to indicate that they have been selected as the new best transcription.
- Pull the cloud back together, either by pulling a pair of fingers together or by dragging towards the centerline, and the "cloud" collapses to the newly selected recognition result. (Double-clicking in a region will also collapse it.)
- To merge adjacent results, drag horizontally. This also allows deletion of words, by expanding the cloud into unwanted words and selecting only the desired words.
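The expand, select and collapse steps above can be sketched as a small state model. This is only a simplified illustration under a strong assumption: the lattice has been flattened into a sequence of slots, each holding a list of candidate words with the best one first. The `Transcript` class and its method names are hypothetical and do not come from the project's code, and merging and deletion are left out.

```python
class Transcript:
    """Best hypothesis plus per-slot alternatives (a simplified view of a lattice)."""

    def __init__(self, slots):
        # slots: list of lists of candidate words, best candidate first.
        self.slots = slots
        self.chosen = [0] * len(slots)  # index of the selected word in each slot
        self.expanded = None            # (first, last) slot range shown as a cloud

    def text(self):
        """Return the currently selected transcription."""
        return " ".join(words[i] for words, i in zip(self.slots, self.chosen))

    def expand(self, first, last):
        """'Pull apart' slots first..last and return the cloud of alternatives."""
        self.expanded = (first, last)
        return [list(self.slots[k]) for k in range(first, last + 1)]

    def select(self, slot, word):
        """Click a word in the cloud; only valid inside the expanded region."""
        if self.expanded is None or not self.expanded[0] <= slot <= self.expanded[1]:
            raise ValueError("slot is not inside an expanded region")
        self.chosen[slot] = self.slots[slot].index(word)

    def collapse(self):
        """Pull the cloud back together and return the new best transcription."""
        self.expanded = None
        return self.text()
```

For example, with `t = Transcript([["the"], ["cat", "cap", "cut"], ["sat", "sad"]])`, calling `t.expand(1, 2)`, then `t.select(1, "cap")`, then `t.collapse()` changes the displayed text from "the cat sat" to "the cap sat".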
Status
Where are the screenshots?
Here is a video capture of some very basic interaction: attachment:project4_1.ogg (Ogg Theora video). A somewhat larger video, including more sophisticated actions, can be seen at http://www.cs.cmu.edu/~dhuggins/Projects/project4_2.ogg (also Ogg Theora).
To view these videos on Mac OS X (or Windows with QuickTime), get XiphQt at http://xiph.org/quicktime/download.html. On Windows, get DirectShow filters at http://www.illiminable.com/ogg/downloads.html#stable.
Where is the code?
Browse the code at: http://lima.lti.cs.cmu.edu/cgi-bin/viewvc.cgi/project4/?root=svn
Download a tarball of the current revision: http://lima.lti.cs.cmu.edu/cgi-bin/viewvc.cgi/project4.tar.gz?root=svn&view=tar
API documentation generated by epydoc is at http://www.cs.cmu.edu/~dhuggins/Projects/project4-api/
What is and is not done?
Currently this exists in the form of a prototype running on a Linux laptop, possibly using [http://wearables.unisa.edu.au/mpx/ Multi-Pointer X]. In the very near future it will be ported to Nokia InternetTablets running OS2008 (it requires Cairo, so OS2007 will not be supported).
Stuff that is done
- Word lattice input and bestpath search
- Word posterior probability calculation
- Vertical expansion of words
- Horizontal merging of words
- Correction of words
- Collapsing clouds of words
- Multitouch event handling using MPX
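For readers unfamiliar with the word posterior calculation listed above, here is a minimal sketch of the standard forward-backward computation on a lattice. It is not taken from the project's code. It assumes edges of the form `(src, dst, word, log_score)` with nodes numbered in topological order, and it works in the linear domain for readability, where a real implementation would normally stay in the log domain to avoid underflow. The function name `edge_posteriors` is hypothetical.

```python
import math
from collections import defaultdict

def edge_posteriors(edges, start, end):
    """Return [(word, posterior)] for each edge, in the order given.

    edges: list of (src, dst, word, log_score); every edge has src < dst,
    i.e. nodes are assumed to be numbered in topological order.
    """
    alpha = defaultdict(float)  # total score of all partial paths start -> node
    beta = defaultdict(float)   # total score of all partial paths node -> end
    alpha[start] = 1.0
    beta[end] = 1.0

    # Forward pass: visit edges in order of increasing source node.
    for src, dst, _, logp in sorted(edges, key=lambda e: e[0]):
        alpha[dst] += alpha[src] * math.exp(logp)

    # Backward pass: visit edges in order of decreasing destination node.
    for src, dst, _, logp in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] += math.exp(logp) * beta[dst]

    total = alpha[end]  # summed score of all complete paths
    return [(word, alpha[src] * math.exp(logp) * beta[dst] / total)
            for src, dst, word, logp in edges]
```

The posterior of an edge is the share of the total path score that passes through that edge, so words that appear on many high-scoring paths get values close to 1.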
Stuff that is not done
- Live audio input
  - Requires a bit of work on ["GStreamerSphinx"]
  - Lattices are probably going to be saved to disk for now
- Efficient language modeling code
  - Currently it's using a pure-Python module (sphinx.arpalm) which is very slow to load
- Multitouch interaction
- Simulated multitouch on touchscreens
- Glue necessary to actually use this as an input method
References
This is a variation on the original technique for human-directed lattice search implemented in the Speech Dasher project by Keith Vertanen (http://www.inference.phy.cam.ac.uk/kv227/speechdasher/).