Release Roadmap
PocketSphinx 0.5.1, SphinxBase 0.4.1
To be released soon (September 2008)
This release cleans up a lot of problems with the official 0.5 release. A partial list follows:
- Make configuring with --without-python work properly
- Add accessors to get the fe_t and feat_t structures, which are useful
- Fix various bugs in FSG mode and elsewhere
- Bring back -logfn and argument files
- Bring back -mfclogdir and -rawlogdir (still TODO)
PocketSphinx 0.5, SphinxBase 0.4
Released July 8, 2008.
This version is binary and source incompatible with previous versions. Memory efficiency and thread safety are major goals.
Library SONAMEs are libpocketsphinx.so.1 and libsphinxbase.so.1 as various API calls and data have been removed.
- Include GStreamer support (DONE)
- 16-bit fixed-point DSP support (DONE)
- Language model code in SphinxBase (DONE)
- Improved Python integration (DONE)
- Continuous density models supported (DONE)
- SubVQ will be removed for the time being, until it can be unified with SCHMM computation (DONE)
- Reduced code size (DONE and DONE! 800k -> 380k)
- Reduced memory footprint (DONE)
- No performance regressions (DONE)
- Various API breakage
- Depublicize all sorts of internal functionality that is subject to change (DONE)
- Re-entrant decoding API for PocketSphinx (DONE)
- Posterior probability based confidence scoring (DONE, but not extensively tested)
- Regression tests:
  - WSJ5k Nov '92 test (bigram):
    - i386-linux (batman.speech.cs.cmu.edu, Pentium4, 3.0GHz): 2-6% faster
    - amd64-linux (redwood.speech.cs.cmu.edu, Opteron 852, 2.6GHz): 3-8% faster
    - armel-linux (Nokia N800, TI OMAP, 400MHz): 14-18% faster
    - bfin-uclinux (blackwing, STAMP BF533, 500MHz): 0-3% faster
    - Win32 (lima.lti.cs.cmu.edu, VMWare, Windows XP): 3% slower (possibly due to use of DLL rather than static library?)
- New default language model based on retrained WSJ data (DONE)
PocketSphinx 0.4, SphinxBase 0.3, Sphinx3 0.7
PocketSphinx 0.4 and SphinxBase 0.3 were released on August 16th, 2007. Library SONAMEs are libpocketsphinx.so.0 and libsphinxutil.so.0.
A partial list of features/requirements follows:
- (mostly) unified HMM implementation with Sphinx3 (DONE)
- Support feat.params files in acoustic models (DONE)
- WSJ5k Nov '92 test (bigram):
  - i386-linux (lima.lti.cs.cmu.edu, Pentium4, 3.0GHz) fast: 9.71% WER, 0.11 xRT
  - amd64-linux (redwood.speech.cs.cmu.edu, Opteron 852, 2.6GHz) fast: 9.71% WER, 0.07 xRT
  - arm-linux (iPaq 3670, StrongARM SA1100, 206MHz) fast: 9.69% WER, 2.02 xRT
  - armel-linux (Nokia N800, TI OMAP, 300MHz) fast: 9.69% WER, 1.59 xRT
  - bfin-uclinux (blackwing, STAMP BF533, 500MHz) fast: 9.69% WER, 1.31 xRT
  - powerpc-darwin (Powerbook G4, PowerPC 7400, 1.0GHz) fast: 9.71% WER, 0.29 xRT
  - sparc-solaris (mangueira.speech.cs.cmu.edu, UltraSparc III, 750MHz) fast: 9.71% WER, 0.25 xRT
  - Win32 (lima.lti.cs.cmu.edu, VMWare, Windows XP) fast: 9.71% WER, 0.10 xRT
- Should compile and run on WinCE using eVC++ 3 (DONE)
  - Performance is something of a lost cause because WinCE sucks, i.e. someone else can do it
- Included acoustic models:
  - TIDIGITS (8kHz audio) (DONE)
  - WSJ SI-284 (8kHz audio) (DONE)
Internals
Memory Usage
Memory usage is currently a lot higher than it needs to be, for two reasons that we can address. First, the model data structures, which should be read-only (and therefore memory-mapped and shareable between processes), are not, because some precomputation has to be done on them. Second, a lot of very large search-related data structures are preallocated, along with huge arrays for acoustic features.
The latter is a bit harder to fix since we really do want to allocate things like the backpointer table in one big chunk of memory, to avoid the overhead associated with malloc() on a zillion small objects (this is a much worse problem on WinCE with its broken standard library).
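The "one big chunk" idea can be sketched as follows. This is only an illustration, not the actual PocketSphinx backpointer table: the type names (bp_entry_t, bp_table_t), the field layout, and the bp_table_* functions are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical backpointer entry; the fields are illustrative only. */
typedef struct {
    int32_t frame;   /* frame in which the word ended */
    int32_t word_id; /* word identifier */
    int32_t score;   /* path score at word exit */
    int32_t prev;    /* index of the predecessor entry, or -1 */
} bp_entry_t;

/* Hypothetical table: all entries live in one contiguous block. */
typedef struct {
    bp_entry_t *entries;
    size_t n_used;
    size_t n_alloc;
} bp_table_t;

/* Allocate the whole table with a single malloc(). Returns 0 on success. */
int bp_table_init(bp_table_t *t, size_t n_alloc)
{
    assert(t != NULL && n_alloc > 0);
    t->entries = malloc(n_alloc * sizeof(*t->entries));
    t->n_used = 0;
    t->n_alloc = t->entries ? n_alloc : 0;
    return t->entries ? 0 : -1;
}

/* Append an entry without touching the allocator.
 * Returns the new entry's index, or -1 if the table is full. */
long bp_table_push(bp_table_t *t, int32_t frame, int32_t word_id,
                   int32_t score, int32_t prev)
{
    bp_entry_t *e;
    if (t->n_used == t->n_alloc)
        return -1;
    e = &t->entries[t->n_used];
    e->frame = frame;
    e->word_id = word_id;
    e->score = score;
    e->prev = prev;
    return (long)(t->n_used++);
}

/* Reuse the same block for the next utterance: no per-entry free(). */
void bp_table_reset(bp_table_t *t)
{
    t->n_used = 0;
}

/* Release the single block. */
void bp_table_free(bp_table_t *t)
{
    free(t->entries);
    t->entries = NULL;
    t->n_used = t->n_alloc = 0;
}
```

The trade-off is the one described above: the block has to be sized for the worst case up front, which is exactly what inflates the memory footprint.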
Precompiled model structures
The biggest part of the acoustic model, namely the mixture weights, is already read-only (in the form of the "sendump" file which SphinxTrain now knows how to generate, though the file format isn't great).
Precomputing the codebooks is not a great memory saver, since they are very small for semi-continuous and SubVQ models, but there is another reason we should do it. Currently the only part of fixed-point computation that really increases the error rate over floating-point is GMM computation. The reason is that we use a hard-coded radix point (16.16) for the mean and variance parameters and just cross our fingers that they won't exceed that range, so we lose a lot of precision in the calculations. If we precompute the codebooks, we will know the range of values ahead of time and can get an optimal quantization (although for speed we might just express it as a bit shift and bias term).
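As a rough sketch of choosing the radix point from the data rather than hard-coding 16.16, the code below picks the largest number of fractional bits for which every parameter still fits in a signed 32-bit integer. The function names (choose_frac_bits, to_fixed, from_fixed) are hypothetical and not part of SphinxBase, and the bias term mentioned above is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2^frac_bits as a double; frac_bits is expected to be in 0..31. */
static double scale_for(int frac_bits)
{
    return (double)((int64_t)1 << frac_bits);
}

/* Pick the largest number of fractional bits such that every value,
 * rounded to nearest, still fits in a signed 32-bit integer.
 * Values too large to fit even with 0 fractional bits are not handled. */
int choose_frac_bits(const float *vals, size_t n)
{
    double max_abs = 0.0;
    int frac_bits = 31;
    size_t i;

    for (i = 0; i < n; ++i) {
        double a = vals[i] < 0 ? -(double)vals[i] : (double)vals[i];
        if (a > max_abs)
            max_abs = a;
    }
    while (frac_bits > 0
           && max_abs * scale_for(frac_bits) + 0.5 >= 2147483648.0)
        --frac_bits;
    return frac_bits;
}

/* Convert to fixed point with the given radix, rounding to nearest. */
int32_t to_fixed(float x, int frac_bits)
{
    double scaled = (double)x * scale_for(frac_bits);
    assert(scaled < 2147483647.5 && scaled > -2147483648.5);
    return (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Convert back to floating point, e.g. to measure quantization error. */
double from_fixed(int32_t q, int frac_bits)
{
    return (double)q / scale_for(frac_bits);
}
```

For a parameter set whose largest magnitude is 3.5, this selects 29 fractional bits instead of 16, so small values that would round to zero in 16.16 keep 13 more bits of precision. At run time the chosen radix is just a shift count stored alongside the codebook.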
Parts of the language model are memory mapped and other parts aren't. This code is a terrible mess, leaks (a bit of) memory, and also needs to be merged with Sphinx3.
Loading the dictionary is the most time-consuming part of initialization and it is all heap-allocated. The dictionary itself is just a hash table and not a huge one (although we could precompile it with a perfect hash function - [http://cmph.sourceforge.net/index.html CMPH] could be used for this), but the "context tables" which determine the set of initial and final triphones for each word are very large and take a long time to build. They also should ideally be built with reference to the original decision trees for the acoustic model, so that they don't just back off to context-independent phones for unknown triphones.
While the structure of the lexicon tree is fixed for any given dictionary, it can't be read-only because it needs to reference all of the HMM structures. Building it from the dictionary is probably not much more time-consuming than reading it from a file, so this is a low priority.
Search Optimization
Algorithmically speaking, the first-pass search in PocketSphinx is about as fast as it can possibly be. Any optimizations to this component will have to be carried out at the level of HMM evaluation (which is already where PocketSphinx spends most of its time for moderately-sized vocabularies). This work is being carried on in the general framework of [:BaseHMM:Merging and Optimizing HMM implementations] between all Sphinx decoders.
In general decoding is highly dependent on memory bandwidth. The reason for this is that the combination of acoustic models and search graph is too large to fit in most processors' cache, and we usually end up touching every part of the model/HMM space in the course of large-vocabulary search. For this reason, instruction-level optimization of HMM evaluation isn't as useful as you might think.
There are four major processes associated with each frame of search, each of which consumes a roughly equal amount of time:
- Marking active output distributions (senones)
- Calculating output distributions (senone scores)
- Updating state scores and propagating histories inside HMMs
- Propagating histories across HMMs
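To make the third step concrete, here is a minimal sketch of a one-frame Viterbi update for a 3-state left-to-right HMM with integer log scores. It is a generic textbook update, not the PocketSphinx HMM code: the hmm3_t layout, the WORST_SCORE value, and hmm3_step are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_STATES 3
#define WORST_SCORE (-0x20000000) /* stands in for "inactive" */

/* Hypothetical 3-state left-to-right HMM. */
typedef struct {
    int32_t score[N_STATES];   /* best path score per state */
    int32_t hist[N_STATES];    /* history (backpointer) ID per state */
    int32_t senid[N_STATES];   /* senone ID per state */
    int32_t tp_self[N_STATES]; /* log prob of the self-loop */
    int32_t tp_next[N_STATES]; /* log prob of moving to the next state
                                  (last entry would be the exit, unused here) */
} hmm3_t;

/* Update all states for one frame given the senone scores for that
 * frame. States are visited last to first so that each update only
 * sees the previous frame's scores. Returns the best state score. */
int32_t hmm3_step(hmm3_t *h, const int32_t *senscr)
{
    int32_t best = WORST_SCORE;
    int s;

    assert(h != NULL && senscr != NULL);
    for (s = N_STATES - 1; s >= 0; --s) {
        int32_t path = h->score[s] + h->tp_self[s];
        int32_t hist = h->hist[s];

        if (s > 0) {
            int32_t enter = h->score[s - 1] + h->tp_next[s - 1];
            if (enter > path) {
                /* Entering from the previous state wins: the history
                 * is propagated along with the score. */
                path = enter;
                hist = h->hist[s - 1];
            }
        }
        path += senscr[h->senid[s]];
        if (path < WORST_SCORE)
            path = WORST_SCORE; /* keep inactive states from drifting */
        h->score[s] = path;
        h->hist[s] = hist;
        if (path > best)
            best = path;
    }
    return best;
}
```

Note that even this small routine reads the senone score vector and both reads and writes the HMM state in the same loop, which is the kind of interleaved memory access discussed in this section.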
There are four major heap-allocated data structures which are touched by these processes:
- Bitvector and list of active senone IDs
- Vector of senone scores
- HMM states (scores and histories)
- Search graph nodes
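The first of these structures, a bitvector paired with a list of active senone IDs, might look roughly like the sketch below. The bitvector gives a constant-time "already marked?" test and the list lets the scoring step visit only the active senones. N_SENONES, active_senones_t, and the active_* functions are hypothetical names, and a fixed-size struct is used here only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_SENONES 6000 /* hypothetical model size */

typedef struct {
    uint32_t bits[(N_SENONES + 31) / 32]; /* one bit per senone */
    int32_t list[N_SENONES];              /* IDs in order of first marking */
    int32_t n_active;
} active_senones_t;

/* Start a new frame with nothing active. */
void active_clear(active_senones_t *a)
{
    memset(a->bits, 0, sizeof(a->bits));
    a->n_active = 0;
}

/* Mark a senone as active; repeated marks are ignored. */
void active_mark(active_senones_t *a, int32_t sen_id)
{
    uint32_t mask;
    uint32_t *word;

    assert(sen_id >= 0 && sen_id < N_SENONES);
    mask = (uint32_t)1 << (sen_id & 31);
    word = &a->bits[sen_id >> 5];
    if (!(*word & mask)) {
        *word |= mask;
        a->list[a->n_active++] = sen_id;
    }
}

/* Return 1 if the senone has been marked in this frame, else 0. */
int active_test(const active_senones_t *a, int32_t sen_id)
{
    assert(sen_id >= 0 && sen_id < N_SENONES);
    return (a->bits[sen_id >> 5] >> (sen_id & 31)) & 1u;
}
```

Many HMM states share the same senone, so the bit test is what keeps the list free of duplicates when the active HMMs are scanned.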
In addition to that there are read-only (though currently still heap-allocated) structures that are read by them:
- Model parameters
- Language model
- Dictionary, context tables, etc.
I believe that a lot of the slowdown in search occurs because certain parts of the search algorithm touch a number of these bits of memory at the same time or switch back and forth between them in rapid sequence, thus disrupting locality of reference.