
Release Roadmap

PocketSphinx 0.5.1, SphinxBase 0.4.1

To be released soon (September 2008)

This release cleans up a lot of problems with the official release. A partial list follows:

PocketSphinx 0.5, SphinxBase 0.4

Released July 8, 2008.

This version is binary and source incompatible with previous versions. Memory efficiency and thread safety are major goals.

Library SONAMEs are libpocketsphinx.so.1 and libsphinxbase.so.1 as various API calls and data have been removed.

PocketSphinx 0.4, SphinxBase 0.3, Sphinx3 0.7

PocketSphinx 0.4 and SphinxBase 0.3 were released on August 16th, 2007. Library SONAMEs are libpocketsphinx.so.0 and libsphinxutil.so.0.

A partial list of features/requirements follows:

Internals

Memory Usage

Memory usage is currently a lot higher than it needs to be, for two reasons. First, the model data structures, which should be read-only (and therefore memory-mapped and shareable between processes), are not, because some precomputation has to be done on them. Second, a lot of very large search-related data structures are preallocated, along with huge arrays for acoustic features.

The latter is a bit harder to fix, since we really do want to allocate things like the backpointer table in one big chunk of memory to avoid the overhead of calling malloc() for a zillion small objects (a much worse problem on WinCE, with its broken standard library).
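One way to keep the single-chunk property without preallocating the worst case is to start small and grow the chunk on demand. The following is only a sketch of that idea: `bp_entry_t`, `bp_table_t` and the `bp_table_*` functions are hypothetical names invented for this example, not the actual PocketSphinx structures or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical backpointer entry; the fields are illustrative only. */
typedef struct {
    int32_t frame;    /* frame in which the word ended */
    int32_t word_id;  /* word that ended */
    int32_t prev;     /* index of the predecessor entry, -1 if none */
    int32_t score;    /* path score at this point */
} bp_entry_t;

/* A table backed by one contiguous allocation. */
typedef struct {
    bp_entry_t *entries;
    size_t n_used;
    size_t n_alloc;
} bp_table_t;

/* Allocate a small initial chunk; returns 0 on success, -1 on failure. */
int bp_table_init(bp_table_t *t, size_t initial)
{
    assert(initial > 0);
    t->entries = malloc(initial * sizeof(*t->entries));
    t->n_used = 0;
    t->n_alloc = t->entries ? initial : 0;
    return t->entries ? 0 : -1;
}

/* Append an entry, doubling the chunk when it is full.
 * Returns the index of the new entry, or -1 on allocation failure. */
int32_t bp_table_add(bp_table_t *t, int32_t frame, int32_t word_id,
                     int32_t prev, int32_t score)
{
    if (t->n_used == t->n_alloc) {
        size_t n = t->n_alloc ? t->n_alloc * 2 : 16;
        bp_entry_t *p = realloc(t->entries, n * sizeof(*p));
        if (p == NULL)
            return -1;
        t->entries = p;
        t->n_alloc = n;
    }
    bp_entry_t *e = &t->entries[t->n_used];
    e->frame = frame;
    e->word_id = word_id;
    e->prev = prev;
    e->score = score;
    return (int32_t)t->n_used++;
}

/* Reuse the same chunk for the next utterance. */
void bp_table_reset(bp_table_t *t)
{
    t->n_used = 0;
}

void bp_table_free(bp_table_t *t)
{
    free(t->entries);
    t->entries = NULL;
    t->n_used = t->n_alloc = 0;
}
```

Entries refer to their predecessors by index rather than by pointer, so the table stays valid when realloc() moves the chunk.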

Precompiled model structures

The biggest part of the acoustic model, namely the mixture weights, is already read-only (in the form of the "sendump" file which SphinxTrain now knows how to generate, though the file format isn't great).
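As a rough illustration of why read-only model data is attractive, the sketch below maps a whole file read-only with POSIX mmap(), so that its pages are backed by the file and can be shared between processes. The names `map_readonly` and `unmap_readonly` are made up for this example and are not PocketSphinx or SphinxBase API. The sketch does not parse the sendump format, and it would need a different mechanism on platforms without mmap().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only (POSIX only).
 * Returns NULL on failure; on success stores the length in *out_len. */
const void *map_readonly(const char *path, size_t *out_len)
{
    assert(out_len != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    *out_len = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_readonly(). */
void unmap_readonly(const void *p, size_t len)
{
    munmap((void *)p, len);
}
```

Nothing is copied onto the heap here: pages are read in on first access and can be dropped by the kernel under memory pressure, which is only possible if no precomputation has to be written back into the data.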

Precomputing the codebooks is not a great memory saver, since they are very small for semi-continuous and subvq models, but there is another reason we should do it. Currently the only part of fixed-point computation that really increases the error rate over floating-point is GMM computation. The reason is that we use a hard-coded radix point (16.16) for the mean and variance parameters and just cross our fingers that they won't exceed that range, so we lose a lot of precision in the calculations. If we precompute the codebooks, we will know the range of values ahead of time and can choose an optimal quantization (although for speed we might just express it as a bit shift and bias term).
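To make the quantization idea concrete, here is a minimal sketch of choosing the radix point from the data instead of hard-coding 16.16. It covers only the bit shift, not the bias term, and it leaves one bit of headroom below the int32 limit as an arbitrary safety margin. The function names `choose_frac_bits`, `to_fixed` and `from_fixed` are invented for this example and do not exist in PocketSphinx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the largest number of fractional bits (at most 30) such that the
 * largest magnitude in vals[], once scaled, stays below 2^30. */
int choose_frac_bits(const float *vals, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = vals[i] < 0.0f ? -vals[i] : vals[i];
        if (a > max_abs)
            max_abs = a;
    }
    int bits = 0;
    float scaled = max_abs;
    while (bits < 30 && scaled * 2.0f < 1073741824.0f) { /* 2^30 */
        scaled *= 2.0f;
        ++bits;
    }
    return bits;
}

/* Convert to fixed point with the given number of fractional bits,
 * rounding to nearest. The caller must ensure the value fits. */
int32_t to_fixed(float v, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 30);
    double scaled = (double)v * (double)(1u << frac_bits);
    return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Convert back to floating point, e.g. to measure quantization error. */
float from_fixed(int32_t q, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 30);
    return (float)((double)q / (double)(1u << frac_bits));
}
```

With a fixed 16.16 format, a parameter much smaller than 1/65536 collapses to 0 or 1, whereas a shift chosen from the actual range of a precomputed codebook keeps more significant bits for the same 32-bit storage.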

Parts of the language model are memory mapped and other parts aren't. This code is a terrible mess, leaks (a bit of) memory, and also needs to be merged with Sphinx3.

Loading the dictionary is the most time-consuming part of initialization and it is all heap-allocated. The dictionary itself is just a hash table and not a huge one (although we could precompile it with a perfect hash function - [http://cmph.sourceforge.net/index.html CMPH] could be used for this), but the "context tables" which determine the set of initial and final triphones for each word are very large and take a long time to build. They also should ideally be built with reference to the original decision trees for the acoustic model, so that they don't just back off to context-independent phones for unknown triphones.

While the structure of the lexicon tree is fixed for any given dictionary, it can't be read-only because it needs to reference all of the HMM structures. Building it from the dictionary is probably not much more time-consuming than reading it from a file, so this is a low priority.

Search Optimization

Algorithmically speaking, the first-pass search in PocketSphinx is about as fast as it can possibly be. Any optimizations to this component will have to be carried out at the level of HMM evaluation (which is already where PocketSphinx spends most of its time for moderately-sized vocabularies). This work is being carried on in the general framework of [:BaseHMM:Merging and Optimizing HMM implementations] between all Sphinx decoders.

In general, decoding is highly dependent on memory bandwidth. The reason is that the combination of acoustic models and search graph is too large to fit in most processors' caches, and we usually end up touching every part of the model/HMM space in the course of large-vocabulary search. For this reason, instruction-level optimization of HMM evaluation isn't as useful as you might think.

There are four major processes associated with each frame of search, each of which consumes a roughly equal amount of time:

There are four major heap-allocated data structures which are touched by these processes:

In addition to that there are read-only (though currently still heap-allocated) structures that are read by them:

I believe that a lot of the slowdown in search occurs because certain parts of the search algorithm touch a number of these bits of memory at the same time or switch back and forth between them in rapid sequence, thus disrupting locality of reference.
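One possible way to improve locality, offered purely as an illustration and not as a description of the current code, is to split a step into phases that each walk a single dense array. In the sketch below a pruning pass reads only an array of best scores and writes a compact list of surviving HMM indices; a later pass could then visit just those HMMs. The array layout and the name `collect_active` are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a dense array of per-HMM best scores and collect the indices of
 * HMMs that survive the beam threshold. Only best_score[] and
 * active_out[] are touched, both sequentially.
 * active_out must have room for n_hmm entries. Returns the count. */
size_t collect_active(const int32_t *best_score, size_t n_hmm,
                      int32_t threshold, uint32_t *active_out)
{
    assert(best_score != NULL && active_out != NULL);
    size_t n_active = 0;
    for (size_t i = 0; i < n_hmm; ++i) {
        if (best_score[i] >= threshold)
            active_out[n_active++] = (uint32_t)i;
    }
    return n_active;
}
```

Whether this helps in practice depends on how the real structures are laid out, so it would need to be measured rather than assumed.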

DHDWiki: PocketSphinx (last edited 2008-09-28 22:56:59 by DavidHugginsDaines)