SphinxTrain is a pretty good trainer for basic semi-continuous and continuous models, and it is fairly well-written, but it is getting crufty in spots.  It is also perhaps too opaque to the end user, which leads to people not fully understanding the training process or how to modify it.  Most importantly, it does not support some techniques that have been crucial in improving the accuracy of large-vocabulary continuous speech recognition (LVCSR) over the last 10 years.

I am planning a partial rewrite to address these concerns, to make it more easily scriptable and maintainable, and to improve its performance.  Because the code is well-written for the most part, and the design is okay, I don't believe it needs to be rewritten from scratch, just "renovated" in some ways, so that it doesn't turn into a total mess when we add new features.  Plans will hopefully take shape here.

== Benchmarks and Regression Tests ==

=== Resource Management ===

=== Wall Street Journal ===

== Maintenance Tasks ==

=== Fix the configuration interface ===

Actually we need to write one in the first place.  The command-line/argument-file interface is frustrating to new users and is a big barrier to people being able to write their own scripts for training.  We should be able to specify parameters in a simple, human-readable text format and override them on the command line if necessary.  The parameters should continue to be self-documenting.
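
As a rough illustration of the kind of interface meant here, the sketch below reads parameters from a plain <tt>name = value</tt> text format and lets command-line flags override them.  The file syntax, the <tt>PARAM_DEFS</tt> table, and every parameter name in it are hypothetical, not existing SphinxTrain options; the help strings are what would keep the parameters self-documenting.

```python
import argparse

# Hypothetical parameter table: name -> (type, default, help text).
# None of these names are real SphinxTrain options.
PARAM_DEFS = {
    "n_iterations": (int, 10, "maximum number of Baum-Welch iterations"),
    "n_densities": (int, 8, "number of Gaussians per state"),
    "convergence_ratio": (float, 0.04, "stop when the likelihood gain falls below this"),
}


def parse_config_text(text):
    """Parse 'name = value' lines; '#' starts a comment, blank lines are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError("expected 'name = value', got: %r" % raw)
        name, value = line.split("=", 1)
        values[name.strip()] = value.strip()
    return values


def load_params(config_text, argv):
    """Merge defaults, config-file values and command-line overrides (in that order)."""
    parser = argparse.ArgumentParser(description="hypothetical trainer front end")
    for name, (typ, _default, help_text) in PARAM_DEFS.items():
        # default=None lets us tell "not given on the command line" apart from a value.
        parser.add_argument("--" + name, type=typ, default=None, help=help_text)
    overrides = vars(parser.parse_args(argv))

    file_values = parse_config_text(config_text)
    unknown = sorted(set(file_values) - set(PARAM_DEFS))
    if unknown:
        raise ValueError("unknown parameters in config: %s" % ", ".join(unknown))

    params = {}
    for name, (typ, default, _help) in PARAM_DEFS.items():
        if overrides[name] is not None:
            params[name] = overrides[name]          # command line wins
        elif name in file_values:
            params[name] = typ(file_values[name])   # then the config file
        else:
            params[name] = default                  # then the built-in default
    return params
```

With a config text containing <tt>n_densities = 16</tt>, calling <tt>load_params(text, ["--n_densities", "32"])</tt> yields 32 for that parameter, while anything not mentioned keeps its default.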

=== Rehost on SphinxBase libraries ===

Various file I/O functions are duplicated between SphinxTrain and SphinxBase.  They should live only in the common library, and SphinxTrain should use them from there.

=== Fix bad code in SphinxBase libraries ===

We need a proper library of basic data types and algorithms (e.g. strings, lists, hashes, things like that), so that we can stop using <tt>sprintf()</tt> and fixed-size buffers for everything, among other things.  One thing that makes [http://www.cmuflite.org/ CMU Flite] nice to work with is that it has some very simple dynamic object types built in which are used extensively.  We have some okay hash table, array, and list implementations, but we don't use them consistently.

If there were a good, fast, lightweight external library with a sufficiently liberal license which provided these functions to us, it would make sense to use it.  Unfortunately I don't know of one.  [http://gtk.org/ GLib] is '''not''' an option here.  If we were using C++, the STL and Boost would fit the bill very nicely, but there are unfortunately a lot of very good reasons not to use C++ for anything, ever.

If we were using Java or C#, the standard libraries would help us a lot, but then we'd be locked into that particular platform, and scriptability would suffer.  It's an open question how bad this would be in the long run: for Java, there is [http://www.jython.org/Project/index.html Jython], while for C# there is [http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython IronPython].  One drawback that I can see for Java is the lack of free, high-performance math libraries.

In general we wish to avoid entangling our users in a thick forest of external dependencies.  Since a lot of things can be carried over from our other projects without much difficulty, I see it as no huge problem to simply roll our own in good old-fashioned C89, as long as we keep it simple.  See [[SphinxBase]] for more details.

=== Vectorize math operations ===

A lot of operations in the trainer are implemented as nested loops when in fact they are either element-wise multiplications of vectors/arrays or dot products.  If we use the BLAS for these, then we can automatically take advantage of machine-optimized vector code such as [http://math-atlas.sourceforge.net ATLAS].  Some of the basic interfaces for this have already been created, but they don't use the BLAS yet.  Part of the problem is that we may need to convert things to column-major format.  But most of the matrices we use are real symmetric anyway, so in that case row-major and column-major storage are identical and it doesn't matter!
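
To make the point concrete, here is a small pure-Python sketch (it does not call the BLAS, and all function names are made up for illustration).  It shows a typical loop-style computation, the variance-weighted squared distance used by a diagonal Gaussian, rewritten as an element-wise product followed by a dot product, and it checks that a symmetric matrix flattens to the same buffer in row-major and column-major order.

```python
def weighted_sq_dist_loop(x, mean, inv_var):
    """Loop form: sum_i (x_i - mean_i)^2 * inv_var_i."""
    total = 0.0
    for i in range(len(x)):
        d = x[i] - mean[i]
        total += d * d * inv_var[i]
    return total


def dot(a, b):
    """Plain dot product; this is the operation a BLAS dot routine would replace."""
    return sum(ai * bi for ai, bi in zip(a, b))


def weighted_sq_dist_dot(x, mean, inv_var):
    """Same quantity as two vector operations: element-wise square, then a dot product."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    squared = [d * d for d in diff]      # element-wise multiplication
    return dot(squared, inv_var)         # dot product


def flatten_row_major(m):
    """Flatten a list-of-rows matrix row by row (C order)."""
    return [v for row in m for v in row]


def flatten_col_major(m):
    """Flatten a list-of-rows matrix column by column (Fortran/BLAS order)."""
    n_rows, n_cols = len(m), len(m[0])
    return [m[r][c] for c in range(n_cols) for r in range(n_rows)]


def same_in_both_layouts(m):
    """True when the row-major and column-major buffers are identical."""
    return flatten_row_major(m) == flatten_col_major(m)
```

For a symmetric matrix such as <tt>[[2.0, 1.0], [1.0, 3.0]]</tt> the two layouts coincide, so no conversion is needed; for a general matrix such as <tt>[[1.0, 2.0], [3.0, 4.0]]</tt> they differ, and a transpose (or a transpose flag on the BLAS side) would be required.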

=== Refactor and enhance <tt>bw</tt> ===

The <tt>bw</tt> program is the most important part of the entire trainer, since it implements the Forward-Backward algorithm to collect expected observation counts and other statistics needed for training models.  It does this rather well, but it has been extended in various ad-hoc ways to support new modes of training and adaptation, and its command line has grown rather large, unwieldy, and often cryptic.

The forward and backward algorithms and the collection of sufficient statistics need to be more cleanly separated.  Currently there is an implicit assumption that the models used to run forward-backward are the same as the ones which will be re-estimated using the resulting counts.  We have broken this assumption in various ways in order to allow speaker-adaptive training and full covariance matrices, but the resulting code is basically a hack.
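
One way to picture the separation is sketched below.  This is a toy under stated assumptions, not how <tt>bw</tt> is structured: the function and class names are hypothetical, there is no scaling or log arithmetic, and the observation likelihoods are passed in precomputed.  The point is only that the forward-backward pass produces state posteriors from one model, while a separate accumulator turns those posteriors into statistics for whatever model is being re-estimated, possibly over different features.

```python
def forward(init, trans, obs_lik):
    """alpha[t][j] = P(o_1..o_t, state_t = j); obs_lik[t][j] = P(o_t | state j)."""
    n = len(init)
    alpha = [[init[j] * obs_lik[0][j] for j in range(n)]]
    for t in range(1, len(obs_lik)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(n)) * obs_lik[t][j]
            for j in range(n)
        ])
    return alpha


def backward(trans, obs_lik):
    """beta[t][i] = P(o_{t+1}..o_T | state_t = i)."""
    n, n_frames = len(trans), len(obs_lik)
    beta = [[1.0] * n for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        beta[t] = [
            sum(trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    return beta


def state_posteriors(init, trans, obs_lik):
    """gamma[t][j] = P(state_t = j | all observations), from the alignment model only."""
    alpha = forward(init, trans, obs_lik)
    beta = backward(trans, obs_lik)
    total = sum(alpha[-1])  # likelihood of the whole observation sequence
    return [[a * b / total for a, b in zip(a_t, b_t)]
            for a_t, b_t in zip(alpha, beta)]


class MeanAccumulator:
    """Sufficient statistics for 1-D Gaussian means.

    It only sees posteriors and frames, so the model being re-estimated does
    not have to be the model that produced the posteriors.
    """

    def __init__(self, n_states):
        self.occ = [0.0] * n_states      # sum_t gamma[t][j]
        self.sum_x = [0.0] * n_states    # sum_t gamma[t][j] * x_t

    def accumulate(self, gamma, frames):
        for gamma_t, x in zip(gamma, frames):
            for j, g in enumerate(gamma_t):
                self.occ[j] += g
                self.sum_x[j] += g * x

    def means(self):
        return [s / o for s, o in zip(self.sum_x, self.occ)]
```

In this arrangement the <tt>frames</tt> handed to the accumulator need not be the features the likelihoods were computed from, which is the kind of decoupling that speaker-adaptive training relies on.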

The Forward and Viterbi algorithms are now done synchronously, which allows us to do Viterbi training without repeatedly force-aligning the data.  However, <tt>bw</tt> isn't able to handle pronunciation variants.  This '''should''' be a simple matter of extending the <tt>state_seq</tt> structure to allow it.

We'd really like to be able to do forward-backward over arbitrary DAGs.  This may be necessary for discriminative training.

=== Refactor I/O interfaces for scriptability ===

The current pattern for I/O interfaces (e.g. <tt>s3io.h</tt> and friends) is difficult to wrap in object-oriented scripting languages, because it uses output parameters to return the created "objects", and SWIG doesn't know what to do with these.  We should flip this around and return the created object directly, which I think is already what Sphinx3.x does.

== New features ==

=== Python scripting ===

[http://numpy.scipy.org/ Numeric Python] and [http://scipy.org/ Scientific Python] have a lot of powerful tools that are useful for rapid prototyping and experimenting with acoustic modeling techniques.  I've already added some Python code that allows you to manipulate existing acoustic models, but we should be able to hook directly into the "heart" of the trainer with Python scripts.  That is, we should be able to re-use the core of <tt>bw</tt> in Python scripts without having to write out temporary files and run external binaries.  This would make it easier to experiment with different objective functions for discriminative training and suchlike things.

=== Discriminative Training ===

We need an implementation of Extended Baum-Welch.  We also need lattice-based forward-backward in order to compute denominator statistics for MMIE and friends.
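
For orientation, the sketch below shows the commonly cited Extended Baum-Welch update for a single one-dimensional Gaussian, given numerator and denominator statistics.  It is a textbook-style form, not code from SphinxTrain; the <tt>Stats</tt> type, the function name, and the smoothing constant <tt>d</tt> are illustrative assumptions, and how <tt>d</tt> should be chosen is left open.

```python
from collections import namedtuple

# Hypothetical container for per-Gaussian statistics (scalar case):
#   occ    = sum_t gamma_t
#   sum_x  = sum_t gamma_t * x_t
#   sum_xx = sum_t gamma_t * x_t^2
Stats = namedtuple("Stats", ["occ", "sum_x", "sum_xx"])


def ebw_update(num, den, old_mean, old_var, d):
    """Extended Baum-Welch update of one Gaussian's mean and variance.

    num, den: numerator (reference) and denominator (competing) statistics.
    d: smoothing constant; it must be large enough that the normaliser and
       the resulting variance stay positive.
    """
    norm = num.occ - den.occ + d
    if norm <= 0.0:
        raise ValueError("d is too small: normaliser is not positive")
    mean = (num.sum_x - den.sum_x + d * old_mean) / norm
    second_moment = (num.sum_xx - den.sum_xx
                     + d * (old_var + old_mean ** 2)) / norm
    var = second_moment - mean ** 2
    if var <= 0.0:
        raise ValueError("d is too small: variance is not positive")
    return mean, var
```

Two sanity checks follow from the formula: when the numerator and denominator statistics are identical the parameters do not move, and with empty denominator statistics and <tt>d = 0</tt> the update reduces to the ordinary maximum-likelihood estimate.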

=== Speaker-Adaptive Training ===

This sort of exists right now, but it doesn't work all that well.  What we have (single-class MLLR based on inverse transforms) needs to be debugged; then we need to look at constrained MLLR for feature-space adaptation, and so on.

=== GPGPU Acceleration ===

After having vectorized some of the inner-loop computations we should be able to run them on GPUs to get a nice speed boost.  GMM computation in particular can really benefit from this.
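
As a CPU-only reference for what that inner loop computes (no GPU code here, and the function names are illustrative), the sketch below evaluates the log-likelihood of one frame under a diagonal-covariance GMM.  Each component's term is independent of the others, and so is each frame, which is what makes this computation a natural fit for data-parallel hardware.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    acc = 0.0
    for xi, mi, vi in zip(x, mean, var):
        acc += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * acc


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp for stability."""
    comp = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(comp)
    # Subtracting the largest term keeps exp() from underflowing to zero.
    return top + math.log(sum(math.exp(c - top) for c in comp))
```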

DHDWiki: SphinxTrain (last edited 2007-11-07 06:02:03 by localhost)