Acoustic Model Quantization in PocketSphinx
Data path
We need to determine how many significant bits are necessary at each step of the data path. Currently, using fast SCHMM computation in fixed-point mode (which I think should simply become the default for PocketSphinx), we have:
- Acoustic data - signed 16-bit
- FFT output - Q15 or Q31
- Log power spectrum - Q19.12
- MFCC - Q19.12
- Log Euclidean distance - Q19.12
- Log Mahalanobis distance - signed 32-bit
- Gaussian densities - signed 32-bit
- Normalized Gaussian densities - unsigned 8-bit (actual range is only 0-96!)
- Acoustic scores - signed 32-bit (but always negative)
- Language model scores - signed 32-bit (but also always negative)
- Viterbi path scores - signed 32-bit (but also always negative)
Clearly it is a little silly to use 32 bits everywhere when there is a "bottleneck" at the Gaussian mixture computation that throws away all but about 6.5 bits of information (the 0-96 range of the normalized densities). The question is whether we need those extra bits for some of the intermediate values in order to avoid cascading errors.
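To make the Q notation and the bottleneck figure concrete, here is a minimal Python sketch. It assumes Q19.12 means a signed 32-bit word with 12 fractional bits and Q15 means a signed 16-bit word with 15 fractional bits; the helper names (to_fixed, from_fixed, info_bits) are made up for illustration and are not PocketSphinx functions.

```python
import math


def to_fixed(x, frac_bits, total_bits=32):
    """Convert a float to signed fixed point, saturating at the word limits."""
    scaled = int(round(x * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def from_fixed(v, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return v / (1 << frac_bits)


def info_bits(n_levels):
    """Bits of information carried by a value with n_levels distinct levels."""
    return math.log2(n_levels)


# Q19.12 (assumed layout): 12 fractional bits in a 32-bit word.
mfcc_q = to_fixed(-3.75, 12)

# Q15 (assumed layout): 15 fractional bits in a 16-bit word; 1.0 saturates.
fft_q = to_fixed(1.0, 15, total_bits=16)

# The normalized densities take values 0..96, i.e. 97 levels.
bottleneck = info_bits(97)
```

The last value comes out between 6.5 and 6.7 bits, which is where the "about 6.5 bits" figure comes from.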
For the FFT, I'm pretty sure that Q15 is sufficient. The log power spectrum and MFCC have the same dynamic range as each other (albeit a different one from the FFT), and for them we can definitely get away with 16 bits; it's just a question of how we use them. This is the right place to introduce some form of simple linear quantization. The maximum range of these values (based on C0) can actually be computed ahead of time from the FFT size, but it is also reasonable to compute it from the global mean and variance of the acoustic model.
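One way the linear quantization could look, as a rough sketch: pick a range around the global mean, divide it into 2^16 - 1 equal steps, and clip anything outside. The choice of plus or minus 4 standard deviations and the function name make_quantizer are assumptions for illustration only; nothing here is taken from the PocketSphinx code.

```python
def make_quantizer(mean, std, n_sigma=4.0, bits=16):
    """Build a linear quantizer covering mean +/- n_sigma * std.

    Returns (quantize, dequantize, step). The range rule is an assumed
    heuristic, not something derived from a real acoustic model.
    """
    lo = mean - n_sigma * std
    hi = mean + n_sigma * std
    levels = (1 << bits) - 1          # highest code, e.g. 65535 for 16 bits
    step = (hi - lo) / levels

    def quantize(x):
        # Round to the nearest code and clip values outside the range.
        q = int(round((x - lo) / step))
        return max(0, min(levels, q))

    def dequantize(q):
        # Map a code back to the value it represents.
        return lo + q * step

    return quantize, dequantize, step


# Hypothetical global statistics for one cepstral coefficient.
quantize, dequantize, step = make_quantizer(mean=0.0, std=10.0)
code = quantize(12.345)
error = abs(dequantize(code) - 12.345)   # at most step / 2 for in-range input
```

For values inside the range the reconstruction error is bounded by half a step; out-of-range values are clipped, so the range has to be chosen wide enough that clipping is rare.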
The "baseline" in the future, without any kind of sophisticated quantization, should look more like this:
- Acoustic data - signed 16-bit
- FFT output - Q15
- Log power spectrum - 16 bits
- MFCC - 16 bits, linear quantization
- Log Euclidean distance - 8 bits?
- Log Mahalanobis distance - 8 bits?
- Gaussian densities - 8 bits?
- Normalized Gaussian densities - unsigned 8-bit (actual range is only 0-96!)
- Acoustic scores - unsigned 16-bit (negated log-probs)
- Language model scores - unsigned 16-bit (negated log-probs)
- Viterbi path scores - unsigned 16-bit (negated log-probs)
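As a rough illustration of the last three entries: if log-probs are always negative, they can be stored negated as unsigned costs, Viterbi's "max over log-probs" becomes "min over costs", and addition has to saturate at 65535 instead of wrapping. The sketch below uses hypothetical names (to_cost, add_cost, viterbi_step) and toy numbers; it is not how PocketSphinx's search is actually written.

```python
U16_MAX = 0xFFFF  # largest unsigned 16-bit value


def to_cost(logprob):
    """Negate a non-positive integer log-prob and saturate to 16 bits."""
    if logprob > 0:
        raise ValueError("log-probs are expected to be <= 0")
    return min(-logprob, U16_MAX)


def add_cost(a, b):
    """Saturating addition of two unsigned 16-bit costs."""
    return min(a + b, U16_MAX)


def viterbi_step(prev_costs, trans_costs, acoustic_costs):
    """One frame of Viterbi over negated log-probs.

    trans_costs[i][j] is the cost of moving from state i to state j.
    The best path has the *smallest* cost.
    """
    new_costs = []
    for j, acoustic in enumerate(acoustic_costs):
        best = min(add_cost(prev_costs[i], trans_costs[i][j])
                   for i in range(len(prev_costs)))
        new_costs.append(add_cost(best, acoustic))
    return new_costs


# Toy two-state example with made-up costs.
costs = viterbi_step([0, 100], [[10, 50], [20, 5]], [7, 9])
```

Path costs only grow from frame to frame, so with 16 bits some kind of periodic renormalization would presumably be needed to stay away from saturation; that is not handled here.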
Say that the dynamic range of the acoustic data is D bits, and the FFT order is M (assuming "order" means the base-2 log of the FFT size). Then the FFT output will have a dynamic range of M+D bits, and the log of the FFT output will have a dynamic range of log2(M+D) bits (but should still require D bits of precision). To obtain the power spectrum we square the absolute value of the DFT, which doubles the value of the log and so gives the log power spectrum a dynamic range of log2(M+D) + 1 bits.
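A quick worked check of this arithmetic, as a sketch under the assumptions that "order M" means a 2^M-point FFT and that all logs are base 2. The function names are invented for the example.

```python
import math


def fft_output_bits(d_bits, fft_order):
    """Dynamic range of the FFT output: input bits plus one bit per stage."""
    return d_bits + fft_order


def log_power_integer_bits(d_bits, fft_order):
    """Integer bits needed for log2 of the power spectrum.

    Squaring the magnitude doubles its log, so the log power spectrum
    lies below 2 * (M + D) and needs ceil(log2(2 * (M + D))) integer bits,
    i.e. roughly log2(M + D) + 1.
    """
    return math.ceil(math.log2(2 * (d_bits + fft_order)))


# 16-bit audio through a 512-point FFT (order 9).
D, M = 16, 9
print(fft_output_bits(D, M))          # 25
print(log_power_integer_bits(D, M))   # 6
```

With 16-bit audio and a 512-point FFT the log power spectrum needs only about 6 integer bits, so in a 16-bit word most of the bits are left over for the fractional part.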
References
Nokia evaluation of different model compression techniques
SDCHMM papers (original, training)
qHMM papers (original, adaptation)
PocketSUMMIT paper