SpeechRecognizer
We are using the PocketSphinx engine with phonetically tied mixture (PTM) models. We run PocketSphinx in 3-pass mode in order to get reasonable accuracy (approaching that of Sphinx3) and to be able to output word posterior probabilities.
In order to do unsupervised adaptation with a minimum of hacky scripting, I am adding a fourth pass to PocketSphinx, which performs state posterior probability calculation and extracts sufficient statistics for MLLR and MAP adaptation. This is a very simple process, which consists of running straight Viterbi on an HMM composed from the 1-best transcription (we may wish to use the pruned word lattice in the future).
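As a rough illustration of what the fourth pass collects, here is a minimal sketch in plain Python. It assumes the Viterbi alignment is available as one state ID per frame, so each frame counts with posterior 1 toward exactly one state. The names `accumulate_stats` and `map_update_mean` and the prior weight `tau` are hypothetical and are not part of PocketSphinx. Only the counts and first-order sums needed for a MAP mean update are shown; MLLR estimation needs further accumulators weighted by the inverse variances, which are left out here.

```python
from collections import defaultdict


def accumulate_stats(alignment, frames):
    """Accumulate per-state occupancy counts and first-order sums.

    alignment[t] is the (hypothetical) state ID that the Viterbi path
    assigns to frame t; frames[t] is that frame's feature vector.
    """
    counts = defaultdict(float)
    sums = {}
    for state, frame in zip(alignment, frames):
        # With a hard Viterbi alignment the state posterior is 1.
        counts[state] += 1.0
        if state not in sums:
            sums[state] = [0.0] * len(frame)
        acc = sums[state]
        for d, x in enumerate(frame):
            acc[d] += x
    return dict(counts), sums


def map_update_mean(prior_mean, count, total, tau=10.0):
    """MAP re-estimate of a mean; tau is an assumed prior weight."""
    return [(tau * m + s) / (tau + count) for m, s in zip(prior_mean, total)]
```

With a small `tau` the adapted mean moves quickly toward the observed data; with a large `tau` it stays close to the speaker-independent mean.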
Vocal tract length normalization (VTLN) is done on the first speech segment for each speaker ID, simply by choosing the warping factor that maximizes the decoding likelihood.
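The warping-factor search can be sketched as a plain grid search. The `decode` callable below is a hypothetical stand-in for a decoder run that returns the log-likelihood of the best hypothesis; the grid range and step in `warp_grid` are assumptions, not values taken from our setup.

```python
def warp_grid(lo=0.80, hi=1.20, step=0.02):
    """Evenly spaced candidate warping factors (range is an assumption)."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 4) for i in range(n + 1)]


def pick_warp_factor(decode, segment, warp_factors):
    """Return (warp, score) for the warp with the highest likelihood.

    decode(segment, warp) is a hypothetical callable returning the
    log-likelihood of the best hypothesis under that warping factor.
    """
    best_warp, best_score = None, float("-inf")
    for warp in warp_factors:
        score = decode(segment, warp)
        if score > best_score:
            best_warp, best_score = warp, score
    return best_warp, best_score
```

Each candidate costs one full decode of the segment, which is one reason to restrict the search to the first segment (or the first N seconds) per speaker ID.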
For language modeling we use a Gigaword 64k background model interpolated with a talk-specific language model. It is not yet clear how to set the interpolation weights; one option is to hold out some of the input text and tune the weights on it.
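If we do hold out some text, one standard way to tune a two-model interpolation weight is EM on the held-out words. The sketch below is an illustration of that option, not a decision. It assumes we can obtain, for each held-out word, the probability each model assigns to it given its history, and that both models are smoothed so that no word gets zero probability from both. The function name and its arguments are hypothetical.

```python
def estimate_interpolation_weight(p_background, p_talk, iterations=50, lam=0.5):
    """Estimate the talk-specific model's weight by EM on held-out text.

    p_background[i] and p_talk[i] are the probabilities the two models
    assign to the i-th held-out word given its history (assumed inputs).
    Returns lam, where P(w) = lam * P_talk(w) + (1 - lam) * P_background(w).
    """
    if len(p_background) != len(p_talk) or not p_talk:
        raise ValueError("need two non-empty lists of equal length")
    for _ in range(iterations):
        # E-step: posterior that each word came from the talk model.
        post = [
            lam * pt / (lam * pt + (1.0 - lam) * pb)
            for pb, pt in zip(p_background, p_talk)
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam
```

The held-out text has to be excluded from training the talk-specific model, otherwise the estimated weight will be biased toward that model.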
The flow of recognition is therefore as follows:
- For each speaker ID (many of these correspond to the same speaker, but...)
  - Decode first segment (or first N seconds of audio) with all warping factors
  - Store optimal warping factor in speaker database for this lecture (MLLR goes in here too)
- Using VTLN, decode all segments, accumulating MLLR statistics for each speaker
- Run MLLR using those statistics (use the mllr.py implementation to avoid SphinxTrain)
- Re-run decoding with VTLN and MLLR
We may consider extra passes of MLLR and cross-adaptation if necessary.
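The flow described above can be sketched as a small driver. Everything here is a stand-in: `decode(segment, warp, mllr)` is a hypothetical callable returning a `(hypothesis, log_likelihood, stats)` tuple, `estimate_mllr(stats_list)` is a hypothetical wrapper around the MLLR estimation step, and the dictionary layout of the speaker database is an assumption.

```python
def recognize_lecture(segments_by_speaker, decode, estimate_mllr, warp_factors):
    """Run the VTLN -> MLLR -> re-decode flow for one lecture.

    segments_by_speaker maps each speaker ID to its list of segments.
    """
    speaker_db = {}

    # Pick a warping factor per speaker ID from the first segment.
    for spk, segments in segments_by_speaker.items():
        first = segments[0]
        warp = max(warp_factors, key=lambda w: decode(first, w, None)[1])
        speaker_db[spk] = {"warp": warp, "mllr": None}

    # Decode all segments with VTLN, keeping the adaptation statistics.
    stats = {
        spk: [decode(seg, speaker_db[spk]["warp"], None)[2] for seg in segments]
        for spk, segments in segments_by_speaker.items()
    }

    # Estimate one MLLR transform per speaker ID from those statistics.
    for spk in speaker_db:
        speaker_db[spk]["mllr"] = estimate_mllr(stats[spk])

    # Re-run decoding with both VTLN and MLLR applied.
    results = {
        spk: [
            decode(seg, speaker_db[spk]["warp"], speaker_db[spk]["mllr"])[0]
            for seg in segments
        ]
        for spk, segments in segments_by_speaker.items()
    }
    return results, speaker_db
```

An extra pass of MLLR would amount to repeating the last two steps with the statistics taken from the adapted decode.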
