SpeechRecognizer

We use the PocketSphinx engine with phonetically tied mixture (PTM) acoustic models. PocketSphinx is run in 3-pass mode, which gives reasonable (Sphinx3-approximating) accuracy and lets us output word posterior probabilities.

To do unsupervised adaptation with a minimum of hacky scripting, I am adding a fourth pass to PocketSphinx that calculates state posterior probabilities and extracts sufficient statistics for MLLR and MAP adaptation. The process is very simple: it runs straight Viterbi on an HMM composed from the 1-best transcription (in the future we may wish to use the pruned word lattice instead).
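The fourth pass described above can be sketched roughly as follows. This is only a toy illustration, not the PocketSphinx code: it assumes a strictly left-to-right chain of states, scalar features, and one Gaussian per state, and the names `gaussian_loglike`, `viterbi_align`, and `accumulate_stats` are made up for this example. Because the alignment is a hard Viterbi alignment, each frame contributes a posterior of 1 to exactly one state, so the sufficient statistics reduce to per-state counts and sums.

```python
import math


def gaussian_loglike(x, mean, var):
    """Log density of a 1-D Gaussian (toy stand-in for a real state score)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_align(frame_loglikes, log_self, log_next):
    """Align frames to a left-to-right state chain.

    frame_loglikes[t][s] is log p(frame t | state s). The path must start
    in state 0 and end in the last state. Returns one state index per frame.
    """
    num_frames = len(frame_loglikes)
    num_states = len(frame_loglikes[0])
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = frame_loglikes[0][0]
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = score[t - 1][s] + log_self
            move = score[t - 1][s - 1] + log_next if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + frame_loglikes[t][s]
                back[t][s] = s - 1
            else:
                score[t][s] = stay + frame_loglikes[t][s]
                back[t][s] = s
    # Trace back from the final state at the last frame.
    path = [num_states - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def accumulate_stats(frames, path, num_states):
    """Per-state occupancy count, first-order sum, and second-order sum."""
    stats = [{"count": 0.0, "sum": 0.0, "sumsq": 0.0} for _ in range(num_states)]
    for x, s in zip(frames, path):
        stats[s]["count"] += 1.0  # hard alignment: posterior is 1
        stats[s]["sum"] += x
        stats[s]["sumsq"] += x * x
    return stats


# Toy usage: two states with means 0 and 5, unit variance.
frames = [0.0, 0.1, -0.1, 5.0, 5.2, 4.9]
means = [0.0, 5.0]
loglikes = [[gaussian_loglike(x, m, 1.0) for m in means] for x in frames]
path = viterbi_align(loglikes, math.log(0.5), math.log(0.5))
stats = accumulate_stats(frames, path, len(means))
```

Moving to the pruned word lattice would replace the hard counts with fractional posteriors, but the accumulators themselves would keep the same shape.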

VTLN is estimated on the first speech segment for each speaker ID, simply by choosing the warp factor that maximizes the decoding likelihood.
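A minimal sketch of this kind of VTLN search, assuming a plain grid search over candidate warp factors. The function names, the warp range 0.80 to 1.20, and the step size are all assumptions for illustration, and `score_fn` is a hypothetical callback that would decode the speaker's first segment with the given warp factor and return the decoding log-likelihood.

```python
def select_warp_factor(score_fn, low=0.80, high=1.20, step=0.02):
    """Return (warp, score) for the warp factor with the highest score."""
    num_steps = int(round((high - low) / step))
    best_warp, best_score = None, float("-inf")
    for i in range(num_steps + 1):
        warp = round(low + i * step, 4)
        score = score_fn(warp)  # hypothetical: decode and return log-likelihood
        if score > best_score:
            best_warp, best_score = warp, score
    return best_warp, best_score


def warp_for_speaker(cache, speaker_id, score_fn):
    """Estimate the warp factor once per speaker ID, then reuse it."""
    if speaker_id not in cache:
        warp, _ = select_warp_factor(score_fn)
        cache[speaker_id] = warp
    return cache[speaker_id]
```

Each candidate costs one decode of the segment, so the grid resolution trades accuracy of the warp estimate against extra decoding time on that first segment.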

For language modeling we use a Gigaword 64k background model interpolated with a talk-specific language model. It is not yet clear how to set the interpolation weights; one option would be to hold out some of the input text and tune the weights on it.
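One way the held-out idea could work, sketched under assumptions: given the probability that each model assigns to every held-out token, the usual EM update for a two-way linear interpolation finds the weight that maximizes held-out likelihood. The function names and the input format (two parallel lists of per-token probabilities) are made up for this example; nothing here is tied to a particular language modeling toolkit.

```python
import math


def estimate_interpolation_weight(p_background, p_talk, iterations=50, weight=0.5):
    """EM estimate of w in p(token) = w * p_talk + (1 - w) * p_background.

    p_background[i] and p_talk[i] are the probabilities each model assigns
    to the i-th held-out token (all assumed to be strictly positive).
    """
    if not p_talk or len(p_talk) != len(p_background):
        raise ValueError("need two non-empty lists of equal length")
    for _ in range(iterations):
        posterior_sum = 0.0
        for pb, pt in zip(p_background, p_talk):
            mix = weight * pt + (1.0 - weight) * pb
            posterior_sum += weight * pt / mix  # P(talk model | token)
        weight = posterior_sum / len(p_talk)
    return weight


def perplexity(p_background, p_talk, weight):
    """Held-out perplexity of the interpolated model."""
    log_sum = sum(
        math.log(weight * pt + (1.0 - weight) * pb)
        for pb, pt in zip(p_background, p_talk)
    )
    return math.exp(-log_sum / len(p_talk))
```

The obvious cost is that the held-out portion of the talk text cannot also be used to train the talk-specific model, unless that model is retrained on the full text once the weight is fixed.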

The flow of recognition is therefore as follows:

We may consider extra passes of MLLR and cross-adaptation if necessary.

SpeechRecognizer (last edited 2010-02-22 16:46:54 by 204)