TextComposter: Turning Crap into Gold

At least one component of this system has to have a cute name, and TextGrinder was already taken (as was, *ahem*, Distiller).

The point of TextComposter is to take in the various resources associated with a talk, such as:

And turn them into "fertilizer" for the SpeechRecognizer. Specifically, we need to get:

From these we will build a talk-specific PronunciationDictionary and LanguageModel, which will be mixed in with the universal background model and dictionary to improve recognition.
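One way to picture the "mixing in" step is linear interpolation of word probabilities. The sketch below is an assumption, not a description of the actual implementation: it uses unigram models only, and the names `unigram_probs`, `interpolate`, and the weight `lam` are made up for illustration.

```python
from collections import Counter


def unigram_probs(tokens):
    """Estimate unigram probabilities from a list of word tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(talk_model, background_model, lam=0.5):
    """Mix two unigram models: lam * talk + (1 - lam) * background.

    `lam` is a hypothetical mixing weight; a real system would tune it.
    """
    vocab = set(talk_model) | set(background_model)
    return {
        word: lam * talk_model.get(word, 0.0)
        + (1.0 - lam) * background_model.get(word, 0.0)
        for word in vocab
    }


if __name__ == "__main__":
    talk = unigram_probs(["beamer", "slides", "beamer", "talk"])
    background = {"talk": 0.5, "the": 0.5}
    for word, prob in sorted(interpolate(talk, background).items()):
        print(word, prob)
```

Words that only appear in the talk material (here "beamer") get nonzero probability in the mixed model, which is the whole point of the exercise.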

Text Conversion

In general, pdftotext does a pretty good job of converting PDF slides to plain text, since people who make their slides in PDF probably used some kind of sensible tool like beamer, rst, or whatever.

Having the source text is not necessarily a good thing here, since it means we have way more formats to deal with, and we do not care about formatting or any of that.

So whenever possible we should ask for PDF, HTML, or plain text.
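For the PDF case, a minimal sketch of calling pdftotext from Python might look like this. It assumes the pdftotext binary is on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are hypothetical. The `-enc` and `-layout` options and the `-` output argument (write to stdout) are standard pdftotext usage.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for pdftotext, writing text to stdout ("-")."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # Keep the physical layout; usually not wanted, since we ignore formatting.
        cmd.append("-layout")
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `pdf_to_text("slides.pdf")` would then hand back the raw slide text for the later stages.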

Tokenization

We don't need to do anything too fancy here beyond splitting out punctuation and recognizing a few specific classes of tokens which may be useful in segmentation and classification (see below). Specifically:

We'll use PLY to write the lexer, because the less we have to deal with Festival, the better.
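To give a rough idea of the kind of lexer meant here, the sketch below uses only the standard library `re` module as a stand-in, so it runs without PLY installed. The same regular expressions could be moved into PLY token rules. The token classes (URL, NUMBER, WORD, PUNCT) are illustrative guesses, not the actual list.

```python
import re

# Hypothetical token classes; order matters, since earlier patterns win.
TOKEN_SPEC = [
    ("URL", r"https?://[^\s)]+"),
    ("NUMBER", r"\$?\d+(?:[.,]\d+)*"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)*"),
    ("PUNCT", r"[^\w\s]"),
]

MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return (token_class, text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


if __name__ == "__main__":
    for token in tokenize("We spent $2.3 million in 1995."):
        print(token)
```

Keeping the token class alongside the text is what lets later stages treat numbers, URLs, and ordinary words differently.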

Text Segmentation

The input text is going to be noisy - i.e. there are a lot of things in it that won't have any connection to the actual speech in the lecture. For example:

Essentially we need to separate linguistic content from non-linguistic content.
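As a very crude illustration of that separation, the sketch below labels a line as linguistic when most of its whitespace-separated tokens look like ordinary words. The heuristic, the 0.6 threshold, and the function names are all assumptions made for this example. A real segmenter would presumably use the token classes from the lexer and something smarter than a fixed cutoff.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def linguistic_ratio(line):
    """Fraction of whitespace-separated tokens that look like plain words."""
    tokens = line.split()
    if not tokens:
        return 0.0
    wordlike = sum(
        1 for tok in tokens if WORD_RE.fullmatch(tok.strip(".,;:!?\"()"))
    )
    return wordlike / len(tokens)


def split_content(lines, threshold=0.6):
    """Partition lines into (linguistic, non_linguistic) by a fixed threshold."""
    linguistic, other = [], []
    for line in lines:
        target = linguistic if linguistic_ratio(line) >= threshold else other
        target.append(line)
    return linguistic, other
```

For instance, a sentence from a slide would land in the first list, while a line of source code pasted into the slide would land in the second.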

Normalization and Lexicalization

Once we have extracted the linguistic content, we need to convert it to a stream of words, which is exactly what a text-to-speech system does. This is known as normalization because the end result is a normalized form of the text in which all words are explicitly spelled out. So for example:

Unnormalized: We spent $2.3 million in 1995.

Normalized: we spent two point three million dollars in nineteen ninety five
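The example above can be reproduced with a small rule-based sketch. It only covers the two patterns in that sentence (a dollar amount with a scale word, and a four-digit year from 1100 to 1999). The rules and function names are assumptions for illustration, and a real normalizer needs far more cases.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

MONEY_RE = re.compile(r"\$(\d{1,2})(?:\.(\d+))?\s+(thousand|million|billion)")
YEAR_RE = re.compile(r"\b(1[1-9])(\d\d)\b")


def two_digits(n):
    """Spell out an integer in the range 0..99 without hyphens."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def money_to_words(match):
    """'$2.3 million' -> 'two point three million dollars'."""
    whole, frac, scale = match.groups()
    words = two_digits(int(whole))
    if frac:
        # Digits after the decimal point are read one at a time.
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return f"{words} {scale} dollars"


def year_to_words(match):
    """'1995' -> 'nineteen ninety five'."""
    high, low = int(match.group(1)), int(match.group(2))
    if low == 0:
        tail = "hundred"
    elif low < 10:
        tail = "oh " + ONES[low]
    else:
        tail = two_digits(low)
    return f"{two_digits(high)} {tail}"


def normalize(text):
    """Expand money and years, then lowercase and drop punctuation."""
    text = MONEY_RE.sub(money_to_words, text)
    text = YEAR_RE.sub(year_to_words, text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("We spent $2.3 million in 1995."))
```

Note that the money rule has to run before punctuation is stripped, otherwise the "$" and the decimal point are gone by the time we look for them.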

Language Model Training

TextComposter (last edited 2010-02-24 18:21:18 by 128)