TextComposter: Turning Crap into Gold
At least one component of this system has to have a cute name, and TextGrinder was already taken (as was, *ahem*, Distiller).
The point of TextComposter is to take in the various resources associated with a talk, such as:
- Presentation Slides
- Handouts
- Web Pages - author's development webpages, potentially crawling to related documentation
- Package Documentation
And turn them into "fertilizer" for the SpeechRecognizer. Specifically, we need to get:
- Vocabulary words, e.g. technical terms, names (NamedEntities)
- Text fragments - sentences, bullet points, whatever
From these we will build a talk-specific PronunciationDictionary and LanguageModel, which will be mixed in with the universal background model and dictionary to improve recognition.
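How the talk-specific model gets "mixed in" is not settled here. One common option is linear interpolation, sketched below for unigram models only. The function names and the talk_weight parameter are illustrative assumptions, not part of any existing component.

```python
from collections import Counter

def unigram_model(words):
    """Maximum-likelihood unigram probabilities from a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(talk_lm, background_lm, talk_weight=0.3):
    """Mix two unigram models:
    P(w) = talk_weight * P_talk(w) + (1 - talk_weight) * P_background(w)
    talk_weight is a hypothetical tuning knob, not a recommended value.
    """
    vocab = set(talk_lm) | set(background_lm)
    return {
        w: talk_weight * talk_lm.get(w, 0.0)
           + (1.0 - talk_weight) * background_lm.get(w, 0.0)
        for w in vocab
    }

# Toy data standing in for the background corpus and the composted talk text.
background = unigram_model("the talk is about the code".split())
talk = unigram_model("the decorator wraps the generator".split())
mixed = interpolate(talk, background, talk_weight=0.5)
```

Words that appear only in the talk material (here "decorator") end up with non-zero probability in the mixed model, which is the whole point of the exercise.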
Text Conversion
In general, pdftotext does a pretty good job of extracting the text from slides, since people who make their slides in PDF probably used some kind of sensible tool like beamer, rst, or whatever.
Having the source files (LaTeX, reStructuredText, and so on) is not necessarily a good thing here, since it means we have way more formats to deal with, and we do not care about formatting or any of that.
So whenever possible we should ask for PDF, HTML, or plain text.
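A minimal sketch of the PDF case, assuming the pdftotext command-line tool is installed and on the PATH. The helper names pdftotext_command and pdf_to_text are made up for illustration.

```python
import subprocess

def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext command line; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd

def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        check=True,           # raise CalledProcessError if pdftotext fails
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdf_to_text("slides.pdf") would return the slide text as one string, ready for tokenization. HTML and plain text would need their own, simpler, converters.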
Tokenization
We don't need to do anything too fancy here beyond splitting out punctuation and recognizing a few specific classes of tokens which may be useful in segmentation and classification (see below). Specifically:
- Punctuation
- Delimiters of various sorts, e.g. () {} <>
- Python interactive shell prompt (>>>)
- URLs
- E-mail addresses
We'll use PLY to write the lexer, because the less we have to deal with Festival, the better.
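To make the token classes concrete, here is a rough sketch that uses the standard library re module as a stand-in for PLY, so it runs without extra dependencies. The token names and regular expressions are assumptions for illustration; a PLY lexer would declare similar patterns as token rules.

```python
import re

# Order matters: more specific classes are tried before WORD and PUNCT.
TOKEN_SPEC = [
    ("URL",    r"https?://[^\s<>()]*[^\s<>().,;:!?]"),  # no trailing punctuation
    ("EMAIL",  r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("PROMPT", r">>>"),                                 # Python interactive prompt
    ("DELIM",  r"[(){}<>\[\]]"),
    ("WORD",   r"\w+(?:['-]\w+)*"),
    ("PUNCT",  r"[^\w\s]"),                             # any other symbol
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return a list of (token_class, token_text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

For example, tokenize(">>> print(x)") yields a PROMPT token followed by WORD and DELIM tokens, which is the kind of signal the segmentation step can use to spot code samples.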
Text Segmentation
The input text is going to be noisy - i.e. there are a lot of things in it that won't have any connection to the actual speech in the lecture. For example:
- Equations
- Code samples
- Tables of numbers
- Page numbers
- Tables of contents
Essentially we need to separate linguistic content from non-linguistic content.
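No particular segmentation method is chosen here. As a starting point, a crude per-line heuristic can already filter out a lot of the noise: keep a line only if most of its whitespace-separated tokens look like ordinary words. The thresholds and function names below are guesses for illustration, not tuned values.

```python
import re

# A "word-like" token: letters (with optional apostrophes or hyphens),
# optionally followed by one piece of sentence punctuation.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'-]*[.,;:!?]?")

def looks_linguistic(line, min_words=2, min_word_ratio=0.7):
    """Heuristic: True if the line has enough tokens and most are word-like."""
    tokens = line.split()
    if len(tokens) < min_words:
        return False  # lone page numbers, stray symbols, empty lines
    wordlike = [t for t in tokens if WORDLIKE.fullmatch(t)]
    return len(wordlike) / len(tokens) >= min_word_ratio

def split_content(lines):
    """Partition lines into (linguistic, non_linguistic) lists."""
    linguistic, other = [], []
    for line in lines:
        (linguistic if looks_linguistic(line) else other).append(line)
    return linguistic, other
```

This rejects code, equations, tables of numbers, and bare page numbers reasonably well, but it will happily accept a table of contents, which needs structural cues (for instance, detection across neighbouring lines) rather than a per-line test.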
Normalization and Lexicalization
Once we have extracted the linguistic content, we need to convert it to a stream of words. This is exactly what a text-to-speech system does. It is known as normalization because the end result is a normalized form of the text in which all words are explicitly spelled out. So for example:
Unnormalized: We spent $2.3 million in 1995.
Normalized: we spent two point three million dollars in nineteen ninety five
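The example above can be reproduced with a small rule-based normalizer. This is only a sketch: it handles dollar amounts, plain numbers up to 999999, and years, and it assumes that any bare integer from 1100 to 1999 is a year. All function names are made up for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def integer(n):
    """Spell out 0..999999."""
    if n < 100:
        return two_digits(n)
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + two_digits(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return integer(thousands) + " thousand" + (" " + integer(rest) if rest else "")

def decimal(text):
    """Spell out '2.3' as 'two point three' (digits after the point one by one)."""
    whole, _, frac = text.partition(".")
    words = integer(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words

def year(n):
    """Read a year as two pairs of digits, e.g. 1995 -> nineteen ninety five."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

MONEY = re.compile(r"\$(\d+(?:\.\d+)?)(?:\s+(thousand|million|billion))?")
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def normalize(text):
    """Expand money and numbers into words, drop punctuation, lowercase."""
    def money(m):
        scale = " " + m.group(2) if m.group(2) else ""
        unit = "dollar" if m.group(1) == "1" and not m.group(2) else "dollars"
        return decimal(m.group(1)) + scale + " " + unit

    def number(m):
        token = m.group()
        # Assumption: bare integers in this range are years.
        if token.isdigit() and 1100 <= int(token) <= 1999:
            return year(int(token))
        return decimal(token)

    text = MONEY.sub(money, text)          # money first, so "$" is consumed
    text = NUMBER.sub(number, text)
    text = re.sub(r"[^\w\s]", " ", text)   # remove remaining punctuation
    return " ".join(text.lower().split())
```

Calling normalize("We spent $2.3 million in 1995.") gives the normalized sentence shown above. A real normalizer needs many more classes (dates, times, abbreviations, ordinals) and some context to decide, for instance, whether 1995 is a year or a quantity.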
