Automatic Speech Recognition
I am building a tool for automatic transcription of recorded talks. A description of the system will appear here over time.
Random note #5: Make sure to log IRC and Twitter for every talk next time! Yikes! (also, do this for my thesis defense)
Random note: It would be very nice if we had slide timings from dvswitch - it ought to log input switches.
Random note #2: FLAC works very well for storing audio tracks. To split the audio out of a DV file, run:
ffmpeg -i foo.dv -ac 1 -ar 16000 foo.flac
The bitrate is quite impressive: under 100 kbit/s. Of course, Speex-WB needs only about 32 kbit/s.
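For a whole batch of recordings, the ffmpeg command from note #2 could be wrapped in a small script. This is only a sketch: the names `flac_command` and `convert_all` and the idea of one directory full of .dv files are made up for illustration. The ffmpeg flags are exactly the ones from the command in note #2.

```python
import subprocess
from pathlib import Path


def flac_command(dv_path):
    """Build the ffmpeg argument list for one DV file (mono, 16 kHz, FLAC)."""
    dv_path = Path(dv_path)
    out_path = dv_path.with_suffix(".flac")  # foo.dv -> foo.flac
    return ["ffmpeg", "-i", str(dv_path),
            "-ac", "1", "-ar", "16000", str(out_path)]


def convert_all(directory):
    """Run ffmpeg on every .dv file in a directory (hypothetical layout)."""
    for dv_path in sorted(Path(directory).glob("*.dv")):
        # check=True raises CalledProcessError if ffmpeg fails on a file
        subprocess.run(flac_command(dv_path), check=True)
```

Calling `convert_all("talks")` would then leave a .flac file next to each .dv file in a `talks` directory.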
Random note #3 (from Paul McNelt): It would be really excellent if the online video came with an editable transcription track. The idea is cool enough that it deserves its own Wiki page: CrowdSourcingTransciption
Random note #4: In the future we need to make a record of which microphone was used for each session.
Workflow
Components
The transcription system consists of three main components:
TextComposter - extracts seed text from lecture-related materials, such as presentation slides, outlines, web sites, and manuscripts, and produces lecture-specific language models and dictionaries.
SpeechSegmenter - breaks audio into segments of speech from individual speakers.
SpeechRecognizer - converts segments of speech into text, using the seed language model data output by TextComposter.
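The data flow between the three components can be sketched as follows. None of this is the real interface of TextComposter, SpeechSegmenter, or SpeechRecognizer: the `Segment` and `LectureModel` types, the `transcribe_talk` function, and the three callables passed into it are all hypothetical stand-ins, used only to show which component feeds which.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of speech from a single speaker (hypothetical type)."""
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float


@dataclass
class LectureModel:
    """Lecture-specific language model and dictionary (hypothetical type)."""
    lm_path: str
    dict_path: str


def transcribe_talk(audio_path, materials, compost, segment, recognize):
    """Run the three stages in order and return (segment, text) pairs.

    compost   stands in for TextComposter:    materials -> LectureModel
    segment   stands in for SpeechSegmenter:  audio -> list of Segment
    recognize stands in for SpeechRecognizer: (audio, Segment, model) -> text
    """
    model = compost(materials)
    results = []
    for seg in segment(audio_path):
        results.append((seg, recognize(audio_path, seg, model)))
    return results
```

Passing the stages in as callables keeps the sketch independent of how each component is actually invoked (Perl, Java, or Python).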
Manual Transcription
We may be manually transcribing some segments of the data. This will be done by correcting automatic transcriptions. Since the talks contain large numbers of technical terms which may be unfamiliar to transcribers, we need to make a special effort to correctly transcribe these terms, and flag them as potentially unfamiliar. This could probably be done by simply tuning the language model interpolation weights.
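As a rough illustration of what tuning the interpolation weights means, here is a minimal sketch of linear interpolation between a lecture-specific and a general language model, reduced to unigram probabilities for brevity. The function names, the dictionary representation, and the threshold are assumptions made for the example, not part of any of the toolkits used here. The second function shows one possible way to flag terms as potentially unfamiliar: by their low probability in the general model.

```python
def interpolate(p_lecture, p_general, weight):
    """Linearly interpolate two unigram models (dicts of word -> probability).

    weight is the share given to the lecture-specific model; raising it
    makes technical terms from the seed text more likely to be recognized.
    """
    vocab = set(p_lecture) | set(p_general)
    return {
        w: weight * p_lecture.get(w, 0.0)
           + (1.0 - weight) * p_general.get(w, 0.0)
        for w in vocab
    }


def unfamiliar_terms(words, p_general, threshold=1e-6):
    """Return words that are rare or absent in the general model.

    The threshold is an arbitrary illustrative value.
    """
    return [w for w in words if p_general.get(w, 0.0) < threshold]
```

For example, with `p_lecture = {"senone": 0.5, "the": 0.5}` and `p_general = {"the": 1.0}`, a weight of 0.5 gives "senone" a probability of 0.25, and `unfamiliar_terms(["the", "senone"], p_general)` flags only "senone".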
External Software
External software components we will be using:
CMU Sphinx - speech recognition. Specifically the PocketSphinx decoder, with some auxiliary tools for acoustic model adaptation (Python modules from SphinxTrain)
CMU Language Modeling Toolkit. Includes a set of Perl modules for text normalization and language model training, which TextComposter is built on top of.
LIUM_SpkDiarization tool for speaker diarization (written in Java)
Various tools for text scraping (pdftotext, others...?)
Some improvements will need to be made to specific components:
Add Viterbi stats to PocketSphinx - allows us to simultaneously decode and accumulate occupation counts
Maybe add FLAC input support to PocketSphinx
MLLR tool - use the Python one from SphinxTrain, but add support for multi-stream and multi-class
- Multi-class MLLR - use bottom-up senone clustering code (or try it at least) - or perhaps just one MLLR per codebook since we are using PTM models
Add on-the-fly language model interpolation to PocketSphinx Python bindings (might already be there?)
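To make the "one MLLR per codebook" idea from the list above concrete, here is a minimal sketch of applying an MLLR mean transform, where each Gaussian mean mu is replaced by A * mu + b and every codebook has its own (A, b) pair. It covers only the application of a transform, not its estimation, and it is not the SphinxTrain code: the function names and the plain-list data layout are assumptions for illustration.

```python
def apply_mllr(mean, A, b):
    """Return the adapted mean A * mean + b (A is a list of rows)."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


def adapt_codebooks(codebooks, transforms):
    """Apply one transform per codebook.

    codebooks:  {codebook name: list of mean vectors}
    transforms: {codebook name: (A, b)}
    """
    adapted = {}
    for name, means in codebooks.items():
        A, b = transforms[name]
        adapted[name] = [apply_mllr(m, A, b) for m in means]
    return adapted
```

A multi-class variant would use the same `apply_mllr` step but look up the transform by regression class instead of by codebook.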
As yet unresolved issues:
- Running on the CMU speech cluster - is there a Python multiprocessing module that interoperates with PBS/TORQUE?
