Automatic Speech Recognition

I am building a tool to automatically transcribe the audio of talks. A description of the system will appear here over time.

Random note #5: Make sure to log IRC and Twitter for every talk next time! Yikes! (also, do this for my thesis defense)

Random note #1: It would be very nice if we had slide timings from dvswitch; it ought to log input switches.

Random note #2: FLAC works very well for storing audio tracks. To split the audio out of a DV file, run:

ffmpeg -i foo.dv -ac 1 -ar 16000 foo.flac

The bitrate is quite impressive: under 100 kbit/s. Of course, Speex-WB (wideband) is around 32 kbit/s, though Speex is lossy where FLAC is lossless.
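As a rough check on these figures, the sketch below computes the uncompressed PCM bitrate for the 16 kHz mono output of the ffmpeg command above, and the compression ratios implied by the bitrates quoted. The 16-bit sample depth is an assumption (the command does not set it), and the function names are purely illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int) -> float:
    """Return the uncompressed PCM bitrate in kbit/s."""
    return sample_rate_hz * channels * bits_per_sample / 1000.0


def compression_ratio(raw_kbps: float, coded_kbps: float) -> float:
    """Return how many times smaller the coded stream is than raw PCM."""
    return raw_kbps / coded_kbps


# 16 kHz mono matches "-ar 16000 -ac 1"; 16 bits per sample is an assumption.
raw = pcm_bitrate_kbps(16000, 1, 16)

print(f"raw PCM:              {raw:.0f} kbit/s")
print(f"FLAC at 100 kbit/s:   {compression_ratio(raw, 100):.2f}x smaller")
print(f"Speex-WB at 32 kbit/s: {compression_ratio(raw, 32):.1f}x smaller")
```

Under that assumption, a FLAC stream below 100 kbit/s is less than half the size of the raw audio while staying lossless.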

Random note #3 (from Paul McNelt): It would be really excellent if the online video came with an editable transcription track. The idea is cool enough to deserve its own wiki page: CrowdSourcingTransciption

Random note #4: In the future, we need to record which microphone was used for each session.
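One lightweight way to keep such a record is a small per-session metadata file written at recording time. The sketch below is only an illustration: the SessionRecord type, its field names, and the example values are all hypothetical and not part of any existing tool.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class SessionRecord:
    """Hypothetical per-session metadata; every field name is an assumption."""
    talk_id: str
    microphone: str
    recording_file: str
    chat_logs: list = field(default_factory=list)  # e.g. IRC or Twitter logs


def to_json_line(record: SessionRecord) -> str:
    """Serialize one record as a single JSON line, suitable for appending to a log."""
    return json.dumps(asdict(record), sort_keys=True)


# Example values are placeholders, not real session data.
record = SessionRecord(
    talk_id="example-talk",
    microphone="lapel mic (placeholder)",
    recording_file="foo.dv",
    chat_logs=["irc.log"],
)
print(to_json_line(record))
```

Appending one such line per session would make it easy to group recordings by microphone later.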

Workflow

workflow.png

Components

The transcription system consists of three main components:

Manual Transcription

We may manually transcribe some segments of the data by correcting automatic transcriptions. Because the talks contain many technical terms that may be unfamiliar to transcribers, we need to make a special effort to transcribe these terms correctly and to flag them as potentially unfamiliar. This could probably be done simply by tuning the language model interpolation weights.
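To make the interpolation idea concrete, here is a minimal sketch of linear interpolation between a general language model and a domain (technical) model. It is a toy under stated assumptions: the models are unigram dictionaries with made-up probabilities, and the names interpolate and domain_weight are hypothetical. A real system would interpolate n-gram models with its language modeling toolkit.

```python
def interpolate(p_general: dict, p_domain: dict, domain_weight: float) -> dict:
    """Linearly interpolate two unigram distributions.

    P(w) = (1 - domain_weight) * P_general(w) + domain_weight * P_domain(w)
    """
    if not 0.0 <= domain_weight <= 1.0:
        raise ValueError("domain_weight must be between 0 and 1")
    vocab = set(p_general) | set(p_domain)
    return {
        w: (1.0 - domain_weight) * p_general.get(w, 0.0)
        + domain_weight * p_domain.get(w, 0.0)
        for w in vocab
    }


# Toy distributions with invented numbers, for illustration only.
general = {"the": 0.6, "colonel": 0.3, "kernel": 0.1}
domain = {"the": 0.5, "colonel": 0.0, "kernel": 0.5}

for weight in (0.1, 0.5, 0.9):
    mixed = interpolate(general, domain, weight)
    print(f"weight={weight}: P(kernel)={mixed['kernel']:.2f}, "
          f"P(colonel)={mixed['colonel']:.2f}")
```

Raising the domain weight shifts probability toward the technical term ("kernel") and away from its everyday sound-alike ("colonel"), which is the effect the tuning is meant to achieve.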

External Software

External software components we will be using:

Some improvements will need to be made to specific components:

As yet unresolved issues:

SpeechRecognition (last edited 2010-02-24 21:52:47 by 128)