Building language models from e-mail
Nothing here is really new, but it is a useful tool for anyone wondering why Sphinx gives them poor results when they try to use it for dictation. All we really need to do is:
- Iterate over a mailbox (in whatever format ... Python has modules for this)
- Extract full names from the address headers
- Extract the subject header
- Extract any plain-text body text, discarding as much junk as possible, for example:
  - Headers from quoted messages posted in-line (Outlook brain damage)
  - Diffs and other stuff that looks like source code (we need a good heuristic for this)
  - Long strings of punctuation, like ######
  - Other stuff that gets pasted in-line and is clearly not human language
- Feed this stuff to the flite text-processing front end to get sentences and words
- Train a language model
- Get pronunciations for these words
  - flite makes everything lower-case, which is actually okay, but we need to be aware of this when running ngram_pronounce
  - flite also uses various symbols as words, which may or may not be compatible with ngram_pronounce (probably not, in fact)
  - We also need to make sure to clean up the difference between a and _a
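The extraction steps above can be sketched with nothing but the Python standard library. This is a minimal sketch, not the attached scraper: it assumes an mbox-format mailbox, and the function names (`full_names`, `plain_text`, `looks_like_junk`, `scrape`) and the three regular expressions are made up for illustration. The junk patterns are deliberately crude and will throw away some real text.

```python
import email.utils
import mailbox
import re

# Hypothetical heuristics; none of these patterns come from the attached scraper.
QUOTED_HEADER = re.compile(r"^\s*>?\s*(From|Sent|To|Cc|Subject|Date):\s", re.I)
DIFF_LINE = re.compile(r"^(diff |Index: |--- |\+\+\+ |@@ )")
PUNCT_RUN = re.compile(r"[^\w\s]{5,}")  # e.g. ###### or ----------


def looks_like_junk(line):
    """Guess whether a body line is a quoted header, diff marker, or punctuation run."""
    return bool(
        QUOTED_HEADER.match(line) or DIFF_LINE.match(line) or PUNCT_RUN.search(line)
    )


def full_names(msg):
    """Collect the display names (not the addresses) from the address headers."""
    headers = []
    for name in ("from", "to", "cc"):
        headers.extend(msg.get_all(name, []))
    return [real for real, _addr in email.utils.getaddresses(headers) if real]


def plain_text(msg):
    """Yield the decoded text of every text/plain part of a message."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the message
            yield payload.decode("utf-8", errors="replace")


def scrape(path):
    """Yield (names, subject, kept_lines) for each message in an mbox file."""
    for msg in mailbox.mbox(path, create=False):
        lines = [
            line
            for text in plain_text(msg)
            for line in text.splitlines()
            if line.strip() and not looks_like_junk(line)
        ]
        yield full_names(msg), msg.get("subject", ""), lines
```

Other mailbox formats (Maildir, MH and so on) should only need a different `mailbox` class in `scrape`; the rest works on ordinary message objects.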
One thing I've been curious about is that commercial dictation systems usually come with a tool like this, but it's not clear how exactly they use the results - I would guess that they do some kind of LM adaptation but I don't know what...
I've decided to do myself a favor and implement the Flite part of this by creating a [:PyFlite:Python interface to Flite]. It should be possible to compile it against any version of Flite, although it helps to build Flite with shared libraries enabled (on x86_64 this is required). The current iteration of the e-mail scraper consists of two programs:
- attachment:emailscraper.py - this program extracts paragraphs of text from e-mail and dumps them to standard output
- attachment:text2words.py - this program reads paragraphs of text from standard input, calls Flite to normalize them, then prints them out in a format suitable for language modeling, and optionally outputs a pronunciation dictionary based on Flite's pronunciations of the words.
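To give a rough idea of what the second stage does, here is a sketch of a paragraph-to-sentences filter. It does not call Flite at all: a pair of regular expressions stands in for Flite's text normalization, so numbers, abbreviations and symbols are simply dropped rather than expanded. The output format (one lower-cased sentence per line, wrapped in `<s>` and `</s>`) is an assumption about what "suitable for language modeling" means, and the function names are hypothetical.

```python
import re
import sys

# Crude stand-ins for Flite's tokenizer; assumptions for illustration only.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def paragraphs(lines):
    """Group an iterable of lines into paragraphs separated by blank lines."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line.strip())
        elif block:
            yield " ".join(block)
            block = []
    if block:
        yield " ".join(block)


def to_lm_lines(paragraph):
    """Split a paragraph into sentences and emit one '<s> ... </s>' line each."""
    for sentence in SENTENCE_END.split(paragraph):
        words = WORD.findall(sentence.lower())
        if words:
            yield "<s> " + " ".join(words) + " </s>"


def main(stream=sys.stdin, out=sys.stdout):
    """Read paragraphs from stream and write language-model training lines to out."""
    for paragraph in paragraphs(stream):
        for line in to_lm_lines(paragraph):
            out.write(line + "\n")
```

Calling `main()` at the bottom of a script turns this into a filter that can sit on the receiving end of a pipe from the scraper.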
Interesting future ideas
Wait a minute, we're language technologists here. We don't need to use ad-hoc rules to determine which parts of e-mail are English text and which are source code and random junk. This can be thought of as a simple classification problem. Or, more interestingly, we can do something Yarowsky-like to use a small selection of e-mails to do semi-supervised training of a model for the rest of them.
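As a starting point for the classification view, here is a tiny naive Bayes line classifier over coarse token shapes. Everything in it is an assumption made for illustration: the feature set, the `text`/`junk` labels and the five hand-labelled training lines. It covers only the supervised part. A Yarowsky-style bootstrap would repeatedly label the unlabelled lines, add the most confident ones to the training set, and retrain.

```python
import math
from collections import Counter

FEATURES = ("WORD", "NUM", "PUNCT", "MIXED", "EMPTY")


def features(line):
    """Map each whitespace-separated token to a coarse shape class."""
    feats = []
    for tok in line.split():
        core = tok.strip(".,;:!?\"'")  # ignore ordinary sentence punctuation
        if core.isalpha():
            feats.append("WORD")
        elif core.isdigit():
            feats.append("NUM")
        elif not any(c.isalnum() for c in tok):
            feats.append("PUNCT")
        else:
            feats.append("MIXED")
    return feats or ["EMPTY"]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over FEATURES."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = {}

    def train(self, labelled_lines):
        for line, label in labelled_lines:
            self.class_counts[label] += 1
            self.feat_counts.setdefault(label, Counter()).update(features(line))

    def classify(self, line):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(FEATURES)
            score = math.log(n / total)
            for feat in features(line):
                score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Hand-labelled toy seed data (hypothetical).
SEED = [
    ("Thanks for the patch, I will look at it tomorrow.", "text"),
    ("Can we meet on Friday to discuss the results?", "text"),
    ("@@ -12,7 +12,9 @@ int main(int argc, char **argv)", "junk"),
    ("for (i = 0; i < n; ++i) { x[i] += y[i]; }", "junk"),
    ("##########", "junk"),
]

classifier = NaiveBayes()
classifier.train(SEED)
```

With this seed data, ordinary sentences score far higher under `text` than under `junk`, and lines full of operators and brackets go the other way. Real mail would need many more labelled lines and probably richer features than token shape.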