2008-09-08

Update on Voice Recognition Program

Today I’m going to give you an update on my progress with the voice recognition software – where I am now, how it has gone, and the lessons I’ve learned along the way. I think it would be a good idea to do this every week or so; that way I can record each lesson as it happens, before I forget it. I hope that I’ll be able to use these notes in the future, and that anybody reading this who is interested in voice recognition software will be able to make use of them too.

I’ve been thinking in particular about what I’m going to do if I get a new computer. It will be a Macintosh, and that means I have a choice: I can either run Windows XP in virtualization, or I can buy MacSpeech (voice recognition for the Macintosh built on a licensed version of Dragon NaturallySpeaking, or at least its engine). Either way, I probably won’t be able to use the user profile I’ve spent all this time perfecting in my current installation. The standard edition of Dragon NaturallySpeaking doesn’t allow you to export or import user profiles.

So, does that mean I have to start all over again from scratch?

The answer is: probably not. What I’ve realized in the couple of days since this occurred to me is that half of the training is done on me, not by me. I have definitely learned to speak in a different way – the way the program finds easier to parse and recognize. A lot of this is probably just the simple act of speaking more clearly, not any special way of addressing the microphone so that the program understands. Not being an especially sociable or outgoing person, alas, I’ve never spoken clearly, loudly, or in a way that is easy for listeners to understand. That probably means the voice recognition software has a harder time understanding what I say than it would with most people.

Now I think that, in fact, I have been changing more than the program has been learning. Call it maybe 60% to 40%, or even as high as 75% to 25% – but I imagine that 75% is an upper limit.

So that’s lesson number one: time spent creating a user profile and training the voice recognition software, however minimally, is not all wasted if you have to reinstall the program or move to another machine and begin a new profile. Much of that time you will have spent training yourself, and that gives you a head start on any new installation or training program.

Another thing I’ve noticed is that when I try to speak too clearly, I over-enunciate. What I notice myself doing is placing special emphasis on the final consonants of some words, especially when one word ends in a consonant and the next word begins with a consonant, and I anticipate that the program will have a hard time distinguishing between the two sounds. Since I normally slur the ends of my words, when I speak too clearly in this way the program thinks I have thrown in an extra syllable – or rather, that I have said an extra word.

So that’s lesson number two: when you first train the program, it’s very important to speak in your normal voice, just as you would normally speak. That’s what they tell you anyway – and it’s true. When you train the program, you’re feeding it certain sounds – phonemes, as they call them – and the program then expects you to speak the same way when you go on to do actual dictation. If you speak more clearly when you train the program than you do normally, the program is going to expect you to speak just as clearly from then on. And by the same token, if you speak more clearly after training than you did during it, the program is still going to expect you to speak just as clearly (or rather, just as unclearly) as you did when you trained it in the first place.
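I have no idea what Dragon NaturallySpeaking actually does internally, but here is a little Python sketch of the effect I mean, with invented phoneme labels and a made-up example. The point is only that an unexpected extra sound in otherwise familiar speech has to be accounted for somehow – often as an extra little word.

    # Toy illustration only; real recognizers use statistical acoustic
    # models, not simple sequence alignment. Phoneme labels are invented.
    from difflib import SequenceMatcher

    # What the program learned for "hot dog" during training,
    # with the final T slurred the way I normally say it.
    trained = ["HH", "AA", "T", "D", "AO", "G"]

    # The same words over-enunciated: releasing the T adds a stray vowel.
    spoken = ["HH", "AA", "T", "AH", "D", "AO", "G"]

    for op, i1, i2, j1, j2 in SequenceMatcher(a=trained, b=spoken).get_opcodes():
        if op == "insert":
            # The extra sound has to go somewhere, and the easiest
            # explanation is often a short extra word, like "a" or "uh".
            print("unexpected phoneme(s):", spoken[j1:j2])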

Since I have changed the way I speak in light of my understanding of how the program recognizes what I say, maybe it would be a good idea for me to go back today and train it again. It does gradually learn as you speak and dictate, whenever you correct it. It learns not only what you mean by a certain sound; it also adapts to your way of writing and speaking, so that it can tell the difference between words that sound exactly the same, according to the context in which you speak them.
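Again, I can only guess at how the program really does this, but a toy version of correction-driven learning is easy to sketch. Everything here – the table, the function names, the example homophones – is my own invention, just to show the shape of the idea:

    # A made-up model of learning from corrections: remember which
    # homophone the user preferred after a given context word.
    from collections import defaultdict

    counts = defaultdict(int)  # (previous word, candidate) -> times chosen

    def pick(previous_word, homophones):
        """Guess the homophone seen most often after this context word."""
        return max(homophones, key=lambda w: counts[(previous_word, w)])

    def correct(previous_word, right_word):
        """The user fixed the output; remember the context for next time."""
        counts[(previous_word, right_word)] += 1

    homophones = ["their", "there", "they're"]
    print(pick("over", homophones))  # untrained: an arbitrary guess
    correct("over", "there")         # one correction...
    print(pick("over", homophones))  # ...and now it prefers "there"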

Here’s another specific way of speaking that I know I have to adopt so that the program has a better chance of understanding what I say. Since the program operates with a certain degree of what looks like artificial intelligence, context is very important. The program recognizes what you say not only as individual words strung together, but also as phrases. So it helps if you speak each sub-phrase of every sentence as its own small group of words, with appropriate pauses to mark its beginning and its end.

There we have lesson number three: you must learn to speak in full phrases.
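To see why phrases help, consider the old joke that “recognize speech” sounds almost exactly like “wreck a nice beach.” Dictated one word at a time, the program has only the sounds to go on; spoken as a connected phrase, the candidates can be compared as wholes. Here is a toy Python sketch – the counts are numbers I made up, and a real language model is vastly larger:

    # Made-up bigram counts standing in for a real language model.
    bigram = {
        ("recognize", "speech"): 50,
        ("wreck", "a"): 2,
        ("a", "nice"): 8,
        ("nice", "beach"): 2,
    }

    def phrase_score(words):
        """Score a candidate transcription by chaining its word pairs."""
        score = 1
        for pair in zip(words, words[1:]):
            score *= bigram.get(pair, 1)  # unseen pairs add no support
        return score

    # Two transcriptions that sound nearly the same when spoken:
    candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
    print(" ".join(max(candidates, key=phrase_score)))  # "recognize speech" (50 vs 32)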

An interesting side note to all this is how aware I have become of good diction in other people. Today, for example, I watched the DVD of the 1936 Bette Davis movie Satan Met a Lady – the second film version of Dashiell Hammett’s The Maltese Falcon. (I had to type out Hammett’s name by hand – another lesson you learn with this program: some words and phrases the program doesn’t already know just aren’t worth teaching it, and you save a lot of time by simply typing them in.) It stars Bette Davis and Warren William, and William’s diction was amazing. It’s very clear that he had stage training and was an accomplished speaker on the stage. In fact, he rather overdid it – his performance was a little hammy.

(Composed by dictation Monday 8 September 2008.)