Digital Speech Generation

Background

  • Digital Speech Generation, as the name suggests, is the process of making a computer "speak" a sequence of letters/words in a meaningful way.
  • For English, this is especially hard, because English spelling is a very un-mathematical system.
  • What I mean by un-mathematical is that the "a" in "apple" is not pronounced the same way as the "a" in "hate". In other words, the same letter can sound different in different words.
  • Consonants are fairly consistent from word to word, but vowels are not, by a long shot.
  • Hindi, for example, is a little more mathematical. The "आ" in "आप" (the respectful form of "you") sounds the same as the "आ" in "आरंभ" ("start"), and this is true for all uses of आ.
  • If English were the same way, all we would have to do would be to assign a sound to every letter and then just play it.
  • Unfortunately, it's not that simple.
  • What is done instead is to use another type of alphabet, the phonetic alphabet, which assigns a sound to every symbol in it. In doing so, this set of symbols can represent any sound in human language (or can it?). http://www.langsci.ucl.ac.uk/ipa/pulmonic.html
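To make the letter-versus-sound point concrete, here is a minimal MATLAB sketch. The tiny `lexicon` table and the phoneme labels (`ae`, `ey`, and so on) are made up for illustration; they are not IPA symbols and do not come from any real pronunciation dictionary.

```matlab
% Hypothetical word-to-phoneme table. The labels are informal, not IPA.
lexicon = containers.Map();
lexicon('apple') = 'ae p l';
lexicon('hate')  = 'h ey t';

% Split each entry into its individual phoneme labels.
appleSounds = strsplit(lexicon('apple'), ' ');
hateSounds  = strsplit(lexicon('hate'),  ' ');

% The letter "a" is the 1st letter of "apple" and the 2nd letter of "hate",
% but the table maps it to a different sound in each word.
aInApple = appleSounds{1};
aInHate  = hateSounds{2};
sameSound = strcmp(aInApple, aInHate);

fprintf('"a" in apple -> %s, "a" in hate -> %s, same sound: %d\n', ...
        aInApple, aInHate, sameSound);
```

A per-letter table cannot capture this, which is why the lookup has to happen at the level of phonetic symbols rather than letters.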

Motivation

  • The above technique sounds great, right? The question is, how easy is it to implement?
  • The following experiment will explore if this can indeed be done.
  • Unfortunately, IPA (the International Phonetic Association) does not give you these phonetic elements in a neat little zip file, nicely documented and ready to use.
  • What it gives you is a folder called American English, which contains recordings of a bunch of words, along with a $30 Handbook that tells you what each phoneme means.
  • However, if we ignore our desire to conform to standards for a minute, we can easily sieve out of these words the syllables that contain the phonemes we need.
  • I named my own phonemes and broke my test phrase "ECE 438" down into a sequence of them. So please don't take the names of these phonemes as standard IPA notation. For example, if I say ee, I mean the ee in eel. The phonetic code for that is $ i: $
  • Also, I am required to add this copyright information. For the record, I am allowed to use it for educational purposes. Copyright:phonetic_sounds


Experiment: Can MATLAB say ECE 438?

  • The answer to the above question is Yes, it can.
  • As I stated above, IPA gives you recordings of a bunch of words like "sky", "tie", etc. in their zip file.
  • What I did was to load these into MATLAB, cut out the part of the word containing the syllable I needed and then concatenated these syllables together to form the speech I needed.
  • The following code will clarify this method:

% %                           COPYRIGHT DHRUV LAMBA                     % %

% Note: wavread/wavwrite have been removed from recent MATLAB releases,
% so audioread/audiowrite are used here instead.

% % % % % %     BIG WORDS TO CLEAVE BITS FROM             % % % % % % % % %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

% Each call returns the samples of one recorded word
% (assumed to be mono, i.e. a single column).
bead  = audioread('Phonetic_sounds\American-English\Vowels\01-bead.wav');
sigh  = audioread('Phonetic_sounds\American-English\Consonants\12-sigh.wav');
fie   = audioread('Phonetic_sounds\American-English\Consonants\04-fie.wav');
bode  = audioread('Phonetic_sounds\American-English\Vowels\07-bode.wav');
thigh = audioread('Phonetic_sounds\American-English\Consonants\10-thigh.wav');
kite  = audioread('Phonetic_sounds\American-English\Consonants\16-kite.wav');
bade  = audioread('Phonetic_sounds\American-English\Vowels\03-bayed.wav');
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
% % % % % % % % % %           WORD BITS               % % % % % % % % % % %

% Sample ranges were found by trial and error.
pause = zeros(1000,1);                   % 1000 samples of silence (this name shadows the built-in pause)
ee    = bead(3000:5000);   clear bead;   % "ee" from the middle of "bead"
ss    = sigh(1:4000);      clear sigh;   % "s" from the start of "sigh"
ff    = fie(1:3500);       clear fie;    % "f" from the start of "fie"
oh    = bode(4000:7000);   clear bode;   % "oh" from the middle of "bode"
thin  = thigh(1:4000);     clear thigh;  % "th" from the start of "thigh"
tee   = kite(6000:10000);  clear kite;   % "t" from the end of "kite"
ay    = bade(3000:7000);   clear bade;   % "ay" from the middle of "bayed"
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

% Join the pieces in order: E C E, four, three, eight.
total = [ee;ss;ee;pause;ee;pause;pause;ff;oh;pause;thin;tee;ee;pause;ay;tee];
fs = 15000;                              % playback rate in Hz, chosen by ear
sound(total,fs);
audiowrite('ece438.wav',total,fs);
  • As you can see above, I take a word, say "bead", and then cut out the part that contains the "ee".
  • I did this mostly by educated guessing. For example, in "bead" you know that the starting b is short and the ending d is short too, so if we cut out the middle part of the signal, we should get our ee.
  • Of course, the sample ranges had to be tweaked till it sounded good.
  • If you decide to try this yourself, you might notice that the IPA voice sounds a bit different from what you hear here. That's because I set the playback sampling rate to 15 kHz to get a nice smooth voice, and playing samples at a rate other than the one they were recorded at changes the speed and pitch.
  • In conclusion, I broke my test phrase, ECE 438, into ee;ss;ee;pause;ee;pause;pause;ff;oh;pause;thin;tee;ee;pause;ay;tee and concatenated the pieces to form the complete utterance.
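The cutting step and the effect of the playback rate can be sketched without the audio files. This is only an illustration: the noise burst stands in for a recorded word, and the 44100 Hz comparison rate is an assumption picked for contrast, not the actual rate of the IPA recordings.

```matlab
% Synthetic stand-in for a recorded word: 10000 samples in a column vector.
% (Assumption: a noise burst replaces the real recording so this runs anywhere.)
word = 0.1 * randn(10000, 1);

% Cut out the middle of the "word", the same way "ee" is taken from "bead".
firstSample = 3000;
lastSample  = 5000;
segment = word(firstSample:lastSample);

% The same samples last a different amount of time at different playback rates.
numSamples   = numel(segment);
playbackRate = 15000;    % Hz, the rate used in this experiment
otherRate    = 44100;    % Hz, assumed here for comparison only
durationAtPlayback = numSamples / playbackRate;   % seconds
durationAtOther    = numSamples / otherRate;      % seconds

fprintf('%d samples: %.3f s at %d Hz, %.3f s at %d Hz\n', ...
        numSamples, durationAtPlayback, playbackRate, durationAtOther, otherRate);
```

Because the range is inclusive at both ends, `3000:5000` yields 2001 samples, and a lower playback rate stretches those samples over a longer time.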


  • Now, granted, it sounds really weird, but I believe this is the concept behind speech generation (if I'm wrong, please ping me).
  • The idea is to have all the possible sounds a human being can make stored as word bits in a table.
  • Then convert the input text to the phonetic alphabet using an algorithm (which hopefully has been trained).
  • After that, simply join those sounds in the right sequence to form the word.
    • AT&T Labs has a text-to-speech system that sounds much better. Here it is: http://www.research.att.com/~ttsweb/tts/demo.php. Still, for a simple MATLAB-based speaking program, I believe I did an OK job. In the demo, type E C E instead of ECE to hear it correctly.
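The three steps above (sound table, text-to-phoneme conversion, concatenation) can be sketched as follows. Everything here is a stand-in: the tones in `soundTable` are placeholders rather than speech, and the fixed `lexicon` lookup replaces the trained algorithm mentioned above. Both names are hypothetical.

```matlab
fs = 15000;                  % playback rate used in the experiment
t  = (0:1499)' / fs;         % 0.1 s time axis, column vector

% Step 1: hypothetical sound table. Each entry is a placeholder tone,
% not a real speech sound.
soundTable = containers.Map();
soundTable('ee')    = 0.5 * sin(2*pi*300*t);
soundTable('ss')    = 0.5 * sin(2*pi*600*t);
soundTable('pause') = zeros(1000, 1);

% Step 2: hypothetical text-to-phoneme step, a fixed lookup per letter.
lexicon = containers.Map();
lexicon('E') = 'ee';
lexicon('C') = 'ss ee';

% Step 3: join the sounds in order, with a pause after each letter.
letters = {'E', 'C', 'E'};
speech = [];
for k = 1:numel(letters)
    phonemes = strsplit(lexicon(letters{k}), ' ');
    for p = 1:numel(phonemes)
        speech = [speech; soundTable(phonemes{p})]; %#ok<AGROW>
    end
    speech = [speech; soundTable('pause')]; %#ok<AGROW>
end

% sound(speech, fs);   % uncomment to listen
fprintf('Built %d samples (%.2f s at %d Hz)\n', numel(speech), numel(speech)/fs, fs);
```

Swapping the placeholder tones for clipped recordings would give the same structure as the experiment above, with the lookup tables replacing the hand-written concatenation.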



Resources

  • Here are some really cool web-pages that I came across in my research.


--Dlamba 21:50, 25 October 2009 (UTC)
