Revision as of 15:33, 22 September 2009

Fourier Analysis and the Speech Spectrogram

Background Information

The Fourier Transform is often introduced to students as a tool for evaluating both continuous- and discrete-time signals in the frequency domain. It is first shown that periodic signals can be expressed as sums of harmonically related complex exponentials of different frequencies. Then, the Fourier Series representation of a signal is developed to determine the magnitude of each frequency component's contribution to the original signal. Finally, the Fourier Transform is developed to express these contributions as a function of frequency, which extends the idea to signals that are not periodic. For the discrete-time case, the analysis equation is expressed as follows:

$ X(e^{j\omega}) = \sum_{n=-\infty}^\infty x[n] e^{-j\omega n} $

This expression yields what is commonly referred to as the "spectrum" of the original discrete-time signal, x[n]. To see why the transform deserves that name, consider a simple example, stated here in continuous time so that its frequencies can be read directly in Hertz:

$ x(t) = \cos(2 \pi \cdot 10 t) $

Applying the continuous-time counterpart of the analysis equation, $ X(j\omega) = \int_{-\infty}^\infty x(t) e^{-j\omega t} dt $, yields the following Fourier Transform of the signal:

$ X(j\omega) = \pi \delta (\omega - 20\pi) + \pi \delta (\omega + 20\pi) $

Intuitively, this expression shows that a cosine function is the sum of two complex exponentials with frequencies of +10 and -10 Hertz (that is, $ \omega = \pm 20\pi $ radians per second). Indeed, Euler's Formula provides a way to rewrite the cosine function in this form.
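
As a worked step, assuming the 10 Hertz cosine is treated as a continuous-time signal with angular frequency $ 20\pi $ radians per second, Euler's Formula gives the split directly:

$ \cos(2 \pi \cdot 10 t) = \frac{1}{2} e^{j 20\pi t} + \frac{1}{2} e^{-j 20\pi t} $

Each complex exponential $ e^{j\omega_0 t} $ has the Fourier Transform $ 2\pi \delta(\omega - \omega_0) $, so the two terms contribute $ \pi \delta(\omega - 20\pi) $ and $ \pi \delta(\omega + 20\pi) $ respectively.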

Applying the Fourier Transform

Now that the frequency-domain representation of a signal has been unlocked, what's next? In modern practice, Fourier analysis touches many industries and sciences: image and audio processing, communications, GPS and radar tracking, seismology, and medical imaging, to name a few. One area of research that has benefited from frequency-domain analysis is the study of human speech.

Linguistics researchers have sought for years to characterize the sounds humans utter to communicate. The presently accepted "fundamental unit" of speech is the phoneme; every meaningful sound is composed of one or more of these units, and each human language uses its own subset of the possible phonemes. It is the aim of phonetics researchers to develop methods for uniquely identifying phonemes in speech. Additionally, digital speech recognition depends on the ability to uniquely identify single words or phrases; time- and frequency-domain analysis of speech signals are the tools of choice for researchers in the field.

Thanks to the advent of computing in the realm of scientific research, scientists have many tools available to analyze speech digitally. However, looking at the time-domain representation of speech signals can be frustrating; the waveform for a single word or phrase can vary significantly between utterances, even when spoken by the same speaker. Consider, for example, the multitude of ways the word "candidate" can be pronounced. Since researchers generally desire a consistent relationship between input and output, a speech waveform must be transformed into a representation that varies less from one utterance to the next.

Thankfully, great strides have been made in making Fourier analysis practical using computers. In particular, the Fast Fourier Transform (FFT) algorithm provides an efficient, easy-to-use way of transforming a finite-length sampled signal into its frequency-domain representation. Applying this approach to speech waveforms reveals a tremendous amount of information.
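
To make the FFT step concrete, here is a minimal sketch using NumPy. It samples a 10 Hertz cosine and locates the largest peak in its spectrum. The sampling rate of 100 Hertz, the one-second duration, and the helper name `dominant_frequency` are assumptions chosen for illustration; they do not come from the discussion above.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest-magnitude FFT bin of a real signal."""
    spectrum = np.fft.rfft(signal)  # one-sided FFT for real-valued input
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # bin centers in Hz
    return freqs[np.argmax(np.abs(spectrum))]

# Assumed values for illustration: a 10 Hz cosine sampled at 100 Hz for one second.
sample_rate = 100.0
n = np.arange(int(sample_rate * 1.0))  # sample indices 0..99
x = np.cos(2 * np.pi * 10 * n / sample_rate)

peak_hz = dominant_frequency(x, sample_rate)
print(peak_hz)
```

Because the signal spans a whole number of periods, the energy lands in a single bin; with other durations the peak would spread across neighboring bins.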

The Speech Spectrogram

Human speech, along with most sound waveforms, is composed of many frequency components; the human ear is capable of detecting frequencies between 20 Hz and 20,000 Hz, although most linguistic information seems to be "concentrated" below 8 kHz, according to many researchers. Using this information, it's possible to design a graphical device to represent a speech waveform. To be functional, it needs to show the magnitudes of the frequency components and to display them against time (since the frequency composition of speech changes with time). As luck would have it, such a graphical device is available: the spectrogram.

A speech spectrogram shows how the Fourier Transform of a signal varies with time, with the transform computed over short, successive segments of the waveform. The magnitude of each frequency component is generally represented either by color (along a set color scale) or by shades of gray in a grayscale plot.
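
One common way to compute such a picture is to take the FFT of short, overlapping, windowed frames of the signal. The sketch below does this with NumPy only. The frame length, hop size, Hann window, and the two-tone test signal are all assumptions made for illustration, and the function name `spectrogram` is hypothetical; real speech tools make their own choices for these parameters.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Magnitude spectrogram from FFTs of Hann-windowed, overlapping frames.

    Returns (times, freqs, mags), where mags[i, k] is the magnitude of
    frequency bin k in frame i.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per frame
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)  # bin centers in Hz
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate  # frame centers in s
    return times, freqs, mags

# Hypothetical test signal: 500 Hz for half a second, then 2000 Hz, sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

times, freqs, mags = spectrogram(x, sample_rate)
peak_per_frame = freqs[np.argmax(mags, axis=1)]  # strongest frequency in each frame
print(peak_per_frame[0], peak_per_frame[-1])
```

Plotting `mags` with time on one axis and frequency on the other, using color or gray level for magnitude, gives the kind of display described above. Shorter frames track changes in time more sharply at the cost of coarser frequency resolution.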
