Fourier Analysis and the Speech Spectrogram
Background Information
The Fourier Transform is often introduced to students as a tool for analyzing both continuous- and discrete-time signals in the frequency domain. It is first shown that periodic signals can be expressed as sums of harmonically related complex exponentials. Then, the Fourier Series representation of a signal is developed to determine the magnitude and phase of each frequency component's contribution to the original signal. Finally, the Fourier Transform extends this idea to signals that are not periodic, expressing the frequency content as a function of a continuous frequency variable. For the discrete-time case, the analysis equation is expressed as follows:
$ X(e^{j\omega}) = \sum_{n=-\infty}^\infty x[n] e^{-j\omega n} $
This expression yields what is commonly referred to as the "spectrum" of the original discrete-time signal, x[n]. To demonstrate why this is the case, it is simplest to consider a continuous-time example, a cosine with a frequency of 10 Hertz:
$ x(t) = \cos(2 \pi 10 t ) $
Applying the continuous-time counterpart of the analysis equation, in which the sum over n becomes an integral over t, yields the following Fourier Transform of the signal:
$ X(j\omega) = \pi \delta (\omega - 20\pi) + \pi \delta (\omega + 20\pi) $
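Intuitively, this expression shows that a cosine function is the sum of two complex exponentials with frequencies of -10 and +10 Hertz, that is, angular frequencies of $ -20\pi $ and $ +20\pi $ radians per second. Indeed, Euler's Formula provides a way to rewrite the cosine function in this form:
$ \cos(2 \pi 10 t) = \frac{1}{2} e^{j 2 \pi 10 t} + \frac{1}{2} e^{-j 2 \pi 10 t} $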
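The two-sided spectrum of the cosine can also be checked numerically. The following is a minimal sketch under assumptions that are not part of the text above: the 10 Hertz cosine is sampled at a hypothetical rate of 100 Hertz for one second, and the analysis sum is evaluated only over those 100 samples and only at the discrete frequencies $ 2\pi k / N $ (a discrete Fourier transform). The helper name `dft` is chosen purely for illustration.

```python
import cmath
import math

def dft(x):
    """Evaluate the analysis sum at omega_k = 2*pi*k/N for k = 0..N-1."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

fs = 100  # assumed sampling rate in Hz (not from the text)
N = 100   # one second of samples, so bin k corresponds to k Hz
f0 = 10   # frequency of the cosine in Hz

# Sampled version of cos(2*pi*10*t) at t = n / fs
x = [math.cos(2 * math.pi * f0 * n / fs) for n in range(N)]
X = dft(x)

# Keep only the bins whose magnitude is well above rounding noise
peaks = [k for k in range(N) if abs(X[k]) > 1e-6]

# Bins at or above N/2 represent negative frequencies
peak_freqs_hz = sorted((k if k < N // 2 else k - N) * fs / N for k in peaks)
print(peak_freqs_hz)
```

With these assumed parameters the script prints `[-10.0, 10.0]`: all of the signal's energy sits in the two bins that correspond to the complex exponentials at -10 and +10 Hertz, each with magnitude N/2.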
Applying the Fourier Transform
Now that the frequency-domain representation of a signal has been unlocked, what's next? In modern practice, Fourier analysis touches many industries and sciences: image and audio processing, communications, GPS and RADAR tracking, seismology, and medical imaging, to name a few. One area of research that has benefited from frequency-domain analysis is the study of human speech.
Linguistics researchers have sought for years to characterize the sounds humans utter to communicate. The presently accepted "fundamental unit" of speech is the phoneme; every meaningful sound is composed of one or more of these units, and each language spoken by humans uses its own subset of the possible phonemes.
Thanks to the advent of computing in the realm of scientific research, scientists have many tools available to analyze speech digitally. However, looking at the time-domain representation of speech signals can be frustrating. For example, the waveform for a single word can vary significantly between utterances, even when spoken by the same speaker. Since researchers generally desire a consistent relationship between input and output, a speech waveform must be transformed somehow to obtain a more descriptive representation.
Thankfully, great strides have been made in making Fourier analysis practical using computers. In particular, the Fast Fourier Transform (FFT), an efficient algorithm for computing the discrete Fourier transform, provides a practical way of transforming a finite-length signal into its frequency-domain representation. Applying this approach to speech waveforms reveals a tremendous amount of information.
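To make the idea concrete, here is a minimal sketch of one common FFT variant, the recursive radix-2 Cooley-Tukey algorithm. It is an illustration only, not necessarily the method used by any particular speech-analysis tool. The test signal (two sine tones at 5 and 12 Hertz, sampled at an assumed 64 Hertz) and the function name `fft` are hypothetical choices made for this example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 0 or (N & (N - 1)) != 0:
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        # Twiddle factor combines the two half-length transforms
        t = cmath.exp(-2j * math.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

fs = 64  # assumed sampling rate in Hz
N = 64   # one second of samples, so bin k corresponds to k Hz

# Hypothetical test signal: a 5 Hz tone plus a weaker 12 Hz tone
x = [
    math.sin(2 * math.pi * 5 * n / fs) + 0.5 * math.sin(2 * math.pi * 12 * n / fs)
    for n in range(N)
]

X = fft(x)
mags = [abs(v) for v in X[: N // 2]]  # non-negative frequencies only

# The two strongest bins identify the two tones
strongest = sorted(range(N // 2), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(strongest))
```

Under these assumptions the script prints `[5, 12]`. The two tones are tangled together in the time-domain samples but appear as two distinct peaks in the frequency domain, which is the property that makes this transformation attractive for speech waveforms.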
The Speech Spectrogram