m |
|||
(4 intermediate revisions by the same user not shown) | |||
Line 7: | Line 7: | ||
== Background == | == Background == | ||
− | The need for reliable voice recognition has become a universal goal ever since apple unveiled Siri 5 years ago. Even though initial version of voice assistant was sloppy people quickly realized the potential for a pocket personal assistant controlled mainly by your voice. With the improvement of the voice recognition technology, it's applications expanded from personal assistant to everything from voice dialing, appliance control, speech to text processing, data entry and security application to mention a few. I felt that an embedded application for voice password recognition would be a great addition to any security technology with the rise of IoT | + | The need for reliable voice recognition has become a universal goal ever since apple unveiled Siri 5 years ago. Even though initial version of voice assistant was sloppy people quickly realized the potential for a pocket personal assistant controlled mainly by your voice. With the improvement of the voice recognition technology, it's applications expanded from personal assistant to everything from voice dialing, appliance control, speech to text processing, data entry and security application to mention a few. I felt that an embedded application for voice password recognition would be a great addition to any security technology with the rise of IoT applications. |
+ | |||
+ | ---- | ||
+ | |||
+ | == Theory== | ||
The most common algorithm used by voice recognition systems is the Mel-Frequency Cepstral (MFC), but implementing this on a micro-controller would be very hard due to the complexity of the algorithm. I used the Cepstrum algorithm to process a recorded voice signal. It was much simpler than the MFC algorithm and the essence of the Cepstrum algorithm is given by this formula: | The most common algorithm used by voice recognition systems is the Mel-Frequency Cepstral (MFC), but implementing this on a micro-controller would be very hard due to the complexity of the algorithm. I used the Cepstrum algorithm to process a recorded voice signal. It was much simpler than the MFC algorithm and the essence of the Cepstrum algorithm is given by this formula: | ||
Line 13: | Line 17: | ||
Power Cepstrum of signal <math> =\left|{\mathcal {F}}^{-1}\left\{{\mbox{log}}(\left|{\mathcal {F}}\left\{f(t)\right\}\right|^{2})\right\}\right|^{2} </math> | Power Cepstrum of signal <math> =\left|{\mathcal {F}}^{-1}\left\{{\mbox{log}}(\left|{\mathcal {F}}\left\{f(t)\right\}\right|^{2})\right\}\right|^{2} </math> | ||
− | The power cepstrum is a unique signature of the signal. For a security application users will be able to program voice passwords into a system, which will be used as reference signals to compare the voice password used when the user wants to unlock the system. This comparison can be done using correlation coefficient and say that if there is an 70% correlation with the reference signal, then the system will be unlocked. | + | The power cepstrum is a unique signature of the signal and the similarity between two signals can be determined by comparing their power cepstrums. For a security application users will be able to program voice passwords into a system, which will be used as reference signals to compare the voice password used when the user wants to unlock the system. This comparison can be done using correlation coefficient and say that if there is an 70% correlation with the reference signal, then the system will be unlocked. |
---- | ---- | ||
== The code == | == The code == | ||
+ | The recursive function FFT uses the radix-2 algorithm to compute the FFT. Since C doesn't have the capability to deal with imaginary numbers, the real and imaginary coefficients of the twiddle factor are calculated using the polar form and are used to calculate the magnitude which is always a real number. The user would have a 2.5 second window to say a password and will get a result based on the match with the reference signal. The microphone samples audio at 20 kHz, but this would generate a lot of data points so only 400 Hz is used as the sampling frequency. This would call for 1024 point FFT if the audio is recorded for 2.56 seconds. | ||
+ | <source lang="c"> | ||
+ | int* FFT(int record[1024], int N, int s) | ||
+ | { | ||
+ | //DECLARATIONS | ||
+ | int A[N/2]; | ||
+ | int B[N/2]; | ||
+ | int *a; | ||
+ | int *b; | ||
+ | int out[N]; | ||
+ | int wn; | ||
− | < | + | //STATEMENTS |
+ | if (N != 2 && N != 0) | ||
+ | { | ||
+ | for(int k = 0; k < N; k= k+2) | ||
+ | { | ||
+ | A[k] = record[k]; | ||
+ | } | ||
− | </ | + | for(int k = 1; k < N; k= k+2) |
+ | { | ||
+ | B[k] = record[k]; | ||
+ | } | ||
+ | |||
+ | a = FFT(A,N/2,s); | ||
+ | b = FFT(A,N/2,s); | ||
+ | |||
+ | for(int k = 0; k < N/2; k= k+2) | ||
+ | { | ||
+ | wn = twiddle(N,k,s); | ||
+ | out[k] = a[k]+wn*b[k+N/2]; | ||
+ | } | ||
+ | for(int k = N/2; k < N; k= k+2) | ||
+ | { | ||
+ | wn = twiddle(N,k,s); | ||
+ | out[k] = a[k]-wn*b[k+N/2]; | ||
+ | } | ||
+ | |||
+ | return out; | ||
+ | |||
+ | } | ||
+ | else if(N ==2) | ||
+ | { | ||
+ | |||
+ | out[0] = record[0]+record[1]; | ||
+ | out[1] = record[0]-record[1]; | ||
+ | |||
+ | return out; | ||
+ | } | ||
+ | |||
+ | } | ||
+ | |||
+ | float twiddle(int N, int k, int s) | ||
+ | { | ||
+ | float out; | ||
+ | if ( s==1) | ||
+ | { | ||
+ | out = cos(12.5836*k/N); | ||
+ | } | ||
+ | else if( s==2) | ||
+ | { | ||
+ | out = -sin(12.5836*k/N); | ||
+ | } | ||
+ | |||
+ | else if( s==3) | ||
+ | { | ||
+ | out = sin(12.5836*k/N); | ||
+ | } | ||
+ | |||
+ | return out; | ||
+ | } | ||
+ | |||
+ | int* butterfly(int A[1024],int B[1024], int N, int s) | ||
+ | { | ||
+ | int out[N]; | ||
+ | float wn; | ||
+ | for(int x = 0; x < N/2; x++) | ||
+ | { | ||
+ | A[x] = A[x]; | ||
+ | } | ||
+ | for(int x = 0; x < N/2; x++) | ||
+ | { | ||
+ | B[x] = B[x]; | ||
+ | } | ||
+ | for(int r = 0; r < N/2; r++) | ||
+ | { | ||
+ | wn = twiddle(N,r,s); | ||
+ | out[r] = A[r]+wn*B[r+(N/2)]; | ||
+ | } | ||
+ | |||
+ | for(int y = N/2; y < N; y++) | ||
+ | { | ||
+ | wn = twiddle(N,y,s); | ||
+ | out[y] = A[y]-wn*B[y+(N/2)]; | ||
+ | } | ||
+ | |||
+ | return out; | ||
+ | } | ||
+ | |||
+ | int* decimation(int A[1024], int N, int l) | ||
+ | { | ||
+ | int out[N]; | ||
+ | for( int r = l; r < N; r = r+2) | ||
+ | { | ||
+ | out[r/2] = A[r]; | ||
+ | } | ||
+ | return out; | ||
+ | } | ||
+ | </source> | ||
---- | ---- | ||
− | == | + | == Conclusion == |
+ | The approximations used and the use of integer type instead of a double type to represent the audio reduced the maximum match obtained to about 40%. Increasing the sampling frequency and reducing the window of data collection might result in better accuracy of this implementation. The algorithm needs more testing and adjustments before it could be used more reliably | ||
+ | ---- | ||
== Sources == | == Sources == | ||
+ | |||
+ | http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html |
Latest revision as of 21:10, 23 April 2017
Recognizing Voice passwords using FFT
by Anuraag Rajasekhar
Contents
Background
The need for reliable voice recognition has become a universal goal ever since apple unveiled Siri 5 years ago. Even though initial version of voice assistant was sloppy people quickly realized the potential for a pocket personal assistant controlled mainly by your voice. With the improvement of the voice recognition technology, it's applications expanded from personal assistant to everything from voice dialing, appliance control, speech to text processing, data entry and security application to mention a few. I felt that an embedded application for voice password recognition would be a great addition to any security technology with the rise of IoT applications.
Theory
The most common algorithm used by voice recognition systems is the Mel-Frequency Cepstral (MFC), but implementing this on a micro-controller would be very hard due to the complexity of the algorithm. I used the Cepstrum algorithm to process a recorded voice signal. It was much simpler than the MFC algorithm and the essence of the Cepstrum algorithm is given by this formula:
Power Cepstrum of signal $ =\left|{\mathcal {F}}^{-1}\left\{{\mbox{log}}(\left|{\mathcal {F}}\left\{f(t)\right\}\right|^{2})\right\}\right|^{2} $
The power cepstrum is a unique signature of the signal and the similarity between two signals can be determined by comparing their power cepstrums. For a security application users will be able to program voice passwords into a system, which will be used as reference signals to compare the voice password used when the user wants to unlock the system. This comparison can be done using correlation coefficient and say that if there is an 70% correlation with the reference signal, then the system will be unlocked.
The code
The recursive function FFT uses the radix-2 algorithm to compute the FFT. Since C doesn't have the capability to deal with imaginary numbers, the real and imaginary coefficients of the twiddle factor are calculated using the polar form and are used to calculate the magnitude which is always a real number. The user would have a 2.5 second window to say a password and will get a result based on the match with the reference signal. The microphone samples audio at 20 kHz, but this would generate a lot of data points so only 400 Hz is used as the sampling frequency. This would call for 1024 point FFT if the audio is recorded for 2.56 seconds.
int* FFT(int record[1024], int N, int s) { //DECLARATIONS int A[N/2]; int B[N/2]; int *a; int *b; int out[N]; int wn; //STATEMENTS if (N != 2 && N != 0) { for(int k = 0; k < N; k= k+2) { A[k] = record[k]; } for(int k = 1; k < N; k= k+2) { B[k] = record[k]; } a = FFT(A,N/2,s); b = FFT(A,N/2,s); for(int k = 0; k < N/2; k= k+2) { wn = twiddle(N,k,s); out[k] = a[k]+wn*b[k+N/2]; } for(int k = N/2; k < N; k= k+2) { wn = twiddle(N,k,s); out[k] = a[k]-wn*b[k+N/2]; } return out; } else if(N ==2) { out[0] = record[0]+record[1]; out[1] = record[0]-record[1]; return out; } } float twiddle(int N, int k, int s) { float out; if ( s==1) { out = cos(12.5836*k/N); } else if( s==2) { out = -sin(12.5836*k/N); } else if( s==3) { out = sin(12.5836*k/N); } return out; } int* butterfly(int A[1024],int B[1024], int N, int s) { int out[N]; float wn; for(int x = 0; x < N/2; x++) { A[x] = A[x]; } for(int x = 0; x < N/2; x++) { B[x] = B[x]; } for(int r = 0; r < N/2; r++) { wn = twiddle(N,r,s); out[r] = A[r]+wn*B[r+(N/2)]; } for(int y = N/2; y < N; y++) { wn = twiddle(N,y,s); out[y] = A[y]-wn*B[y+(N/2)]; } return out; } int* decimation(int A[1024], int N, int l) { int out[N]; for( int r = l; r < N; r = r+2) { out[r/2] = A[r]; } return out; }
Conclusion
The approximations used and the use of integer type instead of a double type to represent the audio reduced the maximum match obtained to about 40%. Increasing the sampling frequency and reducing the window of data collection might result in better accuracy of this implementation. The algorithm needs more testing and adjustments before it could be used more reliably