Introduction to Optical Character Recognition(OCR)

1. Introduction

  • Optical character Recognition (OCR) serves as a tool to detect information from natural images and transfer them into machine-coded texts, such as words, symbols and numbers. It is still a hot ongoing search area and some novel algorithms are publishing from time to time. It is pretty interesting and essential to recognize the characters in the image because it could help greatly in some certain area: auto plate number recognition, books and documents scanning, assistive technology for blind and visually impaired users , zip-code recognition needed for post offices and much more.
  • In this page, I would like to introduce a basic and simple method to transfer typed alphabets or numbers into machine coded texts.

2. algorithm overview

  • OCR is a simple machine learning example. Machine learning is all about learning the patterns and features from the known data and then making predictions accordingly. It is powerful in pattern recognition field. One fundamental method of machine learning is using computational statistics. For example, the way we detect whether two sinusoidal signals are the same is to compare the maximum magnitude, period and relative phase. These three numbers are so-called computational statistics that we extract from the data. By comparing with the known sinusoidal signals, we can then predict or know what functions could be used to describe the unknown signals.
  • Most machine learning requires training and testing processing. For specific task, we will analyze and figure out a group of representative numbers that would well describe the data like the maximum magnitude of sinusoidal signals. The training process firstly computes all the statistical features of marked data which we know what they are. After saving all the values, we then compute all the statistics for the unknown data which we want to figure out what they are. By comparing the unknown data with the whole data set of the known data, we would predict the unknown data is the same as certain known data if their statistics are quite similar or even identical with other comparison. This is the testing processing: extracting the feature statistics and then compare values to classify the unknown data.
  • In some cases, we need pre-processing to transform the data in some way to get the statistical value, such as the Fourier Transformation if we want to get the formants from the audio signal.

3. algorithm assumption

  • The proposed algorithm is simple and easy to learn. The purpose of this project is to welcome talents like you to get involved with the recognition world.
  • We assume the input image has a clean background. The "clean" here means the contrast between the background and characters is high enough to detect. It works best when the background is white. It will enhance the success rate of the algorithm. The colored letters are intended to include in the project. However, if the returned results are weird, try to convert the image into gray scale. Because binarization helps us to calculate a relatively robust threshold value to classify the results.
  • We assume all the characters are lined horizontally. But they could be placed in several lines. In the project, the technique we used to segment each character requires this particular arrangement.
  • We assume all the characters are machine printed or similar to that. It is because the training data set contains only a particular letter style. So maybe some machine printed testing images would also return undesired results.
  • We assume all the test data contains characters whose font size is not less than 42 * 24 pixels. However, if that is small, You want to use the function 'imdilate' to increase its thickness before executing the program.

4. main steps of the algorithm

1. Segmentation

  • It is a pre-processing part. The preliminary step is to convert the image into binary number by 'im2bw'(MATLAB command). The white pixel returns 0 and black pixel returns 1 by the function.
  • After that, crop the image to fit the text. We find the minimum and maximum index of the picture that contains a text pixel vertically and horizontally. Then we can get the cropped image by the calculated index.
  • Then extract the whole characters line by line. We sum up each row to find the first one to be zero. And we then know the index of the row to trim. By doing this, we can separate the first line of texts with the remaining if there are several lines. If we repeat this procedure, we can successfully separate all the lines that contains texts.
  • Finally, we can trim each character in each line. To achieve this, we will use the function 'bwlabel'(MATLAB command). The function 'L = bwlabel(BW)' returns the label matrix L that contains labels for the 8-connected objects found in BW. The label matrix, L, is the same size as BW. Finally, we can extract every signal character from each line. We declare that all training data set of images are 42*24 pixels. So we try to manipulate the trimmed images into size of 42*24 to smoothly conduct the next step.
  • The code is from: http://www.mathworks.com/matlabcentral/fileexchange/8031-image-segmentation---extraction.

2. Classification

  • According to Tou and Gonzalez, “The principal function of a pattern recognition system is to yield decisions concerning the class membership of the patterns with which it is confronted.” For this project, the goal is to compare the image of each trimmed character with that of training data. Due to the fact that each image of characters is made up with numerical pixels, we just need to compare pixel by pixel to find the level of similarity. So here, we calculate the correlation between the unknown data with the training marked data in two dimensions. The formula below is used to find the value of correlation. The higher value means the better match.
  • $ r = \frac{ \sum_{m=1}^{M} \sum_{n=1}^{N} \left( A_{mn}-\bar{A} \right) \left( B_{mn}-\bar{B} \right) } {\sqrt{ \left( \sum_{m=1}^{M} \sum_\sum_{n=1}^{N} \left( A_{mn}-\bar{A} \right)^2 \right) \left( \sum_{m=1}^{M} \sum_{n=1}^{N} \left( B_{mn}-\bar{B} \right)^2 \right)}} $
  • Here, m represents the row of the matrix of images and n represents the column of the matrix of images. In our particular case, M = 42 and N = 24.
  • Then we find the maximum value of correlation within all the comparisons with training data. And the corresponding letter or number of the max correlation in training data is the recognized symbol for the each character.

5. Discussion

  • I tried several different images to go through the code and the performance is not good enough. Though I can get several perfect example, the majority examples have flaws. According to my comparison and analysis of all testing example, I think the different fonts and sizes of characters and the quality of image may affect the results. The training data set only contains one image for each character, which means the compared model cannot fully represent the class. So the different fronts and sizes of characters may return some weird results. And if the quality of image is bad, the value of correlation is not valid anymore, which will lead to wrong selection.
  • So the algorithm could be improved a lot. The training data could be increased so that it would enhance the accuracy of the results. Additionally, I would like to add some rules to distinguish the similar characters, such like I&L and T&I, as well as the capital and small letters.

6. References



Alumni Liaison

Prof. Math. Ohio State and Associate Dean
Outstanding Alumnus Purdue Math 2008

Jeff McNeal