Predictor of OCR accuracy using statistical techniques

Juan Manuel Gonzalez, University of Nevada, Las Vegas


Systems were developed that predict the optical character recognition (OCR) accuracy a given OCR system will achieve on an input image. Seven features associated with image defects were identified and used. Two kinds of nonparametric classification engines, one based on the nearest-neighbor rule and one based on a neural network, were implemented. The performance of these systems was compared with that of an older heuristic-based system using a cost model of a large-scale document conversion process and a test data set of 502 pages. The results show that the new classifiers performed better than the heuristic-based system, and that the neural network-based system outperformed the nearest-neighbor-based system. These new systems can be used to reduce the cost of a large-scale document conversion process by separating good-quality pages, which are sent to OCR, from degraded images, which are sent to manual data entry.
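
As an illustration of the nearest-neighbor rule named in the abstract, the following is a minimal sketch, not the implementation evaluated in this work. It assumes each page is summarized by a seven-element feature vector and that Euclidean distance is used; the feature values, the labels "ocr" and "manual", and the function names are all hypothetical.

```python
import math

def nearest_neighbor_label(training_set, query):
    """Return the label of the training vector closest to `query` (1-NN rule).

    Euclidean distance is an assumption made for this sketch.
    """
    best_label, best_distance = None, math.inf
    for features, label in training_set:
        distance = math.dist(features, query)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# Hypothetical training pages: seven defect-related features per page,
# labeled by whether the page should go to OCR or to manual data entry.
TRAINING_SET = [
    ((0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.2), "ocr"),
    ((0.2, 0.1, 0.1, 0.0, 0.1, 0.2, 0.1), "ocr"),
    ((0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9), "manual"),
    ((0.7, 0.9, 0.8, 0.6, 0.9, 0.7, 0.8), "manual"),
]

def route_page(page_features):
    """Route a page to 'ocr' or 'manual' using the 1-NN rule."""
    return nearest_neighbor_label(TRAINING_SET, page_features)
```

Calling `route_page` on a feature vector close to the first two training pages would route the page to OCR, while a vector close to the last two would route it to manual data entry.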