Effect of Ocr errors on short documents

Padma Inaparthy, University of Nevada, Las Vegas

Abstract

Presented in this thesis is a study of the effect of OCR errors on short documents. OCR recognizes and translates text image into ASCII format. When this data is retrieved in response to a query, the retrieval performance depends on the efficiency of the OCR device used. Measures like recall, precision and ranking were used to gauge the retrieval performance. The information retrieval system that was used is SMART, based on the vector space model. On evaluating these measures, it has been concluded that average precision and recall are not affected significantly when the OCR collection is compared to its corrected version. However, it was also concluded that with more complex weighting schemes, the relevant document rankings became more divergent. Also, the effect of an automatic post-processing system on the retrieval performance was studied.