Award Date
8-1-2013
Degree Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science
First Committee Member
Kazem Taghva
Second Committee Member
Laxmi P. Gewali
Third Committee Member
Ajoy K. Datta
Fourth Committee Member
Emma Regentova
Number of Pages
63
Abstract
In this thesis, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate.
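As a rough illustration of the candidate-selection step described above, the following minimal sketch ranks correction candidates by an LCS-based similarity weighted by a simple frequency prior. It is not the thesis's implementation: the function names `lcs_length` and `pick_candidate`, the scoring formula, and the candidate counts are all assumptions made for this example, and the counts stand in for whatever evidence a web search would supply.

```python
from typing import Dict, Optional


def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token: str, candidate_counts: Dict[str, int]) -> Optional[str]:
    """Pick the candidate maximizing (LCS similarity) x (frequency prior).

    candidate_counts maps each candidate word to a hypothetical count of
    how often it was observed; this scoring rule is an assumption, not
    the estimate used in the thesis.
    """
    total = sum(candidate_counts.values())
    if total == 0:
        return None  # no candidates or no evidence to rank them

    def score(word: str) -> float:
        # Normalized LCS similarity in [0, 1].
        similarity = 2 * lcs_length(ocr_token, word) / (len(ocr_token) + len(word))
        # Relative frequency used as a crude prior probability.
        prior = candidate_counts[word] / total
        return similarity * prior

    return max(candidate_counts, key=score)
```

For example, with the made-up counts `{"the": 90, "tube": 5, "toe": 5}`, `pick_candidate("tbe", ...)` returns `"the"`: although "tube" shares a longer subsequence with the OCR token, the much larger prior for "the" dominates the score.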
Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this thesis further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
Keywords
Big data; Errors – Prevention; OCR; Optical character recognition
Disciplines
Computer Sciences
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Agarwal, Shivam, "Utilizing Big Data in Identification and Correction of OCR Errors" (2013). UNLV Theses, Dissertations, Professional Papers, and Capstones. 1914.
http://dx.doi.org/10.34917/4797981
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/