Master of Science in Computer Science
First Committee Member
Second Committee Member
Laxmi P. Gewali
Third Committee Member
Ajoy K. Datta
Fourth Committee Member
Number of Pages
In this thesis, we report on our experiments in detecting and correcting OCR errors using web data. More specifically, we use Google search to draw on available big data resources and identify candidate corrections. We then use a combination of the Longest Common Subsequence (LCS) and Bayesian estimates to automatically select the proper candidate.
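As a rough illustration of the candidate-selection idea, the sketch below ranks correction candidates by an LCS-based similarity weighted by a simple frequency prior. This abstract does not state how the thesis combines LCS with Bayesian estimates, so the scoring formula, the function names `lcs_length` and `rank_candidates`, and the frequency counts are all assumptions made for illustration; no actual search results are used.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rank_candidates(ocr_token: str, candidates: dict) -> list:
    """Rank candidate words for an OCR token, best first.

    `candidates` maps each candidate word to a hypothetical frequency count
    (an assumption standing in for evidence gathered from web data).
    Score = normalized LCS similarity * relative frequency (a crude prior).
    """
    total = sum(candidates.values())
    if total == 0:
        return []
    scored = []
    for word, count in candidates.items():
        similarity = lcs_length(ocr_token, word) / max(len(ocr_token), len(word))
        prior = count / total
        scored.append((similarity * prior, word))
    scored.sort(reverse=True)
    return [word for _, word in scored]


# Hypothetical counts, chosen only to exercise the function.
ranking = rank_candidates("govemment", {"government": 900, "movement": 100})
print(ranking)
```

Normalizing the LCS length by the longer of the two strings keeps the similarity in the range 0 to 1, so it can be multiplied directly with the prior; other normalizations would be equally defensible.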
Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this thesis further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the
Big data; Errors – Prevention; OCR; Optical character recognition
Agarwal, Shivam, "Utilizing Big Data in Identification and Correction of OCR Errors" (2013). UNLV Theses, Dissertations, Professional Papers, and Capstones. 1914.