Award Date

8-1-2013

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi P. Gewali

Third Committee Member

Ajoy K. Datta

Fourth Committee Member

Emma Regentova

Number of Pages

63

Abstract

In this thesis, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate.

Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The work in this thesis further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
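
As a rough illustration of the LCS part of the candidate selection described in the abstract, the sketch below ranks correction candidates by their longest common subsequence with an OCR token. It is not the thesis's implementation: the function names `lcs_length` and `rank_candidates`, the normalization by the longer string's length, and the sample words are all assumptions made for this example, and the Google search and Bayesian estimation steps are left out.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)          # LCS lengths for the previous row of the table
    for ch_a in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rank_candidates(ocr_token: str, candidates: list[str]) -> list[str]:
    """Sort candidates from most to least similar to the OCR token.

    Assumed scoring (not from the thesis): LCS length divided by the
    length of the longer of the two strings.
    """
    def score(candidate: str) -> float:
        longest = max(len(ocr_token), len(candidate))
        return lcs_length(ocr_token, candidate) / longest if longest else 0.0

    return sorted(candidates, key=score, reverse=True)


# Hypothetical OCR error and hypothetical candidate list.
ranked = rank_candidates("newspapcr", ["paper", "newsprint", "newspaper"])
print(ranked[0])
```

In a pipeline like the one the abstract outlines, the candidate list would come from web search results, and a probabilistic estimate would be combined with the LCS score before a correction is accepted.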

Keywords

Big data; Errors – Prevention; OCR; Optical character recognition

Disciplines

Computer Sciences

Language

English

