Award Date

8-1-2013

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi P. Gewali

Third Committee Member

Ajoy K. Datta

Fourth Committee Member

Emma Regentova

Number of Pages

Abstract

In this thesis, we report on our experiments in detecting and correcting OCR errors with web data. More specifically, we use Google search to draw on the big data resources available on the web and identify possible candidates for correction. We then use a combination of the Longest Common Subsequence (LCS) and Bayesian estimates to automatically pick the proper candidate.

Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this thesis further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
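As a rough illustration of the LCS step mentioned in the abstract, the sketch below ranks correction candidates for a misrecognized token by their longest common subsequence with that token. It is not the thesis's implementation: the function names, the normalization by the longer string's length, and the sample candidates are assumptions made for this example, and the Bayesian estimate that the thesis combines with the LCS score is not reproduced here because the abstract does not describe it.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters match: extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two shorter prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rank_candidates(ocr_token: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidates by LCS length divided by the longer string's length.

    The normalization is an assumption of this sketch, not the thesis's formula.
    """
    scored = []
    for cand in candidates:
        denom = max(len(ocr_token), len(cand)) or 1
        scored.append((cand, lcs_length(ocr_token, cand) / denom))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates, standing in for suggestions gathered from web search.
ranking = rank_candidates("goverment", ["government", "governor", "movement"])
print(ranking[0][0])
```

In a pipeline like the one the abstract describes, a score of this kind would be only one input; the final choice would also depend on the Bayesian estimate for each candidate.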

Keywords

Big data; Errors – Prevention; OCR; Optical character recognition

Disciplines

Computer Sciences

File Format

pdf

Degree Grantor

University of Nevada, Las Vegas

Language

English

Repository Citation

Agarwal, Shivam, "Utilizing Big Data in Identification and Correction of OCR Errors" (2013). UNLV Theses, Dissertations, Professional Papers, and Capstones. 1914.
http://dx.doi.org/10.34917/4797981

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/
