Award Date

8-1-2019

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi Gewali

Third Committee Member

Jan Pedersen

Fourth Committee Member

Emma Regentova

Number of Pages

184

Abstract

Optical Character Recognition (OCR) Post Processing involves data cleaning steps for digitized documents, such as books or newspaper articles. One step in this process is the identification and correction of spelling and grammar errors introduced by flaws in the OCR system. This work reports on our efforts to enhance post processing for large repositories of documents.

The main contributions of this work are:

• Development of tools and methodologies to build the correspondence between OCR text and ground truth text for training and testing the proposed techniques in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.

• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.

• Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.

• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed.
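
The context-based correction named in the second contribution can be illustrated with a minimal sketch. It is not the dissertation's implementation, which the abstract does not describe. It assumes a tiny hand-made table of 3-gram counts standing in for Google Web 1T lookups, a small hypothetical vocabulary, and a plain edit-distance filter for generating candidates; the names TRIGRAM_COUNTS, VOCABULARY, and correct_token are invented for illustration.

```python
from collections import Counter

# Hypothetical toy counts standing in for Google Web 1T 3-gram lookups.
TRIGRAM_COUNTS = Counter({
    ("the", "modern", "world"): 120,
    ("the", "modem", "world"): 1,
    ("a", "modem", "connection"): 40,
})
# Hypothetical vocabulary used to generate replacement candidates.
VOCABULARY = {"the", "a", "modern", "modem", "world", "connection"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_token(left: str, token: str, right: str, max_distance: int = 2) -> str:
    """Return the nearby vocabulary word best supported by its two neighbors."""
    candidates = [w for w in VOCABULARY if edit_distance(token, w) <= max_distance]
    best = max(candidates, key=lambda w: TRIGRAM_COUNTS[(left, w, right)], default=token)
    # Keep the OCR token unchanged when no candidate has supporting context.
    return best if TRIGRAM_COUNTS[(left, best, right)] > 0 else token


if __name__ == "__main__":
    print(correct_token("the", "rnodern", "world"))    # modern
    print(correct_token("a", "modem", "connection"))   # modem
```

With these toy counts, the OCR confusion "rnodern" is replaced by "modern" because that word forms the most frequent 3-gram with its neighbors, while "modem" is left alone in a context that supports it. A token with no contextual support is returned unchanged rather than guessed at.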

Keywords

Google Web 1T; OCR; Optical Character Recognition; Post Processing; Reproducible Research; Text Alignment

Disciplines

Computer Sciences

File Format

pdf

Degree Grantor

University of Nevada, Las Vegas

Language

English

GOOGLE1T-9c4ebf5651c91.zip (110709 kB)
TREC5SQL-47e367291df01.zip (30811 kB)

Repository Citation

Fonseca Cacho, Jorge Ramon, "Improving OCR Post Processing with Machine Learning Tools" (2019). UNLV Theses, Dissertations, Professional Papers, and Capstones. 3722.
http://dx.doi.org/10.34917/16076262

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

Included in

Computer Sciences Commons
